DFlash: Block Diffusion for Flash Speculative Decoding
Jian Chen Yesheng Liang Zhijian Liu
Abstract
Autoregressive large language models (LLMs) achieve strong performance but require inherently sequential decoding, which leads to high inference latency and poor GPU utilization. Speculative decoding alleviates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, yet current diffusion models typically underperform their autoregressive counterparts. In this paper, we introduce DFlash, a speculative decoding framework that uses a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash delivers over 6× lossless speedup across a wide range of models and tasks, achieving up to 2.5× higher speedup than EAGLE-3, the current state of the art in speculative decoding.
One-sentence Summary
Jian Chen, Yesheng Liang, and Zhijian Liu from Z-Lab introduce DFlash, a speculative decoding framework that uses a lightweight block diffusion model for parallel drafting, achieving over 6× lossless acceleration and outperforming EAGLE-3 by up to 2.5× via context-conditioned, single-pass generation.
Key Contributions
- DFlash introduces a speculative decoding framework that uses a lightweight block diffusion model to draft tokens in parallel, overcoming the sequential bottleneck of autoregressive drafting while maintaining lossless output quality.
- It conditions the draft model on context features extracted from the target LLM, enabling high-acceptance-rate drafts through a single forward pass, effectively turning the draft model into a diffusion adapter that leverages the target’s reasoning.
- Evaluated across multiple models and benchmarks including Qwen3-8B, DFlash achieves over 6× acceleration and up to 2.5× higher speedup than EAGLE-3, demonstrating superior efficiency under realistic serving conditions.
Introduction
The authors use speculative decoding to accelerate autoregressive LLM inference, replacing the traditional sequential draft model with a lightweight block diffusion model. Prior speculative methods such as EAGLE-3 still rely on autoregressive drafting, which limits speedups to roughly 2–3×, while standalone diffusion LLMs suffer from quality degradation or require many denoising steps. DFlash sidesteps both issues by conditioning the diffusion drafter on the target model’s hidden features, allowing high-quality draft blocks to be generated in parallel in a single forward pass. This yields up to 6× lossless speedup and up to 2.5× higher speedup than EAGLE-3, while keeping memory and compute overhead low.
Method
The authors propose a diffusion-based speculative decoding framework called DFlash, which decouples drafting latency from the number of drafted tokens by performing parallel block-wise generation. This architecture enables the use of deeper, more expressive draft models without incurring the linear latency penalty typical of autoregressive drafters. The core innovation lies in the tight integration of target model context features into the draft model’s internal state via persistent Key-Value (KV) cache injection, ensuring consistent conditioning across all draft layers.
During inference, the target model first performs a prefill pass to generate the initial token. From this pass, hidden representations are extracted from a stratified set of layers—spanning shallow to deep—and fused through a lightweight projection layer into a compact context feature vector. This fused feature is then injected into the Key and Value projections of every draft layer, where it is stored in the draft model’s KV cache and reused across all drafting iterations. This design ensures that contextual information from the target model remains strong and undiluted throughout the draft network, enabling acceptance length to scale with draft depth. As shown in the framework diagram, the draft model processes a sequence composed of target decode tokens (yellow), mask tokens (green), and fused target context features (blue), with bidirectional attention within each block and conditioning via the injected KV cache.
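To make this conditioning mechanism concrete, below is a minimal PyTorch sketch of the two steps described above: fusing hidden states tapped from several target layers into a compact context feature, and writing that feature into a draft layer’s KV cache. The module and attribute names (`ContextFeatureFusion`, `k_proj`, `v_proj`) are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn


class ContextFeatureFusion(nn.Module):
    """Fuse hidden states tapped from a stratified set of target-model layers
    into one compact context feature per position (illustrative sketch)."""

    def __init__(self, num_tap_layers: int, target_dim: int, draft_dim: int):
        super().__init__()
        # Lightweight projection from the concatenated target features
        # down to the draft model's hidden width.
        self.proj = nn.Linear(num_tap_layers * target_dim, draft_dim)

    def forward(self, tapped_hidden_states):
        # tapped_hidden_states: list of [batch, seq, target_dim] tensors,
        # one per tapped layer (shallow to deep).
        fused = torch.cat(tapped_hidden_states, dim=-1)
        return self.proj(fused)  # [batch, seq, draft_dim]


def inject_context_kv(draft_layer, context_features, kv_cache):
    """Run the fused context features through a draft layer's K/V projections
    and store the result in that layer's KV cache, so every drafting
    iteration can attend to the fixed target context without recomputation."""
    k = draft_layer.self_attn.k_proj(context_features)
    v = draft_layer.self_attn.v_proj(context_features)
    kv_cache.append((k, v))  # persists across all drafting iterations
    return kv_cache
```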

Training is structured to mirror inference behavior. The draft model is trained to denoise masked blocks conditioned on the frozen target model’s hidden features. Rather than uniformly partitioning the response into blocks, DFlash randomly samples anchor tokens from the response and constructs each block to begin with an anchor, masking the subsequent positions. This mimics the speculative decoding process, where the draft model always conditions on a clean token produced by the target model. All blocks are concatenated into a single sequence and processed jointly using a sparse attention mask that enforces causal consistency: tokens attend bidirectionally within their own block and to the injected target context features, but not across blocks. This allows efficient multi-block training in a single forward pass.
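The sketch below illustrates one way the anchor-based block construction and sparse attention mask could be realized, assuming a simple boolean mask layout with the injected context features occupying the first `num_ctx` positions; all names and shapes are hypothetical and not taken from the paper’s code.

```python
import torch


def build_training_blocks_and_mask(response_len: int, num_blocks: int,
                                   block_size: int, num_ctx: int,
                                   generator=None):
    """Sketch of DFlash-style training inputs: sample anchor positions from
    the response, let each block start with its clean anchor token followed
    by masked positions, and build a sparse attention mask where tokens
    attend bidirectionally within their own block and to the injected
    target-context slots, but never across blocks."""
    # Randomly sample anchor positions so each block starts on a clean token,
    # mimicking speculative decoding where the anchor comes from the target.
    anchors = torch.randperm(response_len, generator=generator)[:num_blocks]

    total = num_ctx + num_blocks * block_size
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Every position may attend to the injected target context features.
    mask[:, :num_ctx] = True

    # Bidirectional attention inside each block; no attention across blocks.
    for b in range(num_blocks):
        start = num_ctx + b * block_size
        end = start + block_size
        mask[start:end, start:end] = True

    return anchors.sort().values, mask
```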

To accelerate convergence and prioritize early token accuracy—which disproportionately affects acceptance length—the authors apply an exponentially decaying loss weight per token position k within a block:
$$w_k = \exp\bigl(-\gamma\,(k-1)\bigr),$$
where $\gamma$ controls the decay rate. Additionally, the draft model shares its token embedding and language modeling head with the target model, keeping them frozen during training. This reduces the number of trainable parameters and aligns the draft model’s representation space with the target’s, effectively turning it into a lightweight diffusion adapter. For long-context training, the number of masked blocks per sequence is fixed, and anchor positions are resampled each epoch, enabling efficient data augmentation without unbounded computational cost.
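A short sketch of how this position-dependent loss weighting could be applied on top of per-token cross-entropy is given below; the tensor layout and the value of `gamma` are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def position_weighted_draft_loss(logits, targets, block_size: int,
                                 gamma: float = 0.5):
    """Apply w_k = exp(-gamma * (k - 1)) to the per-token cross-entropy,
    where k = 1..block_size is the position inside a block. Assumes logits
    of shape [batch, num_blocks * block_size, vocab] and targets of shape
    [batch, num_blocks * block_size]; gamma = 0.5 is illustrative only."""
    seq_len = logits.shape[1]
    k_minus_1 = torch.arange(block_size, device=logits.device,
                             dtype=torch.float32)
    weights = torch.exp(-gamma * k_minus_1)          # [block_size]
    weights = weights.repeat(seq_len // block_size)  # tile over blocks -> [seq]

    # Per-token cross-entropy, kept unreduced: [batch, seq]
    per_token = F.cross_entropy(logits.transpose(1, 2), targets,
                                reduction="none")

    # Weighted mean per sample, then averaged over the batch.
    return ((per_token * weights).sum(dim=1) / weights.sum()).mean()
```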
Experiment
- DFlash consistently outperforms EAGLE-3 across math, code, and chat tasks, achieving up to 5.1× speedup over autoregressive decoding while maintaining higher acceptance length with lower verification overhead.
- It demonstrates strong efficiency gains in reasoning mode and real-world serving scenarios via SGLang, reducing deployment cost without sacrificing output quality.
- Ablation studies confirm that DFlash’s performance is robust to training data choices and benefits from deeper draft models (5–8 layers) and more target hidden features, though optimal layer count balances drafting cost and quality.
- Block size significantly impacts performance: models trained on larger blocks generalize well to smaller inference blocks, enabling potential dynamic scheduling for efficiency.
- Training enhancements—such as position-dependent loss decay and random anchor sampling—improve convergence and acceptance length.
- Target context features are critical: without them, the draft model yields only modest speedups, underscoring their role in effective diffusion-based drafting.
The authors use DFlash to accelerate inference across math, code, and chat tasks, achieving higher speedups and acceptance lengths when sampling is enabled compared to standard greedy decoding. Results show that sampling improves both efficiency and draft quality, particularly on Math500 and HumanEval, indicating that stochastic generation enhances speculative drafting performance. This suggests that incorporating sampling during inference can meaningfully boost the practical throughput of diffusion-based speculative decoding.

The authors evaluate DFlash draft models under varying training and inference block sizes, finding that models trained with larger block sizes (e.g., 16) achieve higher acceptance lengths and speedups, especially when inference block size matches training size. Models trained on larger blocks also generalize well to smaller inference block sizes, but not vice versa, suggesting flexibility in deployment settings. This behavior supports dynamic block-size scheduling to optimize efficiency under different computational constraints.

The authors evaluate DFlash on math reasoning tasks using LLaMA-3.1-8B-Instruct under greedy and non-greedy decoding, showing consistent speedups over the autoregressive baseline across GSM8K, Math500, AIME24, and AIME25. Results indicate that while higher temperature reduces both speedup and acceptance length, DFlash maintains strong performance, demonstrating robustness to sampling variability. These findings highlight the method’s effectiveness in accelerating complex reasoning tasks without sacrificing output quality.

The authors evaluate draft models with varying numbers of layers and find that a 5-layer configuration achieves the best balance between drafting quality and computational efficiency, yielding the highest overall speedup. While deeper models (8 layers) produce longer accepted sequences, their increased drafting cost reduces net performance gains. The results highlight that optimal draft model depth depends on the trade-off between acceptance length and inference latency.

The authors use DFlash to accelerate inference on LLaMA-3.1-8B-Instruct across multiple tasks and concurrency levels, comparing it against EAGLE-3 with varying tree sizes. Results show DFlash consistently achieves higher speedups and better acceptance lengths than both EAGLE-3 configurations, particularly at lower concurrency, demonstrating its efficiency and robustness in practical deployment settings.
