Command Palette
Search for a command to run...
WeDLM: 빠른 추론을 위한 확산 언어 모델과 표준 인과 주의의 조화
WeDLM: 빠른 추론을 위한 확산 언어 모델과 표준 인과 주의의 조화
Aiwei Liu Minghua He Shaoxun Zeng Sijun Zhang Linhao Zhang Chuhan Wu Wei Jia Yuan Liu Xiao Zhou Jie Zhou
WeDLM 고효율 대규모 언어 모델 디코딩 프레임워크
초록
자기회귀(Autoregressive, AR) 생성은 대규모 언어 모델(Large Language Models, LLMs)의 표준 디코딩 패러다임이지만, 토큰 단위로 순차적으로 생성하는 특성상 추론 시 병렬 처리가 제한된다. 확산 언어 모델(Diffusion Language Models, DLLMs)은 한 단계에서 다수의 마스킹된 토큰을 동시에 복원함으로써 병렬 디코딩을 가능하게 하지만, 실제로는 최적화된 AR 엔진(예: vLLM)에 비해 배포 성능 향상으로 이어지지 않는 경우가 많다. 주요 원인 중 하나는 많은 DLLMs가 양방향 어텐션(bidirectional attention)에 의존하기 때문에 표준 전후고정(KV 캐싱)을 깨뜨리고 반복적인 문맥 재구성(repeated contextualization)을 유도하여 효율성을 저해하기 때문이다. 본 연구에서는 표준 인과적 어텐션(causal attention)만을 사용하여 병렬 생성이 가능한 전후고정 캐싱 친화적인 확산 디코딩 프레임워크인 WeDLM을 제안한다. 핵심 아이디어는 각 마스킹된 위치가 현재 관측된 모든 토큰에 조건부로 의존하면서도 엄격한 인과적 마스크(causal mask)를 유지하도록 하는 것으로, 이를 위해 관측된 토큰을 물리적 접두사(physical prefix)로 이동시키되 논리적 위치는 유지하는 '위계적 재정렬(Topological Reordering)' 기법을 도입한다. 이 성질을 기반으로, 신뢰도가 높은 토큰을 지속적으로 증가하는 좌측에서 우측으로의 접두사에 지속적으로 기록하고 고정된 병렬 작업 부하를 유지하는 스트리밍 디코딩 절차를 제안한다. 이는 블록 기반 확산 방법에서 흔히 나타나는 정지-대기(stop-and-wait) 동작을 피한다. 실험 결과, WeDLM은 강력한 AR 기반 모델의 품질을 유지하면서도 상당한 속도 향상을 제공함을 보였다. 특히 어려운 추론 벤치마크에서는 최대 3배, 낮은 엔트로피 생성 환경에서는 최대 10배의 성능 향상을 달성하였다. 특히 중요한 점은, 우리의 비교 실험은 vLLM을 통해 배포된 AR 기반 모델과 동일한 배포 환경에서 수행되었으며, 이는 확산형 디코딩이 실제 최적화된 AR 엔진을 능가할 수 있음을 입증한다.
One-sentence Summary
Researchers from WeChat AI, Tencent, Peking University, and Tsinghua University propose WeDLM, a diffusion-based decoder using causal attention and topological reordering to enable prefix-caching and streaming decoding, achieving up to 10× speedups over vLLM-optimized AR models while preserving quality.
Key Contributions
- WeDLM introduces a diffusion decoding framework using only causal attention, enabling prefix KV caching by reordering observed tokens to the physical prefix while preserving their logical positions via Topological Reordering, thus avoiding the repeated contextualization that plagues bidirectional diffusion models.
- It implements a streaming decoding procedure that incrementally commits confident tokens into a growing left-to-right prefix and maintains fixed parallel workload per step, eliminating the stop-and-wait inefficiency common in block diffusion methods and aligning with optimized AR inference engines.
- Experiments show WeDLM matches strong AR baselines in quality while achieving up to 3× speedup on reasoning tasks and 10× in low-entropy regimes, with direct comparisons against vLLM-optimized AR models under identical deployment conditions, proving practical gains over state-of-the-art AR serving.
Introduction
The authors leverage diffusion language models to enable fast, parallel inference while preserving standard causal attention—avoiding the bidirectional attention typically used in masked diffusion models. Prior work relied on bidirectional context for mask recovery, which impedes efficient decoding and complicates integration with existing autoregressive infrastructure. The authors’ main contribution is a novel framework, WeDLM, that achieves parallel decoding under causal attention by enforcing two algorithmic principles: topological reordering of tokens and position-aware masking, enabling compatibility with KV caching and existing AR systems without architectural overhaul.
Method
The authors leverage a novel diffusion decoding framework called WeDLM, which is designed to reconcile parallel token generation with the efficiency constraints of industrial-grade autoregressive inference engines. The core innovation lies in enforcing strict causal attention throughout both training and inference, thereby enabling seamless integration with standard KV caching mechanisms such as FlashAttention and PagedAttention. This is achieved through two key components: Topological Reordering for training and Streaming Parallel Decoding for inference.
During training, WeDLM employs Topological Reordering to ensure that masked tokens can access the full context of observed tokens while operating under a standard causal mask. As shown in the figure below, the input sequence is first masked at random positions. Then, all observed tokens are physically moved to the front of the sequence, while masked tokens are placed at the end. Crucially, logical position IDs (e.g., via RoPE) are preserved, allowing attention scores to reflect true positional relationships rather than physical indices. This reordering enables each masked token to attend to all observed tokens under a causal mask, satisfying the requirement for global context visibility without bidirectional attention.
To bridge the training-inference gap induced by prefix-conditioned decoding, the authors introduce Dual-Stream Masking. This strategy constructs two parallel streams: a clean Memory Stream and a masked Prediction Stream, both sharing the same logical position IDs. The Prediction Stream is partitioned into blocks, each of which undergoes intra-block topological reordering. The attention mask is carefully designed so that tokens in the Prediction Stream can attend to clean context from the Memory Stream for preceding blocks, while within a block, attention follows standard causal masking. This mimics the inference setting where earlier tokens are already committed and serve as clean context, reducing distributional mismatch.
For inference, WeDLM implements Streaming Parallel Decoding, a procedure that continuously commits confident tokens into a growing left-to-right prefix while maintaining a fixed parallel workload. The algorithm operates on a sliding window of fixed size W, containing a mix of filled (predicted) and masked tokens. At each step, the window is reordered to place filled tokens before masks, preserving logical positions. A causal forward pass is then executed over the window, conditioned on the persistent KV cache. The leftmost contiguous filled prefix is immediately committed to the cache, as its KV states depend only on earlier tokens under the causal mask. New mask tokens are appended to refill the window, avoiding the stop-and-wait behavior of block-wise methods.
Refer to the framework diagram comparing block decoding with WeDLM’s streaming approach. Block decoding requires the entire block to be finalized before any token can be cached, leading to pipeline bubbles. In contrast, WeDLM’s streaming method commits tokens as soon as they are resolved, enabling continuous parallel prediction. The authors further enhance left-to-right growth by applying a distance-penalized entropy selection rule, which prioritizes earlier positions for prediction, thereby maximizing prefix cacheability pcache.
Experiment
- WeDLM consistently matches or exceeds its autoregressive base models in generation quality, especially on reasoning and code tasks, while significantly outperforming prior diffusion language models.
- Instruct-tuned WeDLM models show strong gains over AR baselines on complex reasoning and coding benchmarks, confirming compatibility with instruction tuning and potential for performance amplification.
- Inference efficiency is highly tunable: streaming decoding and left-position biasing improve speed via better prefix caching, with flexible trade-offs between accuracy and throughput.
- Ablation studies reveal robustness to block size, superiority of causal intra-block attention for caching and performance, and stronger adaptation gains in larger base models.
- Generation speed varies sharply by task entropy: deterministic or structured tasks enable 8x+ speedups, while high-entropy open-ended generation shows limited acceleration, highlighting a key area for future improvement.
The authors use WeDLM to enhance base autoregressive models, achieving higher average scores than both their AR counterparts and prior diffusion language models across reasoning, math, and code benchmarks. Results show consistent gains in math and code tasks, with WeDLM-8B outperforming all compared models in several high-difficulty categories, while maintaining competitive performance in general reasoning. The method demonstrates that diffusion-style training can preserve and even improve upon the capabilities of strong AR checkpoints without compromising inference efficiency.

The authors use WeDLM to enhance autoregressive base models through diffusion-style training and parallel decoding, achieving consistent gains over both AR and diffusion baselines on reasoning and code generation tasks. Results show that WeDLM-8B-Instruct outperforms its AR counterpart and all compared diffusion models, particularly on challenging benchmarks like GPQA-Diamond and HumanEval, while maintaining flexibility in speed-accuracy trade-offs during inference. The method demonstrates that diffusion objectives can complement instruction tuning without compromising performance, especially when starting from strong base models.
