ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li

Abstract

Autoregressive models (ARMs) are limited by slow, sequential inference. Masked diffusion models (MDMs) offer a parallel alternative but suffer from two critical drawbacks: high computational cost, because they cannot exploit key-value (KV) caching, and incoherent generations, because they must learn dependencies over an intractable space of token combinations. To address these challenges, this work proposes ReFusion, a novel masked diffusion model that performs parallel decoding at the level of slots (fixed-length, contiguous sub-sequences) rather than individual tokens. ReFusion adopts an iterative plan-and-infill inference procedure: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes the selected slots in parallel. The slot-based design enables full KV cache reuse under a unified causal framework while reducing learning complexity from the intractable space of token combinations to the tractable space of slot-level permutations. Extensive experiments on seven diverse benchmarks show that ReFusion achieves an average 34% performance gain and more than an 18× speedup over prior MDMs, and approaches the performance of strong ARMs while maintaining a 2.33× average speedup.

One-sentence Summary

The authors from Renmin University of China and Ant Group propose ReFusion, a novel masked diffusion model that achieves efficient parallel decoding by operating at the slot level—fixed-length token subsequences—via an iterative plan-and-infill framework. This design enables full KV cache reuse and reduces learning complexity, outperforming prior MDMs by 34% with over 18× speedup and matching strong autoregressive models while maintaining a 2.33× average speed advantage across seven benchmarks.

Key Contributions

  • REFUSION addresses the fundamental inefficiency of autoregressive models (ARMs) and the coherence issues in masked diffusion models (MDMs) by introducing a slot-level parallel decoding framework, where fixed-length contiguous sub-sequences (slots) replace individual tokens as the unit of parallel generation, enabling both high throughput and coherent output.
  • The model employs an iterative "plan-and-infill" process: a diffusion-based planning step identifies weakly dependent slots, followed by an autoregressive infilling step that decodes them in parallel, allowing full KV cache reuse through a causal attention mechanism and reducing learning complexity from exponential token combinations to manageable slot permutations.
  • On seven diverse benchmarks including math, code, and reasoning tasks, REFUSION achieves 34% higher performance than prior MDMs like LLaDA and Dream, with over 18× faster throughput, and surpasses strong ARMs like Qwen3-8B by 3.68 absolute points on GSM8K and MBPP while maintaining a 2.33× average speedup.

Introduction

The authors leverage masked diffusion models (MDMs) to overcome the sequential decoding bottleneck of autoregressive models (ARMs), which limits inference throughput despite strong performance. While MDMs enable parallel token generation through iterative denoising and conditional independence assumptions, prior approaches face two key challenges: architectural incompatibility with KV caching due to bidirectional attention, leading to high latency, and incoherent outputs from failing to model complex token dependencies, especially for nearby tokens. To address these, the authors introduce ReFusion, a novel diffusion-based LLM that performs parallel decoding at the slot level—grouping tokens into fixed-length sub-sequences—using a two-step process: a diffusion-based planning step identifies weakly dependent slots, followed by autoregressive infilling. This design preserves causal attention for efficient KV caching and reduces learning complexity by shifting from an intractable token combination space to a manageable slot permutation space. ReFusion’s hybrid training objective uses both denoising and autoregressive losses across all tokens, improving data efficiency. Experiments show ReFusion achieves 34% higher average performance than prior MDMs while being over 18× faster, and it surpasses strong ARMs like Qwen3-8B in both accuracy and speed, pushing the performance-efficiency frontier.

Method

The authors leverage a novel slot-based architecture to address the inefficiency and incoherence challenges inherent in traditional masked diffusion models (MDMs). The core of this approach is an iterative "plan-and-infill" decoding process that operates at the slot level, where a slot is defined as a fixed-length, contiguous sub-sequence of tokens. This design elevates parallel decoding from the token level to a higher, more manageable level, enabling a unified causal framework that supports both global generation flexibility and full key-value (KV) cache reuse.
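To make the slot abstraction concrete, the following is a minimal sketch (the function name, slot length, and handling of the final short slot are illustrative choices, not specified by the authors):

```python
from typing import List


def partition_into_slots(token_ids: List[int], slot_len: int) -> List[List[int]]:
    """Split a token sequence into fixed-length, contiguous sub-sequences (slots).

    The final slot may be shorter if the sequence length is not a multiple of
    slot_len; how that boundary case is handled in practice is not specified here.
    """
    return [token_ids[i:i + slot_len] for i in range(0, len(token_ids), slot_len)]


# Example: a 10-token response split into slots of length 4.
print(partition_into_slots(list(range(100, 110)), slot_len=4))
# [[100, 101, 102, 103], [104, 105, 106, 107], [108, 109]]
```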

The overall framework consists of two primary phases: inference and training. During inference, the process begins with a prompt and a fully masked response sequence, which is partitioned into a series of slots. Decoding then proceeds iteratively through two synergistic steps. First, a diffusion-based planning step generates draft tokens for all masked slots in parallel. This step leverages the model's ability to predict from a partially masked context to create a speculative guess for each slot. The model then scores these draft slots with a certainty metric, such as the probability of the most likely token at the first position of the slot. A batch of high-confidence slots, those exceeding a predefined threshold, is selected for the next phase.

Second, an autoregressive infilling step decodes the selected slots in parallel. This step ensures local coherence by using the model's autoregressive capability to verify and complete the draft slots. To accelerate this process, a speculative decoding strategy is employed: the model first performs a global verification on the concatenated draft slots. If a long enough prefix of tokens is verified, the corresponding slots are accepted wholesale, bypassing costly suffix completion. Otherwise, a parallel iterative completion process refines each selected slot independently until it is fully completed.

After each iteration, the newly completed slots are moved to the front of the remaining masked slots, a reordering that enables full KV cache reuse for all decoded tokens. This reordering is possible because the model uses consistent, ground-truth position IDs for all tokens, which are invariant to their physical position in the input buffer. By applying RoPE to these absolute position IDs, the model correctly computes relative distances and attends to all logical predecessors, maintaining sequence coherence despite the non-sequential input order.
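The loop described above can be summarized in the following sketch. The model interface (`draft_all_masked_slots`, `slot_certainty`, `verify_prefix`, `complete_slot`), the `Slot` bookkeeping, and the selection rule are assumptions made for illustration; they are not the authors' actual API or implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Slot:
    position_ids: List[int]               # absolute (ground-truth) positions, fed to RoPE
    tokens: List[int] = field(default_factory=list)
    done: bool = False


def plan_and_infill(model, prompt_ids: List[int], slots: List[Slot], tau_slot: float) -> List[int]:
    """Illustrative rendering of the iterative plan-and-infill decoding loop."""
    while any(not s.done for s in slots):
        done = [s for s in slots if s.done]        # completed slots sit at the front
        masked = [s for s in slots if not s.done]  # remaining masked slots follow

        # Planning step: draft every masked slot in parallel, keep the confident ones.
        drafts = model.draft_all_masked_slots(prompt_ids, done, masked)
        selected = [(s, d) for s, d in zip(masked, drafts)
                    if model.slot_certainty(d) >= tau_slot]
        if not selected:  # always decode at least one slot per iteration
            selected = [max(zip(masked, drafts),
                            key=lambda pair: model.slot_certainty(pair[1]))]

        # Infilling step: speculative verification over the concatenated drafts,
        # then parallel completion of whatever the verified prefix did not cover.
        n_verified = model.verify_prefix([d for _, d in selected])
        for i, (slot, draft) in enumerate(selected):
            covered = (i + 1) * len(draft) <= n_verified  # drafts share a fixed length
            slot.tokens = draft if covered else model.complete_slot(draft)
            slot.done = True

        # Newly completed slots move to the front of the buffer. Their KV entries
        # remain valid because attention uses absolute position_ids, not buffer order.
        slots.sort(key=lambda s: not s.done)

    # Reassemble the response in logical (position) order, not buffer order.
    return [t for s in sorted(slots, key=lambda s: s.position_ids[0]) for t in s.tokens]
```

In this rendering, the number of slots decoded per iteration is simply the number whose certainty clears `tau_slot`, which is how a single threshold governs the degree of parallelism.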

The training process is meticulously designed to mirror the dynamics of the inference algorithm, ensuring the model learns both planning and infilling capabilities. The training data is constructed from prompt-response pairs by first partitioning the response into a sequence of slots. A corrupted version of this sequence is then created by randomly masking a subset of these slots. Crucially, the unmasked (clean) slots are randomly permuted to simulate the arbitrary generation order encountered during inference. The final training instance is assembled by concatenating the permuted clean slots followed by the masked slots. This data construction strategy ensures the model learns to process context in any arbitrary permutation. The training objective is a hybrid of two losses. The clean slots are trained with a standard autoregressive loss, which optimizes the model for next-token prediction. The masked slots are trained with a denoising loss, which optimizes the model for reconstructing the original tokens from a masked context. The final objective is a weighted sum of these two losses, allowing the model to learn both the global planning and local decoding capabilities required for the "plan-and-infill" process.
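A minimal sketch of this data construction is shown below, assuming response-relative position ids and a hypothetical `MASK_ID`; the masking ratio, slot length, and loss weighting are placeholders rather than the authors' settings.

```python
import random
from typing import List, Tuple

MASK_ID = 0  # hypothetical mask-token id; the real vocabulary index is model-specific


def build_training_instance(response_ids: List[int], slot_len: int,
                            mask_ratio: float) -> Tuple[List[int], List[int], List[bool]]:
    """Construct one training instance: permuted clean slots followed by masked slots.

    Returns the reordered tokens, their absolute (response-relative) position ids,
    and a per-token flag marking masked positions (denoising loss) versus clean
    positions (autoregressive loss).
    """
    # 1. Partition the response into fixed-length slots, remembering positions.
    slots = [(list(range(i, min(i + slot_len, len(response_ids)))),
              response_ids[i:i + slot_len])
             for i in range(0, len(response_ids), slot_len)]

    # 2. Randomly mask a subset of slots.
    masked_idx = set(random.sample(range(len(slots)),
                                   max(1, int(mask_ratio * len(slots)))))
    clean = [slots[i] for i in range(len(slots)) if i not in masked_idx]
    masked = [slots[i] for i in sorted(masked_idx)]

    # 3. Permute the clean slots to simulate arbitrary generation order at inference.
    random.shuffle(clean)

    # 4. Concatenate: permuted clean slots first, then masked slots as mask tokens.
    tokens, positions, is_masked = [], [], []
    for pos, toks in clean:
        tokens.extend(toks)
        positions.extend(pos)
        is_masked.extend([False] * len(toks))
    for pos, toks in masked:
        tokens.extend([MASK_ID] * len(toks))
        positions.extend(pos)
        is_masked.extend([True] * len(toks))
    return tokens, positions, is_masked
```

During training, positions flagged clean would receive the next-token (autoregressive) loss and positions flagged masked the denoising loss, combined as a weighted sum such as L = L_AR + λ·L_denoise, with the weight λ left unspecified here.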

Experiment

  • REFUSION is evaluated on seven benchmarks: MMLU-Pro, ARC-C, GSM8K, MATH, GPQA, HumanEval, and MBPP, using pass@1 for code and accuracy for others, with inference throughput measured in tokens per second (TPS) on a single A100 GPU.
  • On HumanEval, REFUSION achieves 78.66% pass@1, surpassing the next-best MDM (Dream-7B-Instruct) by 22 points, and reaches 92.09 TPS on MBPP, 1.4× faster than the next-fastest MDM.
  • REFUSION outperforms all MDM baselines in both performance and throughput, and challenges strong ARMs: it achieves 2.33× average speedup over Qwen3-8B while exceeding it by 3.68 points on GSM8K and MBPP.
  • In controlled comparisons, REFUSION retrained on a 120K subset outperforms retrained Qwen3-8B by 16 points on HumanEval and is 1.9× faster, demonstrating architectural superiority independent of data or backbone advantages.
  • When compared to Dream-7B-Instruct on its native Qwen2.5-7B backbone—despite Dream’s massive pre-training—REFUSION achieves a 2.23% average performance gain and 11.05× speedup, excelling on reasoning and coding tasks.
  • Ablation studies confirm that REFUSION's KV cache reuse strategy boosts throughput by 1.16–1.33× with no performance loss (and even slight gains, attributed to reduced error propagation).
  • Hyperparameter analysis identifies a wide "sweet spot" where REFUSION surpasses Qwen3-8B in both performance and TPS: τ_slot ∈ [0.5, 1.0], τ_token ∈ [0.1, 0.9], k ∈ {8, 32}, and b ∈ [32, 128].
  • REFUSION shows strong scaling with data: throughput increases from 51 TPS (120K samples) to over 81 TPS (14M samples), and performance also improves substantially, though non-monotonically owing to the fixed number of training epochs.
  • REFUSION’s flat trade-off frontier (Figure 6) confirms its ability to maintain performance under high parallelism, unlike LLaDA and Dream, which suffer sharp declines.
  • Case studies demonstrate REFUSION’s high parallelism and non-linear generation order, enabling efficient, human-like problem solving in code generation, with superior structure and quality compared to baselines.

Results show that REFUSION (Retrained) outperforms Dream-7B-Instruct across all benchmarks, achieving a 2.23% average performance gain and an 11.05× speedup. The authors use this comparison to demonstrate that REFUSION's architectural advantages are robust, even when trained with significantly fewer resources than the non-open-source baseline.

The authors use REFUSION to achieve a non-autoregressive generation approach that combines diffusion-based planning with causal infilling, enabling both high performance and efficiency. Results show that REFUSION outperforms MDM baselines like LLaDA and Dream-7B-Instruct in both speed and accuracy, while also challenging strong autoregressive models such as Qwen3-8B by delivering superior performance and a 2.33× average speedup across tasks.

The authors use a controlled comparison to evaluate REFUSION against retrained baselines, showing that REFUSION outperforms Qwen3-8B, LLaDA, and BD3-LMs across all benchmarks despite being trained on a smaller dataset. Results indicate that REFUSION achieves higher accuracy and faster inference, with a 1.9× speedup over Qwen3-8B on average, demonstrating that its architectural design enables superior performance even when data advantages are removed.

The authors use REFUSION, a masked diffusion model, to achieve state-of-the-art performance and efficiency across multiple benchmarks. Results show that REFUSION outperforms all autoregressive and masked diffusion baselines in both accuracy and throughput, with significant gains on coding and reasoning tasks, demonstrating its ability to break the traditional speed-quality trade-off.

The authors use REFUSION to achieve a balance between performance and efficiency, outperforming both autoregressive and masked diffusion models in throughput and accuracy. Results show that REFUSION, despite having fewer activated parameters than some baselines, achieves competitive or superior performance across benchmarks while maintaining high inference speed.

