The Flexibility Trap: Why Arbitrary-Order Generation Limits the Reasoning Potential of Diffusion Language Models

Abstract

Diffusion large language models (dLLMs) break the fixed left-to-right ordering constraint of conventional LLMs and can generate tokens in arbitrary order. Intuitively, this flexibility provides a solution space that strictly subsumes the fixed autoregressive paths, and should in theory enable superior reasoning on general tasks such as mathematics and coding. Accordingly, many studies have turned to reinforcement learning (RL) to elicit the reasoning abilities of dLLMs. In this paper we show a paradoxical fact: contrary to expectation, arbitrary-order generation in its current form narrows the reasoning boundary rather than broadening it. Our investigation finds that dLLMs tend to exploit this ordering flexibility to avoid the high-uncertainty tokens that are essential for exploration, causing the solution space to collapse prematurely. This observation fundamentally questions the premise of existing RL approaches for dLLMs: despite the substantial resources devoted to extremely difficult problems such as handling combinatorial trajectories and intractable likelihoods, the effort to preserve this flexibility may in fact be hindering improvements in reasoning performance. We show that deliberately abandoning arbitrary-order generation and applying standard Group Relative Policy Optimization (GRPO) elicits more effective reasoning. The proposed method, JustGRPO, is extremely simple yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) and fully preserves the parallel decoding capability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap

One-sentence Summary

Researchers from Tsinghua University and Alibaba Group reveal that arbitrary-order generation in diffusion LLMs (dLLMs) paradoxically narrows reasoning potential by bypassing high-uncertainty logical tokens. They propose JustGRPO, a minimalist RL method using standard autoregressive training that boosts performance (e.g., 89.1% on GSM8K) while preserving parallel decoding.

Key Contributions

  • Diffusion LLMs’ arbitrary-order generation, while theoretically expansive, paradoxically narrows reasoning potential by letting models bypass high-uncertainty tokens that are critical for exploring diverse solution paths, as measured by Pass@k on benchmarks like GSM8K and MATH.
  • The paper reveals that this “flexibility trap” stems from entropy degradation: models prioritize low-entropy tokens first, collapsing branching reasoning paths before they can be explored, unlike autoregressive decoding which forces confrontation with uncertainty at critical decision points.
  • To counter this, the authors propose JustGRPO — a minimalist method that trains dLLMs under standard autoregressive order using Group Relative Policy Optimization — achieving strong results (e.g., 89.1% on GSM8K) while preserving parallel decoding at inference, without complex diffusion-specific RL adaptations.

Introduction

The authors leverage diffusion language models (dLLMs), which theoretically support arbitrary token generation order, to challenge the assumption that this flexibility enhances reasoning. Prior work assumed arbitrary-order decoding could unlock richer reasoning paths, leading to complex reinforcement learning (RL) methods designed to handle combinatorial trajectories and intractable likelihoods — but these approaches often rely on unstable approximations. The authors reveal that, counterintuitively, arbitrary order causes models to bypass high-uncertainty tokens critical for exploring diverse reasoning paths, collapsing the solution space prematurely. Their main contribution, JustGRPO, discards arbitrary-order complexity and trains dLLMs using standard autoregressive RL (Group Relative Policy Optimization), achieving strong results (e.g., 89.1% on GSM8K) while preserving parallel decoding at inference.

Dataset

  • The authors use the official training splits of mathematical reasoning datasets, adhering to standard protocols from prior work (Zhao et al., 2025; Ou et al., 2025).
  • For code generation, they adopt AceCoder-87K (Zeng et al., 2025), then filter it using the DiffuCoder pipeline (Gong et al., 2025) to retain 21K challenging samples that include verifiable unit tests.
  • The data is used directly as training input, with no additional mixing ratios or cropping; beyond the described filtering, no metadata construction or further preprocessing is mentioned (a rough sketch of the unit-test filtering criterion follows this list).
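As a rough illustration of the verifiable-unit-test criterion, a hypothetical filter might look like the sketch below; the field names are assumptions, and the actual DiffuCoder pipeline also applies difficulty-based filtering that is not reproduced here.

```python
from typing import Iterable

def has_verifiable_tests(sample: dict) -> bool:
    """A sample is usable for verifiable-reward RL only if it ships
    a prompt and executable unit tests (hypothetical field names)."""
    return bool(sample.get("prompt")) and bool(sample.get("tests"))

def filter_dataset(samples: Iterable[dict]) -> list[dict]:
    """Keep only samples whose correctness can be checked automatically."""
    return [s for s in samples if has_verifiable_tests(s)]
```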

Method

The authors leverage a diffusion-based framework for language modeling, where the core mechanism operates through a masked diffusion process. The model, referred to as a Masked Diffusion Model (MDM), generates sequences by iteratively denoising a partially masked input state $x_t$, which is initialized from a fully masked sequence. This process is governed by a continuous time variable $t \in [0, 1]$ representing the masking ratio. In the forward process, each token in the clean sequence $x_0$ is independently masked with probability $t$, yielding the distribution $q(x_t^k \mid x_0^k)$, which either retains the original token or replaces it with a [MASK] token. Unlike traditional Gaussian diffusion models, MDMs directly predict the clean token at masked positions. A neural network $p_\theta(x_0 \mid x_t)$ estimates the original token distribution, and the model is trained by minimizing the negative evidence lower bound, which simplifies to a weighted cross-entropy loss over the masked tokens.
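As a concrete illustration of this objective, here is a minimal PyTorch sketch of the weighted cross-entropy form described above. The `MASK_ID` value, the Hugging-Face-style `model(x).logits` interface, and the exact loss normalization are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # placeholder [MASK] token id; depends on the actual tokenizer

def masked_diffusion_loss(model, x0):
    """Weighted cross-entropy over masked tokens, a common simplification
    of the masked-diffusion NELBO (sketch, not the released code)."""
    B, L = x0.shape
    # Sample a masking ratio t ~ U(0, 1) per sequence.
    t = torch.rand(B, 1, device=x0.device)
    # Forward process: each token is independently replaced by [MASK] w.p. t.
    mask = torch.rand(B, L, device=x0.device) < t
    xt = torch.where(mask, torch.full_like(x0, MASK_ID), x0)
    # The network predicts the clean-token distribution at every position.
    logits = model(xt).logits                                   # (B, L, V)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    # Only masked positions contribute; the 1/t weighting recovers the ELBO form.
    return ((ce * mask) / t).sum() / mask.sum().clamp(min=1)
```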

As shown in the figure below, generation can be constrained to an autoregressive (AR) order, in which tokens are produced strictly from left to right, or it can follow an arbitrary order, in which any masked position may be filled next. The AR order forces the model to confront uncertainty at each decision point, which benefits reasoning. The arbitrary order instead lets the model bypass uncertain positions and fill in easier ones first, which often leads to suboptimal outcomes, as the figure's "Too hard, bypass!" and "Now it is easier" annotations illustrate. This distinction underscores how strongly the generation order shapes reasoning performance.
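To make the contrast concrete, below is a minimal sketch (not the authors' implementation) of how a typical confidence-based dLLM sampler picks the next position to unmask versus a strict left-to-right rule; the tensor shapes and the `order` flag are illustrative assumptions.

```python
import torch

def predictive_entropy(logits):
    """Per-position entropy of the model's predicted token distribution."""
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)          # (num_positions,)

def pick_next_position(logits, masked, order="arbitrary"):
    """Choose which masked position to unmask next.

    'arbitrary' mimics common confidence-based samplers: fill the
    lowest-entropy (easiest) position first, which is exactly how
    high-uncertainty tokens get bypassed. 'ar' always takes the leftmost
    masked position, forcing the model to face uncertainty in order.
    """
    masked_idx = masked.nonzero(as_tuple=True)[0]      # indices of [MASK] slots
    if order == "ar":
        return masked_idx.min()
    ent = predictive_entropy(logits[masked_idx])
    return masked_idx[ent.argmin()]
```

Selecting the lowest-entropy position first is precisely the behavior that lets the model defer, and ultimately avoid, the high-uncertainty tokens discussed above.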

To bridge the gap between the sequence-level denoising architecture of diffusion models and the autoregressive policy framework, the authors propose a method called JustGRPO. This method explicitly forgoes arbitrary-order generation during the reinforcement learning stage, transforming the diffusion language model into a well-defined autoregressive policy $\pi_\theta^{\text{AR}}$. The autoregressive policy is defined by constructing an input state $\tilde{x}_t$ in which the past tokens are observed and the future tokens are masked. The probability of the next token $o_t$ given the history $o_{<t}$ is defined as the softmax of the model logits at the position corresponding to $o_t$. This formulation enables the direct application of standard Group Relative Policy Optimization (GRPO) to diffusion language models. The GRPO objective maximizes a clipped surrogate function with a KL regularization term, where the advantage is computed by standardizing the reward against the group statistics. This approach allows the model to achieve the reasoning depth of autoregressive models while preserving the inference speed of diffusion models.
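The construction above can be sketched as follows. This is an illustrative approximation rather than the released code: the `MASK_ID`, the model interface, and the one-forward-pass-per-token loop are simplifying assumptions.

```python
import torch

MASK_ID = 126336  # placeholder [MASK] token id

def ar_token_logprobs(model, prompt, completion):
    """Log-probs of a sampled completion under the dLLM viewed as an AR policy.

    For each step t, build a state where the prompt and tokens o_{<t} are
    observed and every later position is [MASK], then read the log-probability
    of o_t from the logits at position t. One forward pass per token is used
    here for clarity; a real implementation may batch or cache these calls.
    """
    P, T = len(prompt), len(completion)
    seq = torch.cat([prompt, completion]).unsqueeze(0)          # (1, P+T)
    logps = []
    for t in range(T):
        x = seq.clone()
        x[0, P + t:] = MASK_ID                                  # mask o_{>=t}
        logits = model(x).logits[0, P + t]                      # (V,)
        logps.append(torch.log_softmax(logits, dim=-1)[completion[t]])
    return torch.stack(logps)                                   # (T,)

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each reward against its group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate per token, with the sequence-level advantage broadcast
    to every token. A KL-to-reference penalty can be added on top, although
    the reported setup uses no KL penalty."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantage
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```

With token log-probabilities defined this way, the rest is the standard GRPO recipe: sample a group of completions per prompt, standardize their rewards into advantages, and optimize the clipped surrogate.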

Experiment

  • Evaluated reasoning potential via Pass@k on dLLMs (LLaDA-Instruct, Dream-Instruct, LLaDA 1.5) across GSM8K, MATH-500, HumanEval, MBPP: AR decoding outperforms arbitrary order in scaling with k, revealing broader solution space coverage (e.g., AR solves 21.3% more HumanEval problems at k=1024).
  • Identified “entropy degradation” in arbitrary order: bypassing high-entropy logical tokens (e.g., “Therefore”, “Since”) collapses reasoning paths into low-entropy, pattern-matching trajectories, reducing exploration.
  • Introduced JustGRPO: enforcing AR order during RL training on LLaDA-Instruct yields state-of-the-art results—89.1% on GSM8K (↑3.0% over SPG), 6.1% gain on MATH-500 over ESPO—with consistent gains across sequence lengths (128, 256, 512).
  • JustGRPO preserves parallel decoding: under the EB sampler, accuracy gains grow with parallelism (e.g., +25.5% on MBPP at ~5 tokens/step vs. +10.6% at 1 token/step), indicating a robust reasoning manifold.
  • Ablations confirm findings: smaller block sizes (more AR-like) improve Pass@k; higher temperatures help arbitrary order but can’t match AR; advanced samplers correlate highly with AR (0.970) but still underperform.
  • Training efficiency: JustGRPO surpasses approximation-based ESPO in accuracy-wall-clock time trade-off; heuristic gradient restriction to top-25% entropy tokens accelerates convergence without performance loss.

The authors use JustGRPO to train diffusion language models with an autoregressive constraint during reinforcement learning, achieving state-of-the-art performance across multiple reasoning and coding benchmarks. Results show that this approach consistently outperforms methods designed for arbitrary-order decoding, with significant accuracy gains on GSM8K, MATH-500, HumanEval, and MBPP, while also preserving the model's parallel decoding capabilities at inference.

The authors use Pass@k to measure reasoning potential, showing that while arbitrary order performs competitively at k=1, AR order demonstrates significantly stronger scaling behavior as the number of samples increases. Results show that JustGRPO achieves state-of-the-art performance across all benchmarks, outperforming prior methods on GSM8K, MATH-500, HumanEval, and MBPP, with consistent gains across different generation lengths.
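For reference, Pass@k is commonly computed with the standard unbiased estimator introduced alongside HumanEval; the summary does not state the exact evaluation script, so the following assumes that standard definition.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: the probability that at least one of k
    samples drawn from n generated samples is correct, given that c of
    the n samples passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 1024 samples per problem, 300 of them correct, evaluated at k=64.
print(pass_at_k(n=1024, c=300, k=64))
```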

The authors use a system-level comparison to evaluate the performance of JustGRPO against existing reinforcement learning methods on reasoning and coding benchmarks. Results show that JustGRPO achieves state-of-the-art performance across all tasks and sequence lengths, outperforming previous methods such as SPG and ESPO, particularly on GSM8K and MATH-500, indicating that enforcing autoregressive order during training enhances reasoning capabilities.

The authors use the GRPO algorithm with a base model of LLaDA 8B Instruct, training it for 125 steps with a constant learning rate of 5 × 10⁻⁶ and a group size of 16. The model achieves strong performance with a sampling temperature of 1.0 and no KL penalty, indicating that exact likelihood computation during training is effective despite higher computational cost.
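Put together, the reported setup corresponds to a configuration roughly like the sketch below; the field names and any value not stated above (such as the clipping range) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class JustGRPOConfig:
    # Values reported in the summary above.
    base_model: str = "LLaDA-8B-Instruct"
    train_steps: int = 125
    learning_rate: float = 5e-6      # constant schedule
    group_size: int = 16             # completions per prompt for GRPO
    temperature: float = 1.0         # rollout sampling temperature
    kl_coeff: float = 0.0            # no KL penalty
    # Unreported value, shown only as a placeholder assumption.
    clip_eps: float = 0.2
```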

