한 달 전

Zihan Wang Chi Gui Xing Jin Qineng Wang Licheng Liu Kangrui Wang Shiqi Chen Linjie Li Zhengyuan Yang Pingyue Zhang

초록

다중 턴(multi-turn) LLM agent의 RL training은 본질적으로 불안정하며, 추론(reasoning)의 품질이 작업 성능을 직접적으로 결정합니다. Entropy는 추론의 안정성을 추적하는 데 널리 사용되지만, Entropy는 동일한 입력 내에서의 다양성만을 측정할 뿐, 추론이 실제로 서로 다른 입력에 적절히 반응하는지 여부는 알려주지 못합니다. RAGEN-2 연구를 통해 우리는 Entropy가 안정적이더라도 모델이 겉으로는 다양해 보이지만 입력과는 무관한(input-agnostic) 고정된 템플릿에 의존할 수 있음을 발견했습니다. 우리는 이를 '템플릿 붕괴(template collapse)'라고 명명하였으며, 이는 Entropy 및 기존의 모든 metric으로는 포착할 수 없는 실패 모드입니다.이러한 실패를 진단하기 위해, 우리는 추론 품질을 입력 내 다양성(Entropy)과 입력 간 구별 가능성(Mutual Information, MI)으로 분해하고, 온라인 진단을 위한 일련의 mutual information proxy들을 도입했습니다. 다양한 작업에 걸쳐 실험한 결과, Mutual Information은 Entropy보다 최종 성능과 훨씬 더 강력한 상관관계를 보였으며, 이는 추론 품질을 나타내는 더욱 신뢰할 수 있는 proxy임을 입증합니다.나아가 우리는 신호 대 잡음비(SNR) 메커니즘을 통해 템플릿 붕괴 현상을 설명합니다. 낮은 reward variance는 task gradient를 약화시켜, regularization term이 지배적으로 작용하게 함으로써 입력 간의 추론 차이를 지워버립니다. 이를 해결하기 위해, 우리는 reward variance를 경량 proxy로 사용하여 iteration마다 고신호(high-signal) prompt를 선택하는 SNR-Aware Filtering을 제안합니다. Planning, math reasoning, web navigation, code execution 등 다양한 영역에서 이 방법은 입력 의존성(input dependence)과 작업 성능을 모두 일관되게 향상시켰습니다.

One-sentence Summary

By identifying template collapse as a failure mode where agentic reinforcement learning models adopt input-agnostic reasoning patterns despite stable entropy, the RAGEN-2 study proposes using mutual information proxies for diagnosis and introduces SNR-Aware Filtering to improve performance across planning, math reasoning, web navigation, and code execution tasks.

Key Contributions

The paper introduces the concept of template collapse, a failure mode in multi-turn LLM agent training where models rely on input-agnostic reasoning templates that appear diverse but do not respond to specific inputs.
This work presents a new diagnostic framework that decomposes reasoning quality into within-input diversity and cross-input distinguishability through a family of mutual information proxies. These proxies demonstrate a stronger correlation with final task performance than traditional entropy metrics.
The researchers propose SNR-Aware Filtering, a method that uses reward variance as a proxy for signal strength to select high-signal prompts during training. Experiments across planning, math, web navigation, and code execution show that this approach improves both input dependence and overall task performance.

Introduction

Training multi-turn LLM agents using reinforcement learning is a critical task for developing autonomous reasoning systems, but it is inherently unstable. While researchers typically use entropy to monitor reasoning stability, entropy only measures diversity within a single input and fails to detect when a model begins to rely on fixed, input-agnostic templates. The authors identify this phenomenon as template collapse, a failure mode where reasoning appears diverse but loses its dependence on the specific input. To address this, the authors leverage a mutual information (MI) proxy to diagnose input dependence and introduce SNR-Aware Filtering, which uses reward variance to select high-signal prompts and maintain effective task gradients during training.

Dataset

Dataset Composition and Sources: The authors utilize a diverse testbed of seven synthetic, fully controllable environments designed to evaluate various decision-making regimes. The environments include Sokoban (grid puzzles), FrozenLake (navigation), MetaMathQA (mathematical reasoning), Countdown (arithmetic games), SearchQA (multi-turn search), WebShop (e-commerce navigation), and DeepCoder (program synthesis). DeepCoder specifically draws from PrimeIntellect, TACO, and LiveCodeBench v5.
Key Subset Details:
- Sokoban: Uses procedurally generated puzzles with configurable dimensions and box counts to study irreversible planning.
- FrozenLake: A navigation task featuring a 2% random transition rate to simulate stochastic dynamics and sparse rewards.
- MetaMathQA: A math QA task where correctness is determined by exact matches with ground truth.
- Countdown: A compositional arithmetic task where agents must construct expressions to reach a target number.
- SearchQA: A multi-turn environment requiring iterative web search and information synthesis.
- WebShop: An interactive e-commerce simulation with a large action space and realistic product catalogs.
- DeepCoder: A coding benchmark where agents generate Python functions to pass specific test cases.
Training and Usage: The authors train a Qwen2.5-3B model using the veRL/HybridFlow stack. The training process involves comparing different RL algorithms, including PPO, DAPO, GRPO, and Dr. GRPO, for up to 400 rollout-update iterations. In each iteration, the model collects 128 trajectories per environment using a prompt batch size of 8 and a group size of 16 trajectories per prompt.
Processing and Reward Engineering:
- Reward Shaping: The authors apply specific reward structures to guide learning, such as a diminishing reward scheme for MetaMathQA (halving the reward for each subsequent retry) and multi-tier rewards for Countdown based on format and solution correctness.
- SNR-Aware Filtering: When applying this filtering technique, the authors reduce the effective minibatch size by the keep rate and scale the per-step loss accordingly to maintain a comparable optimization step size.

Method

The authors address the challenge of template collapse in closed-loop multi-turn agent reinforcement learning, where a policy $\pi_{\theta}$ generates reasoning tokens $z_t$ and executable actions $a_t$ in response to observations $o_t$ , forming trajectories $\tau = \{o_t, z_t, a_t, r_t\}_{t=1}^{T}$ . A key insight is that standard reinforcement learning objectives, such as PPO or GRPO, apply uniform regularization (e.g., KL divergence, entropy bonus) across all inputs, which can inadvertently promote input-agnostic reasoning. This phenomenon is characterized by a low mutual information $I(X;Z)$ between the input prompt $X$ and the generated reasoning $Z$ , indicating the model fails to adapt its reasoning to the specific problem. The authors formalize this problem through a signal-to-noise ratio (SNR) analysis of policy gradients, identifying low within-prompt reward variance as the primary cause of template collapse.

Four Quandrants of Reasoning on Entropy H and MI I

The framework for understanding reasoning regimes is established by analyzing two key dimensions: within-input diversity, measured by conditional entropy $H(Z \mid X)$ , and input dependence, measured by mutual information $I(X;Z)$ . As shown in the figure above, these two axes define four distinct reasoning regimes. High $H(Z \mid X)$ and high $I(X;Z)$ correspond to diverse and input-grounded reasoning, where the model adapts its thought process to the specific input. Conversely, low $H(Z \mid X)$ and low $I(X;Z)$ define a "Low-Entropy Collapse" regime, characterized by deterministic, template-like responses that are weakly input-grounded. The authors argue that the standard practice of using entropy regularization to increase $H(Z \mid X)$ can be counterproductive, as it may not increase $I(X;Z)$ and can even cause it to decrease, as formalized in Theorem M.2. The core mechanism of template collapse is a dominance of reward-agnostic regularization over task-relevant signal, which is particularly pronounced on prompts with low reward variance.

Task-Relevant and Task-Agnostic Gradients

The authors provide a detailed gradient-level explanation of this phenomenon, illustrated in the figure above. In high-SNR regimes, the task gradient $g_{\text{task}}$ is strong and distinct, representing a clear signal to improve the policy. This strong signal is amplified by the reward variance, as shown by the Cauchy-Schwarz inequality bound in Theorem H.2. The regularization gradient $g_{\text{reg}}$ acts as a noise term, but its influence is outweighed by the strong task signal. In contrast, on low-RV prompts, the task gradient $g_{\text{task}}$ weakens significantly, while the regularization gradient $g_{\text{reg}}$ remains constant. This leads to a situation where the total update is dominated by the input-agnostic regularization noise, pushing the policy towards a state of low mutual information $I(X;Z)$ . This is visualized as a weak, ambiguous direction for the update gradient $g$ , which can lead to policy drift away from the optimal policy $\theta^*$ .

To mitigate this issue, the authors propose a method called SNR-Aware Filtering. This approach directly addresses the root cause by selecting training examples based on their signal quality. The workflow, depicted in the figure above, operates in three steps. First, during sampling, the policy generates multiple trajectories for each prompt $x$ . Second, the within-prompt reward variance $\text{RV}(x)$ is computed for each prompt as a proxy for the task signal strength. Prompts with low variance are identified as having weak signal. Third, a filtering mechanism is applied to retain only the high-signal prompts. The authors use a top- $p$ filtering strategy, which ranks prompts by descending reward variance and retains the smallest prefix whose cumulative variance mass reaches a fraction $\rho$ of the total variance mass. This adaptive selection ensures that the policy update is concentrated on high-SNR prompts, effectively filtering out low-variance rollouts that would be dominated by input-agnostic regularization. This process prevents the degradation of $I(X;Z)$ and restores input-conditioned reasoning.

Experiment

The experiments evaluate the phenomenon of template collapse in reinforcement learning agents by analyzing gradient dynamics and mutual information across various tasks, algorithms, and model scales. The results demonstrate that low reward variance causes task-discriminative gradients to be overwhelmed by input-agnostic regularization, leading to reasoning that is fluent but ignores input specifics. Implementing SNR-Aware Top-p filtering consistently improves task performance and preserves information content by prioritizing high-signal updates, proving more effective than entropy-based diagnostics or regularization alone.

The the the table outlines key characteristics of different environments used in the experiments, including their stochasticity, turn structure, state representation, and reward type. These features help categorize the tasks and inform the experimental setup. Environments vary in stochasticity, with some being stochastic and others deterministic. Multi-turn tasks involve multiple interaction steps, while single-turn tasks have a single step. State representations differ between grid-based and text-based formats, and reward types are either dense or binary.

The authors compare different intervention strategies in reinforcement learning training, showing that SNR-Aware Filtering preserves task performance and reasoning diversity while preventing the decline in mutual information that occurs with no filtering. Without filtering, task success drops sharply after an initial peak, retrieval accuracy declines, and reasoning entropy increases, indicating template collapse. In contrast, filtering maintains stable retrieval accuracy and low entropy throughout training. SNR-Aware Filtering prevents the decline in task performance and retrieval accuracy seen in the no-filter baseline. Without filtering, reasoning entropy increases significantly, signaling a loss of input-specific reasoning. Filtering maintains stable mutual information and reasoning diversity, avoiding the degradation observed in unfiltered training.

The experiment evaluates the impact of reward variance (RV) on model performance by grouping prompts into quartiles based on RV. Results show that task performance and mutual information (MI) proxy decrease monotonically as RV decreases, indicating that higher reward variance correlates with better learning outcomes. The lowest RV quartile exhibits the weakest task performance and MI, while the highest RV quartile achieves the best results. Task performance and MI proxy decline monotonically across quartiles as reward variance decreases. The highest reward variance quartile achieves the best task performance and MI proxy. The lowest reward variance quartile shows the weakest task performance and MI proxy.

The authors evaluate SNR-Aware Filtering across various RL algorithms, model scales, and input modalities. Results show that filtering consistently improves peak task success rates across different configurations, demonstrating its effectiveness as a general-purpose method to enhance learning efficiency. SNR-Aware Filtering improves performance across multiple RL algorithms and model scales. The method consistently increases peak task success rates in most experimental settings. Gains are observed across different input modalities, including text and image-conditioned models.

SNR-Aware Filtering improves performance

The the the table outlines different mutual information proxy metrics used to assess reasoning quality in agent training. These proxies vary in their formulation and computational approach, with some focusing on retrieval accuracy and others on normalized scores or entropy-based estimates. The metrics are designed to detect template collapse by measuring input dependence and reasoning diversity. The the the table lists multiple MI proxy metrics, including retrieval-based and normalized continuous scores, to evaluate reasoning quality. Different proxies use distinct formulas, such as argmax-based selection or marginal differences, to estimate mutual information. Some proxies, like MI-ZScore-EMA, incorporate smoothing and normalization to track MI dynamics more robustly during training.

The experiments evaluate various reinforcement learning environments and intervention strategies to assess their impact on task performance and reasoning diversity. The results demonstrate that SNR-Aware Filtering prevents template collapse and maintains stable reasoning quality, whereas unfiltered training leads to a decline in mutual information and retrieval accuracy. Furthermore, the findings show that higher reward variance correlates with improved learning outcomes and that the filtering method consistently enhances peak success rates across different algorithms, model scales, and input modalities.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

한 달 전

Zihan Wang Chi Gui Xing Jin Qineng Wang Licheng Liu Kangrui Wang Shiqi Chen Linjie Li Zhengyuan Yang Pingyue Zhang

초록

One-sentence Summary

Key Contributions

The paper introduces the concept of template collapse, a failure mode in multi-turn LLM agent training where models rely on input-agnostic reasoning templates that appear diverse but do not respond to specific inputs.
This work presents a new diagnostic framework that decomposes reasoning quality into within-input diversity and cross-input distinguishability through a family of mutual information proxies. These proxies demonstrate a stronger correlation with final task performance than traditional entropy metrics.
The researchers propose SNR-Aware Filtering, a method that uses reward variance as a proxy for signal strength to select high-signal prompts during training. Experiments across planning, math, web navigation, and code execution show that this approach improves both input dependence and overall task performance.

Introduction

Dataset

Dataset Composition and Sources: The authors utilize a diverse testbed of seven synthetic, fully controllable environments designed to evaluate various decision-making regimes. The environments include Sokoban (grid puzzles), FrozenLake (navigation), MetaMathQA (mathematical reasoning), Countdown (arithmetic games), SearchQA (multi-turn search), WebShop (e-commerce navigation), and DeepCoder (program synthesis). DeepCoder specifically draws from PrimeIntellect, TACO, and LiveCodeBench v5.
Key Subset Details:
- Sokoban: Uses procedurally generated puzzles with configurable dimensions and box counts to study irreversible planning.
- FrozenLake: A navigation task featuring a 2% random transition rate to simulate stochastic dynamics and sparse rewards.
- MetaMathQA: A math QA task where correctness is determined by exact matches with ground truth.
- Countdown: A compositional arithmetic task where agents must construct expressions to reach a target number.
- SearchQA: A multi-turn environment requiring iterative web search and information synthesis.
- WebShop: An interactive e-commerce simulation with a large action space and realistic product catalogs.
- DeepCoder: A coding benchmark where agents generate Python functions to pass specific test cases.
Training and Usage: The authors train a Qwen2.5-3B model using the veRL/HybridFlow stack. The training process involves comparing different RL algorithms, including PPO, DAPO, GRPO, and Dr. GRPO, for up to 400 rollout-update iterations. In each iteration, the model collects 128 trajectories per environment using a prompt batch size of 8 and a group size of 16 trajectories per prompt.
Processing and Reward Engineering:
- Reward Shaping: The authors apply specific reward structures to guide learning, such as a diminishing reward scheme for MetaMathQA (halving the reward for each subsequent retry) and multi-tier rewards for Countdown based on format and solution correctness.
- SNR-Aware Filtering: When applying this filtering technique, the authors reduce the effective minibatch size by the keep rate and scale the per-step loss accordingly to maintain a comparable optimization step size.

Method

Experiment

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

RAGEN-2: Agentic RL에서의 Reasoning Collapse

Zihan Wang Chi Gui Xing Jin Qineng Wang Licheng Liu Kangrui Wang Shiqi Chen Linjie Li Zhengyuan Yang Pingyue Zhang6 more

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

RAGEN-2: Agentic RL에서의 Reasoning Collapse

Zihan Wang Chi Gui Xing Jin Qineng Wang Licheng Liu Kangrui Wang Shiqi Chen Linjie Li Zhengyuan Yang Pingyue Zhang6 more

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

RAGEN-2: Agentic RL에서의 Reasoning Collapse

Zihan Wang Chi Gui Xing Jin Qineng Wang Licheng Liu Kangrui Wang Shiqi Chen Linjie Li Zhengyuan Yang Pingyue Zhang6 more

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Zihan Wang Chi Gui Xing Jin Qineng Wang Licheng Liu Kangrui Wang Shiqi Chen Linjie Li Zhengyuan Yang Pingyue Zhang

Zihan Wang Chi Gui Xing Jin Qineng Wang Licheng Liu Kangrui Wang Shiqi Chen Linjie Li Zhengyuan Yang Pingyue Zhang

Zihan Wang Chi Gui Xing Jin Qineng Wang Licheng Liu Kangrui Wang Shiqi Chen Linjie Li Zhengyuan Yang Pingyue Zhang