RAGEN-2: Reasoning Collapse in Agentic RL
Abstract
RL training of multi-turn LLM agents is inherently unstable, and the quality of reasoning directly determines task performance. Entropy is commonly used to track reasoning stability. However, entropy only measures diversity within the same input and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even under stable entropy, models can fall back on fixed templates that appear diverse but are input-agnostic. We call this "template collapse", a failure mode that remains invisible to entropy and all existing metrics.

To diagnose this failure, we decompose reasoning quality into within-input diversity (entropy) and cross-input distinguishability (mutual information, MI), and introduce a family of mutual information proxies for online diagnosis. Across tasks, mutual information correlates substantially more strongly with final performance than entropy, making it a more reliable proxy for reasoning quality.

We further explain template collapse through a signal-to-noise ratio (SNR) mechanism: low reward variance weakens task gradients, causing regularization terms to dominate and erase cross-input differences in reasoning. To address this, we propose SNR-Aware Filtering, which selects high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, this method consistently improves both input dependence and task performance.
One-sentence Summary
By identifying template collapse as a failure mode where agentic reinforcement learning models adopt input-agnostic reasoning patterns despite stable entropy, the RAGEN-2 study proposes using mutual information proxies for diagnosis and introduces SNR-Aware Filtering to improve performance across planning, math reasoning, web navigation, and code execution tasks.
Key Contributions
- The paper introduces the concept of template collapse, a failure mode in multi-turn LLM agent training where models rely on input-agnostic reasoning templates that appear diverse but do not respond to specific inputs.
- This work presents a new diagnostic framework that decomposes reasoning quality into within-input diversity and cross-input distinguishability through a family of mutual information proxies. These proxies demonstrate a stronger correlation with final task performance than traditional entropy metrics.
- The researchers propose SNR-Aware Filtering, a method that uses reward variance as a proxy for signal strength to select high-signal prompts during training. Experiments across planning, math, web navigation, and code execution show that this approach improves both input dependence and overall task performance.
Introduction
Training multi-turn LLM agents using reinforcement learning is a critical task for developing autonomous reasoning systems, but it is inherently unstable. While researchers typically use entropy to monitor reasoning stability, entropy only measures diversity within a single input and fails to detect when a model begins to rely on fixed, input-agnostic templates. The authors identify this phenomenon as template collapse, a failure mode where reasoning appears diverse but loses its dependence on the specific input. To address this, the authors leverage a mutual information (MI) proxy to diagnose input dependence and introduce SNR-Aware Filtering, which uses reward variance to select high-signal prompts and maintain effective task gradients during training.
Dataset

- Dataset Composition and Sources: The authors utilize a diverse testbed of seven synthetic, fully controllable environments designed to evaluate various decision-making regimes. The environments include Sokoban (grid puzzles), FrozenLake (navigation), MetaMathQA (mathematical reasoning), Countdown (arithmetic games), SearchQA (multi-turn search), WebShop (e-commerce navigation), and DeepCoder (program synthesis). DeepCoder specifically draws from PrimeIntellect, TACO, and LiveCodeBench v5.
- Key Subset Details:
- Sokoban: Uses procedurally generated puzzles with configurable dimensions and box counts to study irreversible planning.
- FrozenLake: A navigation task featuring a 2% random transition rate to simulate stochastic dynamics and sparse rewards.
- MetaMathQA: A math QA task where correctness is determined by exact matches with ground truth.
- Countdown: A compositional arithmetic task where agents must construct expressions to reach a target number.
- SearchQA: A multi-turn environment requiring iterative web search and information synthesis.
- WebShop: An interactive e-commerce simulation with a large action space and realistic product catalogs.
- DeepCoder: A coding benchmark where agents generate Python functions to pass specific test cases.
- Training and Usage: The authors train a Qwen2.5-3B model using the veRL/HybridFlow stack. The training process involves comparing different RL algorithms, including PPO, DAPO, GRPO, and Dr. GRPO, for up to 400 rollout-update iterations. In each iteration, the model collects 128 trajectories per environment using a prompt batch size of 8 and a group size of 16 trajectories per prompt.
- Processing and Reward Engineering:
- Reward Shaping: The authors apply specific reward structures to guide learning, such as a diminishing reward scheme for MetaMathQA (halving the reward for each subsequent retry) and multi-tier rewards for Countdown based on format and solution correctness.
- SNR-Aware Filtering: When applying this filtering technique, the authors reduce the effective minibatch size by the keep rate and scale the per-step loss accordingly to maintain a comparable optimization step size.
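The reward shaping described above can be sketched minimally as follows. The halving schedule for MetaMathQA retries follows the text directly; the specific tier values for Countdown are illustrative assumptions, since the paper only states that rewards are tiered by format and solution correctness.

```python
def metamath_reward(correct: bool, retry_index: int, base: float = 1.0) -> float:
    """Diminishing reward scheme: the reward is halved for each subsequent
    retry (retry_index 0 = first attempt), and zero if the answer is wrong."""
    return base * (0.5 ** retry_index) if correct else 0.0

def countdown_reward(format_ok: bool, solution_ok: bool) -> float:
    """Multi-tier reward based on format and solution correctness.
    The tier values (0.0 / 0.1 / 1.0) are assumptions for illustration."""
    if not format_ok:
        return 0.0
    return 1.0 if solution_ok else 0.1
```

For example, a correct answer on the third attempt under this scheme earns `metamath_reward(True, 2) == 0.25` of the base reward.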
Method
The authors address the challenge of template collapse in closed-loop multi-turn agent reinforcement learning, where a policy π_θ generates reasoning tokens z_t and executable actions a_t in response to observations o_t, forming trajectories τ = {(o_t, z_t, a_t, r_t)}_{t=1}^{T}. A key insight is that standard reinforcement learning objectives, such as PPO or GRPO, apply uniform regularization (e.g., KL divergence, entropy bonus) across all inputs, which can inadvertently promote input-agnostic reasoning. This phenomenon is characterized by low mutual information I(X;Z) between the input prompt X and the generated reasoning Z, indicating that the model fails to adapt its reasoning to the specific problem. The authors formalize this problem through a signal-to-noise ratio (SNR) analysis of policy gradients, identifying low within-prompt reward variance as the primary cause of template collapse.

The framework for understanding reasoning regimes is established by analyzing two key dimensions: within-input diversity, measured by conditional entropy H(Z|X), and input dependence, measured by mutual information I(X;Z). As shown in the figure above, these two axes define four distinct reasoning regimes. High H(Z|X) and high I(X;Z) correspond to diverse and input-grounded reasoning, where the model adapts its thought process to the specific input. Conversely, low H(Z|X) and low I(X;Z) define a "Low-Entropy Collapse" regime, characterized by deterministic, template-like responses that are weakly input-grounded. The authors argue that the standard practice of using entropy regularization to increase H(Z|X) can be counterproductive, as it may not increase I(X;Z) and can even cause it to decrease, as formalized in Theorem M.2. The core mechanism of template collapse is a dominance of reward-agnostic regularization over task-relevant signal, which is particularly pronounced on prompts with low reward variance.
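The decomposition above can be made concrete with simple plug-in estimators over discretized reasoning samples, using the identity I(X;Z) = H(Z) - H(Z|X). This is a sketch only; the paper's actual MI proxies may be computed differently, and plug-in estimates are biased for small sample counts.

```python
import math
from collections import Counter

def entropy(counts: Counter) -> float:
    """Plug-in Shannon entropy (bits) of an empirical distribution."""
    n = sum(counts.values())
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def mi_proxy(samples: dict) -> tuple:
    """samples maps each input x to a list of discretized reasoning samples z.
    Returns (H(Z|X), I(X;Z)), with I(X;Z) estimated as H(Z) - H(Z|X)."""
    marginal = Counter()
    h_cond, total = 0.0, 0
    for x, zs in samples.items():
        marginal.update(zs)
        h_cond += len(zs) * entropy(Counter(zs))
        total += len(zs)
    h_cond /= total
    return h_cond, entropy(marginal) - h_cond

# Diverse but input-agnostic (template collapse): high H(Z|X), zero I(X;Z)
print(mi_proxy({"x1": ["A", "B", "A", "B"], "x2": ["A", "B", "A", "B"]}))  # → (1.0, 0.0)
# Input-grounded: each input gets its own reasoning, so I(X;Z) is high
print(mi_proxy({"x1": ["A"] * 4, "x2": ["B"] * 4}))  # → (0.0, 1.0)
```

The first case illustrates why entropy alone is blind to template collapse: H(Z|X) is maximal even though the reasoning ignores the input entirely.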

The authors provide a detailed gradient-level explanation of this phenomenon, illustrated in the figure above. In high-SNR regimes, the task gradient g_task is strong and distinct, representing a clear signal for improving the policy. This signal is amplified by the reward variance, as shown by the Cauchy-Schwarz bound in Theorem H.2. The regularization gradient g_reg acts as a noise term, but its influence is outweighed by the strong task signal. In contrast, on low reward-variance (RV) prompts, the task gradient g_task weakens significantly while the regularization gradient g_reg remains constant. The total update is then dominated by input-agnostic regularization noise, pushing the policy toward a state of low mutual information I(X;Z). This is visualized as a weak, ambiguous direction for the update gradient g, which can lead to policy drift away from the optimal policy θ*.
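A minimal illustration of the vanishing task gradient, assuming GRPO-style group-normalized advantages (the paper compares GRPO among other algorithms, but this exact normalization is an assumption here): when all rewards within a prompt's group are identical, every advantage is zero, so the task gradient contributes nothing and only the regularization terms move the policy.

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps).
    Zero within-prompt reward variance => all advantages are exactly zero,
    so the policy-gradient task signal vanishes for that prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # → [0.0, 0.0, 0.0, 0.0]
```

With any spread in the rewards, e.g. `group_advantages([0.0, 1.0])`, the advantages are approximately [-1, 1] and the task gradient reasserts itself.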

To mitigate this issue, the authors propose a method called SNR-Aware Filtering. This approach directly addresses the root cause by selecting training examples based on their signal quality. The workflow, depicted in the figure above, operates in three steps. First, during sampling, the policy generates multiple trajectories for each prompt x. Second, the within-prompt reward variance RV(x) is computed for each prompt as a proxy for the task signal strength. Prompts with low variance are identified as having weak signal. Third, a filtering mechanism is applied to retain only the high-signal prompts. The authors use a top-p filtering strategy, which ranks prompts by descending reward variance and retains the smallest prefix whose cumulative variance mass reaches a fraction ρ of the total variance mass. This adaptive selection ensures that the policy update is concentrated on high-SNR prompts, effectively filtering out low-variance rollouts that would be dominated by input-agnostic regularization. This process prevents the degradation of I(X;Z) and restores input-conditioned reasoning.
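The top-p selection step described above can be sketched as follows, under the assumption that trajectories have already been rolled out and per-prompt reward variances computed; the zero-total-variance fallback is an assumption not specified in the text.

```python
def snr_topp_filter(rv_by_prompt: dict, rho: float = 0.9) -> list:
    """Top-p filtering on reward variance: rank prompts by descending RV and
    keep the smallest prefix whose cumulative variance mass reaches a
    fraction rho of the total variance mass."""
    ranked = sorted(rv_by_prompt.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(rv for _, rv in ranked)
    if total == 0:
        return []  # assumption: no signal anywhere, skip the batch
    kept, cum = [], 0.0
    for prompt, rv in ranked:
        kept.append(prompt)
        cum += rv
        if cum >= rho * total:
            break
    return kept

# Prompts a..d carry 50%, 30%, 15%, and 5% of the variance mass;
# with rho = 0.9 the low-signal prompt d is filtered out.
print(snr_topp_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, rho=0.9))  # → ['a', 'b', 'c']
```

Because the cutoff adapts to the variance distribution rather than using a fixed count, batches where signal is concentrated in a few prompts keep fewer of them, matching the adaptive selection described above.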
Experiment
The experiments evaluate the phenomenon of template collapse in reinforcement learning agents by analyzing gradient dynamics and mutual information across various tasks, algorithms, and model scales. The results demonstrate that low reward variance causes task-discriminative gradients to be overwhelmed by input-agnostic regularization, leading to reasoning that is fluent but ignores input specifics. Implementing SNR-Aware Top-p filtering consistently improves task performance and preserves information content by prioritizing high-signal updates, proving more effective than entropy-based diagnostics or regularization alone.
The table outlines key characteristics of the environments used in the experiments, including their stochasticity, turn structure, state representation, and reward type. Environments vary in stochasticity (stochastic vs. deterministic), turn structure (multi-turn vs. single-turn), state representation (grid-based vs. text-based), and reward type (dense vs. binary). These features categorize the tasks and inform the experimental setup.

The authors compare different intervention strategies during RL training, showing that SNR-Aware Filtering preserves task performance and reasoning diversity while preventing the decline in mutual information that occurs without filtering. In the no-filter baseline, task success drops sharply after an initial peak, retrieval accuracy declines, and reasoning entropy rises significantly, signaling template collapse and a loss of input-specific reasoning. With filtering, retrieval accuracy, mutual information, and reasoning diversity remain stable and entropy stays low throughout training.

The experiment evaluates the impact of reward variance (RV) on model performance by grouping prompts into quartiles based on RV. Task performance and the mutual information (MI) proxy decline monotonically as RV decreases: the highest-RV quartile achieves the best task performance and MI, while the lowest-RV quartile is weakest on both. This indicates that higher reward variance correlates with better learning outcomes.

The authors evaluate SNR-Aware Filtering across RL algorithms, model scales, and input modalities. Filtering consistently increases peak task success rates in most experimental settings, with gains for both text-only and image-conditioned models, demonstrating its effectiveness as a general-purpose method for improving learning efficiency.

The table outlines the mutual information proxy metrics used to assess reasoning quality during agent training. The proxies vary in formulation and computational approach: some are retrieval-based (e.g., argmax-based selection), while others produce normalized continuous scores or entropy-based estimates from marginal differences. All are designed to detect template collapse by measuring input dependence and reasoning diversity, and some, like MI-ZScore-EMA, incorporate smoothing and normalization to track MI dynamics more robustly during training.
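As a hypothetical sketch of what an MI-ZScore-EMA-style proxy could look like, the tracker below maintains exponential moving averages of a raw MI estimate's mean and variance and reports each new reading as a z-score; the paper's exact formulation is not given here, and the smoothing factor `beta` is an assumed parameter.

```python
class MIZScoreEMA:
    """Hypothetical sketch: EMA-smoothed z-score normalization of a stream
    of raw MI-proxy readings, so that sudden drops in input dependence show
    up as large negative z-scores during training."""
    def __init__(self, beta: float = 0.9, eps: float = 1e-8):
        self.beta, self.eps = beta, eps
        self.mean, self.var = None, 0.0

    def update(self, mi_raw: float) -> float:
        if self.mean is None:          # first reading initializes the EMA
            self.mean = mi_raw
            return 0.0
        self.mean = self.beta * self.mean + (1 - self.beta) * mi_raw
        self.var = self.beta * self.var + (1 - self.beta) * (mi_raw - self.mean) ** 2
        return (mi_raw - self.mean) / (self.var ** 0.5 + self.eps)
```

A steady MI stream yields z-scores near zero, while a reading well above or below the running average produces a correspondingly signed z-score.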

The experiments evaluate various reinforcement learning environments and intervention strategies to assess their impact on task performance and reasoning diversity. The results demonstrate that SNR-Aware Filtering prevents template collapse and maintains stable reasoning quality, whereas unfiltered training leads to a decline in mutual information and retrieval accuracy. Furthermore, the findings show that higher reward variance correlates with improved learning outcomes and that the filtering method consistently enhances peak success rates across different algorithms, model scales, and input modalities.