HyperAI

How Far Can Unsupervised RLVR Scale LLM Training?

Abstract

Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR) offers a path to scaling Large Language Model (LLM) training beyond the bottlenecks of the supervised phase by deriving rewards without ground-truth labels. Recent work exploits the model's intrinsic signals and shows promising early progress, yet its potential and limits remain unclear. In this work we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory, and extensive experiments. We first classify URLVR methods by reward source into intrinsic versus external approaches, then establish a unified theoretical framework showing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when the model's initial confidence aligns with correctness, but fails catastrophically when that alignment is absent. Through systematic experiments we show that intrinsic rewards consistently follow a rise-then-fall pattern across all methods, with the collapse point determined by the model's prior rather than by engineering choices. Despite these scaling limits, we find that intrinsic rewards remain valuable for test-time training on small datasets. We propose the Model Collapse Step as a measure of the model prior, serving as a practical indicator of RL trainability. Finally, we examine external reward methods that ground verification in computational asymmetries and provide initial evidence that they may circumvent the confidence-correctness ceiling. Our results draw boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.

One-sentence Summary

Researchers from Tsinghua University and collaborating institutes reveal that intrinsic unsupervised RLVR methods inevitably cause model collapse by sharpening initial distributions, proposing the Model Collapse Step metric to predict trainability while advocating external rewards for scalable LLM training.

Key Contributions

  • The paper establishes a taxonomy for unsupervised RLVR and provides a unified theoretical framework showing that all intrinsic reward methods converge toward sharpening the model's initial distribution rather than discovering new knowledge.
  • Extensive experiments reveal that intrinsic URLVR consistently follows a rise-then-fall pattern where performance collapses when the model's initial confidence misaligns with correctness, regardless of specific engineering choices.
  • The authors propose the Model Collapse Step metric to predict RL trainability and demonstrate that intrinsic rewards remain effective for test-time training on small datasets, while external rewards grounded in computational asymmetries offer a potential path to escape these scaling limits.

Introduction

Large language models currently rely on reinforcement learning with verifiable rewards to enhance reasoning, but this approach faces a critical bottleneck as obtaining human-verified ground truth labels becomes prohibitively expensive and infeasible for superintelligent systems. Unsupervised RLVR aims to solve this by deriving rewards without labels, yet prior work relying on intrinsic model signals suffers from a fundamental rise-then-fall pattern where training initially improves performance before collapsing due to reward hacking and model degradation. The authors provide a comprehensive theoretical and empirical analysis revealing that intrinsic methods merely sharpen the model's initial distribution, which fails when confidence misaligns with correctness, while proposing the Model Collapse Step as a practical metric to predict trainability and advocating for external reward methods that leverage computational asymmetries to achieve scalable, stable improvement.

Dataset

  • The training dataset consists of $M$ prompt-answer pairs, where each entry includes a prompt $x_i$ and its corresponding ground-truth answer $a_i^*$.
  • For every prompt in the set, the authors generate $N$ rollout responses using the current policy $\pi_\theta$.
  • Each generated response contains a full reasoning trajectory and an extracted answer derived from that trajectory.
  • The data serves as the foundation for training, where the model learns from the relationship between the initial prompts, the generated trajectories, and the verified ground-truth answers.
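
The rollout structure described above can be sketched as a simple container; the field and class names below are illustrative assumptions, not identifiers from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    trajectory: str       # full reasoning trace sampled from the policy
    answer: str           # answer extracted from the trajectory

@dataclass
class TrainingExample:
    prompt: str           # x_i
    ground_truth: str     # a_i^* (unused by unsupervised reward schemes)
    rollouts: list        # N responses sampled from the current policy

def make_example(prompt, ground_truth, sampled):
    """Bundle N sampled (trajectory, answer) pairs for one prompt."""
    return TrainingExample(
        prompt=prompt,
        ground_truth=ground_truth,
        rollouts=[Rollout(t, a) for t, a in sampled],
    )
```

Note that unsupervised reward schemes see only the prompt and rollouts; the ground-truth answer is retained solely for evaluation.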

Method

The authors leverage a framework for Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR) that eliminates the need for ground-truth labels by utilizing proxy intrinsic rewards generated solely by the model. This approach distinguishes between two primary paradigms for constructing rewards: Certainty-Based and Ensemble-Based methods.

Certainty-Based rewards derive signals from the policy's confidence, such as logits or entropy, operating on the assumption that higher confidence correlates with correctness. These methods include estimators like Self-Certainty, which measures the KL divergence from a uniform distribution, and Token-Level Entropy, which penalizes uncertainty at each generation step. Conversely, Ensemble-Based rewards leverage the wisdom of the crowd by generating multiple rollouts for the same prompt. They assume that consistency across these diverse candidate solutions, often formalized through majority voting or semantic clustering, serves as a robust proxy for correctness.
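
The two reward families can be sketched as minimal estimators. Self-Certainty is taken as the KL divergence from the uniform distribution, which equals log V minus the entropy of the token distribution; the ensemble reward uses a plurality vote over extracted answers. These are illustrative sketches, and the paper's exact formulations may differ.

```python
import math
from collections import Counter

def self_certainty(token_probs):
    """Self-Certainty: KL(p || uniform) averaged over generated tokens.
    KL(p || U) = log V - H(p), so it is large when p is peaked."""
    scores = []
    for p in token_probs:              # p: probability dist. over vocab at one step
        V = len(p)
        entropy = -sum(q * math.log(q) for q in p if q > 0)
        scores.append(math.log(V) - entropy)
    return sum(scores) / len(scores)

def entropy_penalty(token_probs):
    """Token-Level Entropy reward: negative mean entropy (penalizes uncertainty)."""
    ents = [-sum(q * math.log(q) for q in p if q > 0) for p in token_probs]
    return -sum(ents) / len(ents)

def majority_vote_rewards(answers):
    """Ensemble-Based reward: 1 for rollouts agreeing with the plurality answer."""
    winner, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == winner else 0.0 for a in answers]
```

Both certainty estimators increase as the model grows more confident, which is exactly why they can only sharpen what the model already believes.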

The underlying mechanism driving these methods is a sharpening process where the model converges towards its initial distribution. Theoretically, the training dynamics follow a KL-regularized RL objective. The optimal policy for this objective has the closed form:

$$\pi_\theta^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

where $Z(x)$ is the partition function and $\beta$ controls regularization strength. For intrinsic rewards like majority voting, this creates a "rich-get-richer" dynamic. If the model's initial confidence aligns with correctness, the sharpening mechanism amplifies these correct predictions, leading to performance gains. However, if the initial confidence is misaligned, the same mechanism systematically reinforces errors, leading to a gradual collapse in performance.
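
The sharpening dynamic follows directly from the closed form: plugging in a confidence-style intrinsic reward such as r(x, y) = log π_ref(y|x) (an illustrative choice, not the paper's specific estimator) gives π* ∝ π_ref^(1 + 1/β), which a short numerical sketch makes visible.

```python
def sharpen(pi_ref, beta):
    """Closed-form KL-regularized update with intrinsic reward r(y) = log pi_ref(y):
    pi*(y) ∝ pi_ref(y) * exp(log pi_ref(y) / beta) = pi_ref(y)^(1 + 1/beta)."""
    w = [p ** (1.0 + 1.0 / beta) for p in pi_ref]
    Z = sum(w)                      # partition function Z(x)
    return [x / Z for x in w]

# Iterating the update concentrates mass on the initial mode ("rich-get-richer"):
pi = [0.5, 0.3, 0.2]
for _ in range(3):
    pi = sharpen(pi, beta=1.0)
```

The update can only redistribute probability toward answers the reference policy already favored; it never introduces mass on answers the model initially considered unlikely, which is the formal content of the sharpening claim.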

To monitor this stability, the authors introduce the Model Collapse Step as an indicator of trainability. This metric tracks the training step where reward accuracy drops below a specific threshold, such as 1%. Models with stronger priors sustain intrinsic URLVR for longer periods before collapsing, allowing for efficient base model selection without the computational cost of full reinforcement learning training. This framework highlights the critical dependency of intrinsic URLVR success on the alignment between initial confidence and correctness.
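
A minimal sketch of how such an indicator could be computed from a logged reward-accuracy curve; the function name and interface are assumptions, not the authors' implementation.

```python
def model_collapse_step(reward_accuracy, threshold=0.01):
    """Return the first training step at which reward accuracy falls below
    `threshold` (e.g. 1%), or None if the run never collapses.
    A later collapse step indicates a stronger model prior and hence
    better suitability for intrinsic URLVR training."""
    for step, acc in enumerate(reward_accuracy):
        if acc < threshold:
            return step
    return None
```

Because the curve can be logged from a short probe run, candidate base models can be ranked without completing full RL training.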

Experiment

  • Intrinsic URLVR methods universally exhibit a rise-then-fall pattern where early gains from aligning confidence with correctness eventually collapse into reward hacking, regardless of hyperparameter tuning or specific reward design.
  • Fine-grained analysis reveals that training primarily amplifies the model's initial preferences rather than correcting errors on specific problems, yet this sharpening can still generalize to improve performance on unseen out-of-distribution tasks if initial confidence aligns with correctness.
  • Model collapse is prevented when training on small, domain-specific datasets or during test-time training, as these conditions induce localized overfitting rather than the systematic policy shifts that cause failure in large-scale training.
  • The "Model Collapse Step," measuring when reward accuracy drops during intrinsic training, serves as a rapid and accurate predictor of a model's potential for RL gains, outperforming static metrics like pass@k.
  • External reward methods leveraging generation-verification asymmetry, such as self-verification, offer a more scalable path than intrinsic rewards by providing signals grounded in computational procedures rather than the model's internal confidence.
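
A toy illustration of generation-verification asymmetry (the factoring task is our example, not one from the paper): finding the factors of a number is hard, but checking a claimed factorization is a cheap deterministic computation, so the reward is grounded in a procedure rather than in the model's internal confidence.

```python
def external_reward_factoring(n, claimed_factors):
    """Reward 1.0 iff the claimed non-trivial factors multiply back to n.
    Verification is O(len(factors)) regardless of how hard generation was,
    and the signal cannot be inflated by model overconfidence."""
    product = 1
    for f in claimed_factors:
        if f <= 1 or n % f != 0:    # reject trivial or invalid factors
            return 0.0
        product *= f
    return 1.0 if product == n else 0.0
```

Any task with such an asymmetric checker (unit tests, proof checkers, equation solvers) can supply an external reward of this kind, which is why these signals may escape the confidence-correctness ceiling of intrinsic rewards.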
