Diversity or Precision? A Deep Dive into Next-Token Prediction

Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang, Weile Chen, Zihao Zheng, Bei Yu

Abstract

Recent advances have shown that the reasoning capabilities of large language models (LLMs) can be substantially improved through reinforcement learning (RL). However, the effectiveness of such RL training depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss and interpret it as a specific case of policy-gradient optimization within a single-step episode. To systematically study how the pre-trained distribution affects the exploration potential of subsequent RL, we propose a generalized pre-training objective that integrates on-policy RL principles into supervised learning. By formulating next-token prediction as a stochastic decision process, we introduce reward shaping that explicitly balances diversity and precision. Our method uses a positive reward scaling factor to control how strongly probability mass concentrates on the correct tokens, together with a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and study how a more favorable exploration space for RL can be created, ultimately improving end-to-end reasoning performance. Contrary to the intuitive assumption that higher distribution entropy promotes effective exploration, we show that a precision-oriented prior yields a superior exploration space for RL.

One-sentence Summary

The authors from Tencent and The Chinese University of Hong Kong propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning, using reward shaping with positive scaling and rank-aware mechanisms to balance diversity and precision in token distributions, thereby creating a more favorable exploration space for RL and improving end-to-end reasoning performance, contrary to the conventional belief that higher entropy aids exploration.

Key Contributions

  • We reinterpret the standard cross-entropy loss as a single-step policy gradient optimization, enabling a systematic study of how the pre-trained model's token-output distribution shapes the exploration space for downstream reinforcement learning.

  • We propose a generalized pre-training objective that integrates on-policy RL principles into supervised learning, using a reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism to asymmetrically suppress high- and low-ranking negative tokens.

  • Experiments show that a precision-oriented prior—contrary to the common belief that high entropy aids exploration—leads to a more effective exploration space, significantly improving end-to-end reasoning performance in subsequent RL stages.

Introduction

The authors investigate the role of next-token prediction in shaping the behavior of large language models during reinforcement learning (RL), particularly in reasoning tasks. While prior work has treated pre-training as a supervised process separate from RL, the authors highlight that cross-entropy loss—commonly used in pre-training—can be interpreted as a form of on-policy policy gradient optimization, establishing a theoretical link between pre-training and RL. This insight reveals that the output distribution from pre-training, shaped by the reward structure of the loss, implicitly defines the model’s exploration space during subsequent RL, influencing which reasoning paths are pursued. However, traditional pre-training methods fix the reward structure to favor precision by concentrating probability mass on the ground-truth token, potentially limiting diversity and constraining exploration. The authors’ main contribution is a generalized pre-training objective that explicitly controls the trade-off between diversity and precision through reward shaping: they introduce a positive scaling factor to modulate the concentration of probability on the correct token and apply asymmetric suppression to negative tokens based on their rank. This approach allows systematic tuning of the pre-training policy, and counterintuitively, they find that a precision-oriented prior—rather than high entropy—leads to better RL exploration and improved end-to-end reasoning performance.

Method

The authors leverage a policy-gradient framework to reinterpret the standard cross-entropy loss in next-token prediction as a specific instance of reinforcement learning (RL) optimization within a single-step episode. This perspective enables a systematic investigation into how the pre-trained model's token-output distribution shapes the exploration space for subsequent RL training. The core idea is to treat the generation of each token as an independent decision-making process, where the language model acts as a stochastic policy $\pi_\theta$ that selects the next token from the vocabulary $V$ based on the current context $s_t = X_{<t}$. The training objective is formulated as maximizing the expected immediate reward, which simplifies to a single-step return $J_t(\theta \mid s_t) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}[r(s_t, a_t)]$. This formulation ensures that the reward depends solely on the immediate state-action pair, aligning with the policy gradient derivation for episodic tasks.

Within this framework, the standard cross-entropy objective is expressed as a policy gradient with an intrinsic reward function $r_{\text{CE}}(s_t, a_t) = \text{sg}\left( \frac{\mathbb{1}(a_t = x_t)}{\pi_\theta(a_t \mid s_t)} \right)$, where the stop-gradient operator $\text{sg}(\cdot)$ ensures that no gradient flows through the reward, so it acts as a constant at the current step. This reward structure implicitly balances diversity and precision: the ground-truth token receives a reward inversely proportional to its predicted probability, so poorly predicted targets are rewarded more strongly, while all negative tokens are assigned zero reward. The authors generalize this objective by introducing a reward-shaping strategy that explicitly controls the trade-off between diversity and precision through two mechanisms.
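
To make the equivalence concrete, the following minimal PyTorch sketch (our own illustration, not the authors' code; all variable names are assumptions) checks numerically that the single-step policy-gradient surrogate with the stop-gradient reward $r_{\text{CE}}$ produces exactly the same gradient as the standard cross-entropy loss:

```python
# Sketch: cross-entropy as a single-step policy gradient with reward
# r_CE = sg(1(a_t = x_t) / pi_theta(a_t | s_t)). Illustrative only.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, target = 8, 3
logits = torch.randn(vocab_size, requires_grad=True)

# Standard cross-entropy on the ground-truth token x_t.
ce_loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target]))
(g_ce,) = torch.autograd.grad(ce_loss, logits)

# Policy-gradient surrogate: minimize -E_{a ~ pi}[r_CE(s_t, a)], with the
# reward held constant via stop-gradient (detach in PyTorch).
probs = F.softmax(logits, dim=-1)
reward = torch.zeros(vocab_size)
reward[target] = (1.0 / probs[target]).detach()   # sg(1 / pi(x_t | s_t))
surrogate = -(probs * reward).sum()
(g_pg,) = torch.autograd.grad(surrogate, logits)

print(torch.allclose(g_ce, g_pg, atol=1e-6))      # True: identical gradients
```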

First, a positive reward scaling factor is introduced to modulate the influence of the ground-truth token. The modified positive reward is defined as $\bar{r}_{\text{pos}}(s_t, a_t) = \text{sg}\left( \left( \frac{1}{\pi_\theta(a_t \mid s_t)} \right)^{(1 - \pi_\theta(a_t \mid s_t))^\beta} \right)$, where $\beta$ controls the global entropy of the distribution. When $\beta < 0$, the reward is amplified, leading to aggressive concentration of probability mass on the ground truth and reduced entropy. Conversely, $\beta > 0$ attenuates the reward, allowing the model to maintain a flatter distribution with higher entropy. This mechanism enables fine-grained control over the global exploration potential during pre-training.
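
A short sketch of this shaping term (our notation; `positive_reward` is a hypothetical helper, not from the paper's code) shows how $\beta$ bends the plain cross-entropy reward $1/\pi_\theta$:

```python
# Beta-shaped positive reward r_pos = sg((1/p)^((1 - p)^beta)) for the
# ground-truth probability p. beta < 0 amplifies the reward (sharper,
# lower-entropy distributions); beta > 0 attenuates it; beta = 0
# recovers the plain cross-entropy reward 1/p.
import torch

def positive_reward(p_gt: torch.Tensor, beta: float) -> torch.Tensor:
    exponent = (1.0 - p_gt).clamp_min(1e-6) ** beta             # clamp avoids 0^beta for beta < 0
    return ((1.0 / p_gt.clamp_min(1e-6)) ** exponent).detach()  # detach plays the role of sg(.)

p = torch.tensor([0.1, 0.5, 0.9])
for beta in (-0.25, 0.0, 0.25):
    print(beta, positive_reward(p, beta))
```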

Second, the authors propose a rank-aware negative reward mechanism to regulate local entropy. Let $\mathcal{K}_t$ denote the set of top-$k$ predicted tokens. Negative rewards are assigned asymmetrically: high-ranking negative tokens (those in $\mathcal{K}_t$ but not the ground truth) receive a reward $\tilde{\lambda}$, while low-ranking negative tokens (outside $\mathcal{K}_t$) receive a reward $\hat{\lambda}$. This design prevents overconfidence in the ground truth by preserving probability mass for plausible alternatives, while simultaneously suppressing the tail of the distribution to encourage concentration on the most likely candidates. The generalized reward function combines these components as $\bar{r}(s_t, a_t) = \bar{r}_{\text{pos}}(s_t, a_t) \cdot \mathbb{1}(a_t = x_t) + \bar{r}_{\text{neg}}(s_t, a_t) \cdot \mathbb{1}(a_t \neq x_t)$, where $\bar{r}_{\text{neg}}(s_t, a_t) = \tilde{\lambda} \cdot \mathbb{1}(a_t \in \mathcal{K}_t \land a_t \neq x_t) + \hat{\lambda} \cdot \mathbb{1}(a_t \notin \mathcal{K}_t \land a_t \neq x_t)$.
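
The sketch below (again our own assembly under assumed shapes and names, not the authors' implementation) combines the $\beta$-shaped positive reward with the rank-aware negative rewards into a per-position reward vector and the corresponding single-step surrogate loss:

```python
# Rank-aware reward over the vocabulary for one position: the ground-truth
# token gets the beta-shaped positive reward, top-k negatives get
# lambda_tilde, and tail negatives get lambda_hat.
import torch

def shaped_rewards(probs: torch.Tensor, target: int, k: int,
                   beta: float, lam_tilde: float, lam_hat: float) -> torch.Tensor:
    with torch.no_grad():                                  # sg(.): rewards carry no gradient
        rewards = torch.full_like(probs, lam_hat)          # default: low-ranking negatives
        rewards[torch.topk(probs, k).indices] = lam_tilde  # high-ranking negatives
        p_gt = probs[target].clamp_min(1e-6)
        rewards[target] = (1.0 / p_gt) ** ((1.0 - p_gt).clamp_min(1e-6) ** beta)
    return rewards

logits = torch.randn(32, requires_grad=True)
probs = torch.softmax(logits, dim=-1)
r = shaped_rewards(probs, target=5, k=8, beta=-0.25, lam_tilde=0.0, lam_hat=-0.1)
loss = -(probs * r).sum()    # single-step policy-gradient surrogate loss
loss.backward()              # gradients flow only through probs, not the rewards
```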

This reward-shaping strategy is embedded within a supervised learning objective, effectively adapting on-policy RL principles to pre-training. The resulting generalized loss function allows for the systematic exploration of how reshaping the token-output distribution during pre-training influences the subsequent RL stage. The authors demonstrate that precision-oriented priors—characterized by reduced global entropy and suppression of low-probability tokens—yield a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. The framework is compatible with architectures that perform iterative internal computation before token emission, such as latent-reasoning models and loop transformers, and can serve as an uncertainty-aware learning signal to guide adaptive computation policies.
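
As a final illustration, a batched version of this objective (a sketch under our assumptions about tensor shapes; `generalized_ntp_loss` is a hypothetical name) can be dropped in wherever standard cross-entropy is used during pre-training; setting $\beta = 0$ and $\tilde{\lambda} = \hat{\lambda} = 0$ recovers the cross-entropy gradient, as verified above.

```python
# Generalized next-token-prediction loss over a batch: expected shaped
# reward under the model's own distribution, with all rewards treated
# as constants (the stop-gradient in the paper's formulation).
import torch
import torch.nn.functional as F

def generalized_ntp_loss(logits: torch.Tensor,        # [B, T, V]
                         targets: torch.Tensor,       # [B, T], int64 token ids
                         k: int = 8, beta: float = -0.25,
                         lam_tilde: float = 0.0, lam_hat: float = -0.1) -> torch.Tensor:
    probs = F.softmax(logits, dim=-1)
    with torch.no_grad():                              # rewards are constants (sg)
        rewards = torch.full_like(probs, lam_hat)      # low-ranking negatives
        topk = probs.topk(k, dim=-1).indices
        rewards.scatter_(-1, topk, lam_tilde)          # high-ranking negatives
        p_gt = probs.gather(-1, targets.unsqueeze(-1)).clamp_min(1e-6)
        r_pos = (1.0 / p_gt) ** ((1.0 - p_gt).clamp_min(1e-6) ** beta)
        rewards.scatter_(-1, targets.unsqueeze(-1), r_pos)  # ground-truth tokens
    return -(probs * rewards).sum(-1).mean()           # minimize negative expected reward

# Example usage: loss = generalized_ntp_loss(model_logits, token_ids)
```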

Experiment

  • Pre-training evaluation validates the effectiveness of the generalized training objective in balancing diversity and precision across dense and MoE architectures, with lower perplexity and stable convergence observed across models.
  • Mid-training experiments demonstrate that setting β = -0.25 consistently improves performance on knowledge and reasoning tasks over the baseline, while local entropy control via λ̂ = -0.1, λ̃ = 0 enhances scaling behavior.
  • Reinforcement learning results show that global low-entropy settings (β = -0.25) and local high-entropy configurations (λ̂ = -0.1, λ̃ = 0) yield superior performance on mathematics and coding benchmarks, with higher Avg@128, Cons@128, and Pass@64 scores.
  • Pass@k analysis confirms that prioritizing precision, rather than global diversity, leads to higher upper-bound performance in mathematics and code generation, with stable output diversity maintained.
  • On MATH-500 and OlympiadBench, the 10B-A0.5B MoE model with β = -0.25 achieved Pass@64 of 68.4 and 52.1, surpassing the baseline by 12.3 and 9.7 points respectively.
  • On HumanEval+, the 10B-A0.5B MoE model with λ̂ = -0.1, λ̃ = 0 achieved Pass@64 of 74.2, outperforming the baseline by 8.5 points.

The authors use a range of model architectures, including dense and MoE models, to evaluate the impact of different reward configurations on training dynamics and performance. Results show that configurations promoting precision, such as setting β = -0.25 or using local entropy control with λ̂ = -0.1, consistently yield superior performance across pre-training, mid-training, and reinforcement learning stages.

The authors use a range of benchmarks to evaluate model performance across general knowledge, commonsense reasoning, logic reasoning, mathematics, and coding tasks. Results show that the configuration with β = -0.25 consistently achieves the highest scores across most evaluation metrics, particularly in mathematics and coding, indicating that promoting precision through lower global entropy leads to superior performance.

The authors use a range of benchmarks to evaluate base models across general knowledge, commonsense reasoning, logic reasoning, mathematics, and coding. Results show that the configuration with β = 0, λ̃ = 0, and λ̂ = 0 consistently achieves the highest performance across most tasks, particularly in mathematics and coding, where it attains the best Pass@64 scores.

The authors use the Pass@64 metric to track a 4B dense model during reinforcement learning training on mathematical reasoning tasks. The reported per-benchmark scores move from 18.31 at 100 RL steps to 16.58 at 1000 steps, while the average Pass@64 across all benchmarks reaches 49.59 at the final step, demonstrating significant progress in reasoning performance over the course of training.

The authors use a range of benchmarks to evaluate model performance across general knowledge, commonsense reasoning, logic reasoning, mathematics, and coding. Results show that configurations with lower global entropy (β = -0.25) consistently achieve higher performance across most tasks, particularly in mathematics and coding, while maintaining sufficient output diversity. The data further indicate that strategies promoting precision, either globally or locally, lead to better performance and scaling behavior, especially in larger models.
