On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models
Shumin Wang, Yuexiang Xie, Wenhao Zhang, Yuchang Sun, Yanxi Chen, Yaliang Li, Yanyong Zhang
Abstract
Entropy is a crucial metric for measuring the diversity of outputs produced by large language models (LLMs), offering valuable insight into their exploration capabilities. While recent studies increasingly focus on monitoring and adjusting entropy to optimize the exploration-exploitation trade-off in reinforcement fine-tuning (RFT), a rigorous theoretical understanding of entropy dynamics during this process remains underexplored. In this paper, we establish a theoretical framework for analyzing entropy dynamics during RFT, starting from a discriminant expression that quantifies the change in entropy under a single logit update. This foundation yields a first-order expression for the entropy change, which can be extended to the update rule of the Group Relative Policy Optimization (GRPO) algorithm. The corollaries and insights drawn from this theoretical analysis inspire the design of entropy control methods and provide a unified framework for interpreting various entropy-based approaches in existing work. We provide empirical evidence supporting the main conclusions of our analysis and demonstrate the effectiveness of clipping methods based on the entropy discriminant. This study offers new insights into the training dynamics of RFT, providing theoretical support and practical strategies for optimizing the exploration-exploitation trade-off when fine-tuning LLMs.
One-sentence Summary
Shumin Wang, Yanyong Zhang (Tsinghua), and collaborators propose a theoretical framework for entropy dynamics in LLM reinforcement fine-tuning, deriving a first-order entropy-update formula that extends to GRPO and enables new, empirically validated entropy control methods for balancing exploration and exploitation.
Key Contributions
- We introduce a theoretical framework that quantifies entropy change at the token level during reinforcement fine-tuning, deriving a first-order expression extendable to Group Relative Policy Optimization (GRPO) and revealing that entropy dynamics depend on the token update direction and a discriminator score $S^*$.
- Our analysis provides a unified interpretation of existing entropy-based methods and inspires new entropy control strategies, including clipping techniques grounded in the discriminant $S^*$, offering principled guidance for balancing exploration and exploitation.
- Empirical results validate our theoretical predictions, demonstrating that $S^*$ reliably indicates entropy trends and that our clipping methods effectively stabilize entropy during RFT, improving model exploration without compromising performance.
Introduction
The authors leverage entropy as a diagnostic tool to understand and control the exploration-exploitation trade-off during reinforcement fine-tuning (RFT) of large language models. While prior entropy-based methods often rely on heuristics and lack theoretical grounding, leading to inconsistent strategies and costly hyperparameter tuning, the authors derive a principled framework that quantifies how single-token logit updates propagate to entropy changes. They extend this to GRPO, revealing that entropy dynamics depend on the interplay between token probability, update direction, and policy entropy, which explains the entropy collapse commonly observed in practice. Their framework enables practical entropy clipping strategies and unifies the interpretation of existing entropy-based techniques.

Dataset

- The authors use DAPO-Math-17k (Yu et al., 2025) as the primary training dataset, selecting 17,000 math problems for fine-tuning Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct models.
- From DAPO-Math-17k, they reserve 500 questions as a validation set (DAPO500), following prior work (Lightman et al., 2023).
- They filter training samples by excluding those with pass rates ≤ 1/16 or ≥ 15/16 when evaluated by Qwen2.5-7B-Instruct, ensuring moderate difficulty for effective training.
- For testing, they use AIME24 and AIME25 — two challenging math datasets — and evaluate using Avg@32 and Pass@32 metrics.
- For DAPO500 validation, they use Avg@8 and Pass@8 metrics, where Avg@K is the mean accuracy across K responses per question, and Pass@K is the probability that at least one of K responses is correct.
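As a concrete reference for these metrics, here is a minimal sketch (our own illustration, not the authors' evaluation code) of Avg@K and Pass@K computed from a binary correctness matrix:

```python
import numpy as np

def avg_at_k(correct: np.ndarray) -> float:
    """Mean accuracy over K sampled responses per question.

    `correct` has shape (num_questions, K), with 1 for a correct response and 0 otherwise.
    """
    return float(correct.mean())

def pass_at_k(correct: np.ndarray) -> float:
    """Fraction of questions for which at least one of the K responses is correct."""
    return float(correct.any(axis=1).mean())

# Toy usage: 3 questions, K = 4 responses each.
correct = np.array([[0, 1, 0, 0],
                    [0, 0, 0, 0],
                    [1, 1, 1, 0]])
print(avg_at_k(correct))   # 0.333... (4 correct responses out of 12)
print(pass_at_k(correct))  # 0.666... (2 of 3 questions solved at least once)
```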
Method
The authors develop a theoretical framework to characterize token-level entropy dynamics during policy optimization in Reinforcement Fine-Tuning (RFT), with a focus on Group Relative Policy Optimization (GRPO). Their analysis begins at the microscopic level, examining how a single token update alters the entropy of the next-token distribution, and extends to the full GRPO optimization step, enabling principled control over entropy evolution during training.
At the core of their method is the entropy discriminator score $S^*_t$, defined for a token $a_k$ sampled at position $t$ as $S^*_t = p^t_k\,(H_t + \log p^t_k)$, where $p^t_k$ is the token's probability under the current policy and $H_t$ is the entropy of the full next-token distribution at that step. This score serves as a first-order predictor of entropy change: a positive update (reward) to a token increases entropy if $S^*_t < 0$ (i.e., the token is relatively low-probability) and decreases entropy if $S^*_t > 0$ (i.e., the token is high-probability). This relationship is derived from a Taylor expansion of the entropy under a logit perturbation $\delta z = \varepsilon \cdot e_k$, yielding $\Delta H = -\varepsilon S^* + O(\varepsilon^2)$.
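This first-order relation can be checked numerically. The sketch below (our own illustration; the function names and toy logits are assumptions, not the authors' code) bumps a single logit by a small $\varepsilon$ and compares the exact entropy change against $-\varepsilon S^*_t$:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=8)           # logits of a toy next-token distribution
p = softmax(z)
H = entropy(p)

k, eps = 2, 1e-3                 # bump the logit of token k by a small epsilon
S_star = p[k] * (H + np.log(p[k]))

z_new = z.copy()
z_new[k] += eps
dH = entropy(softmax(z_new)) - H

print(dH, -eps * S_star)         # the two agree up to O(eps^2)
```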
Extending this to GRPO, the authors model the entropy change induced by a full optimization step. Under GRPO, each token's update is governed by a surrogate loss $L(z) = r \cdot A \cdot \log p_k(z)$, where $r$ is the importance ratio and $A$ is the advantage. A gradient step with learning rate $\eta$ induces a logit update $\delta z = \alpha (e_k - p)$, where $\alpha = \eta r A$. Substituting this into the entropy gradient yields the key result: the first-order entropy change is $\Delta H = -\alpha\,(S^* - \mathbb{E}_{i \sim p}[S_i]) + O(\alpha^2)$. This reveals that the entropy change is not determined by $S^*$ alone, but by its deviation from the policy-weighted expectation $\mathbb{E}_{i \sim p}[S_i]$, which acts as a dynamic baseline. This baseline ensures that, under on-policy sampling, the expected entropy change across the vocabulary or batch is zero, a property formalized in Corollaries 3.4 and 3.5.
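Analogously, the GRPO-level formula can be verified by applying the induced logit update $\delta z = \alpha(e_k - p)$ directly. The sketch below (again a toy check of our own, with an assumed vocabulary size and step size) compares the exact entropy change to the first-order prediction $-\alpha(S^* - \mathbb{E}_{i\sim p}[S_i])$:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
z = rng.normal(size=8)
p = softmax(z)
H = entropy(p)

S = p * (H + np.log(p))          # per-token scores S_i = p_i (H + log p_i)
baseline = np.sum(p * S)         # E_{i~p}[S_i], the policy-weighted baseline

k = 3                            # sampled token
alpha = 1e-3                     # alpha = eta * r * A for a single GRPO step
e_k = np.eye(len(p))[k]
z_new = z + alpha * (e_k - p)    # logit update induced by the surrogate loss

dH = entropy(softmax(z_new)) - H
print(dH, -alpha * (S[k] - baseline))   # agree up to O(alpha^2)
```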
Building on this, the authors propose two clipping methods to stabilize entropy during training. The first, Clip_B, operates at the batch level: for each token $t$ in a batch $T_B$, it computes the batch mean $\bar{S}$ and standard deviation $\sigma$ of $S^*_t$, then applies a mask $m_t = \mathbb{1}\{-\mu - \sigma \le S^*_t - \bar{S} \le \mu + \sigma\}$ to filter out outlier tokens that drive extreme entropy fluctuations. The second, Clip_V, operates at the vocabulary level: for each token, it computes the centered score $S^c_t = S^*_t - \mathbb{E}_{i \sim p_t}[S^t_i]$, then applies a mask based on the batch standard deviation of these centered scores. Both methods require minimal computation, operating on scalar values, and can be seamlessly integrated into existing RFT pipelines.
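A minimal sketch of the batch-level mask is given below, assuming the band $\bar{S} \pm (\mu + \sigma)$ described above; the function name and toy scores are illustrative, and Clip_V would follow the same pattern using the vocabulary-centered scores $S^c_t$:

```python
import numpy as np

def clip_b_mask(S_star: np.ndarray, mu: float) -> np.ndarray:
    """Batch-level mask (Clip_B sketch): keep tokens whose discriminator score
    stays within the band S_bar +/- (mu + sigma); tokens outside the band get a
    zero mask and are excluded from the policy-gradient update.

    `S_star` holds S*_t for every token t in the batch; `mu` is the clipping
    hyperparameter.
    """
    S_bar = S_star.mean()
    sigma = S_star.std()
    deviation = S_star - S_bar
    return (np.abs(deviation) <= mu + sigma).astype(np.float32)

# Toy usage: mask a handful of token scores with mu = 0.1.
S_star = np.array([-0.02, 0.01, 0.35, -0.40, 0.03])
print(clip_b_mask(S_star, mu=0.1))   # [1. 1. 0. 0. 1.], outliers are dropped
```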
The authors further demonstrate that existing entropy control methods—such as clipping mechanisms, entropy regularization, and probability-weighted updating—can be interpreted through the lens of their entropy dynamics framework. For instance, clipping in GRPO predominantly affects low-probability tokens, which tend to have $S^* - \mathbb{E}[S_i] < 0$; thus, clipping positive samples (which reward these tokens) tends to increase entropy, while clipping negative samples (which penalize them) tends to decrease it. Similarly, entropy regularization methods that update only high-entropy tokens implicitly target tokens with $S^* - \mathbb{E}[S_i] > 0$, whose updates on positive samples decrease entropy. This unified perspective allows the authors to explain why certain methods promote exploration (by amplifying entropy-increasing updates) while others suppress it (by amplifying entropy-decreasing updates).
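The sign argument can be illustrated with a toy peaked distribution (our own example, not taken from the paper): low-probability tokens fall below the policy-weighted baseline, while the dominant token sits above it.

```python
import numpy as np

# A toy peaked next-token distribution (illustrative only).
p = np.array([0.70, 0.15, 0.08, 0.04, 0.02, 0.01])
H = -np.sum(p * np.log(p))

S = p * (H + np.log(p))          # per-token discriminator scores
baseline = np.sum(p * S)         # E_{i~p}[S_i]

for prob, dev in zip(p, S - baseline):
    print(f"p = {prob:.2f}  S - E[S] = {dev:+.3f}")
# Only the dominant token (p = 0.70) has a positive deviation; rewarding it
# lowers entropy, while rewarding any of the low-probability tokens raises it.
```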
Finally, the authors extend their analysis to off-policy settings, showing that the same entropy dynamics hold when incorporating the importance ratio $r$, with the entropy change factor becoming $r\,(S^* - \mathbb{E}_{i \sim p}[S_i])$. They also derive batch-level covariance expressions (Corollaries C.2 and C.2.1) that link entropy change to the covariance between the advantage and the discriminator score deviation, providing a computable metric for monitoring entropy collapse during training. Their empirical results confirm that this covariance is predominantly negative, indicating that models tend to reinforce “safe” high-probability tokens, thereby suppressing exploration—a dynamic their clipping methods are designed to counteract.
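This monitoring signal is cheap to compute from quantities already available in an RFT step. The sketch below is a hypothetical helper (the exact constants and sign conventions of Corollaries C.2 and C.2.1 are not reproduced here); it computes the per-batch covariance between advantages and centered discriminator scores.

```python
import numpy as np

def entropy_trend_signal(advantages: np.ndarray, centered_scores: np.ndarray) -> float:
    """Covariance between per-token advantages A_t and the centered discriminator
    scores S*_t - E_{i~p_t}[S_i], taken over all tokens in a batch.

    Per the paper's batch-level corollaries (as summarized above), this covariance
    tracks the first-order entropy trend and can be logged each step as an early
    warning of entropy collapse.
    """
    return float(np.cov(advantages, centered_scores)[0, 1])

# Toy usage with hypothetical per-token values from one batch.
A  = np.array([1.0, 1.0, -0.5, -0.5, 1.0])
Sc = np.array([0.12, 0.30, -0.05, -0.20, 0.08])
print(entropy_trend_signal(A, Sc))
```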
Experiment
- Empirical tests confirm that discriminator scores reliably predict entropy changes: tokens with positive scores reduce entropy when updated on positive samples and increase it on negative samples, with the opposite holding for tokens with negative scores, validating the theoretical claims.
- Gradient masking experiments further support this relationship, showing entropy increases when entropy-reducing gradients are masked and decreases when entropy-increasing gradients are masked.
- Clipping methods (Clip_B and Clip_V) effectively control entropy decay during training, allowing flexible adjustment via hyperparameter μ and preventing excessive entropy collapse.
- Models trained with clipping methods outperform standard GRPO across datasets, preserving exploration and improving overall performance.
- Analysis of Pass@K and Avg@K metrics reveals that clipping enhances both solution diversity (exploration) and pattern exploitation, broadening the range of solvable problems.
- Distribution of problem pass rates shows clipping encourages balanced exploration, reducing extreme solve/fail outcomes and promoting moderate success across varied problems.
- Experiments with PPO and multiple model architectures (Qwen3, Distilled-Llama, InternLM) confirm the generalizability of clipping methods in stabilizing training and improving performance.
- For InternLM, clipping prevents training collapse and stabilizes gradients, highlighting its role in filtering outlier tokens and enhancing training robustness.
The authors use entropy-based clipping methods to selectively control token updates during reinforcement fine-tuning, which stabilizes entropy and prevents its collapse. Results show that both Clip_B and Clip_V consistently outperform standard GRPO across multiple datasets and model sizes, particularly enhancing exploration as measured by Pass@K. These gains stem from encouraging broader solution diversity rather than over-relying on high-reward patterns, leading to more robust and stable training dynamics, and they persist across diverse architectures, including Qwen3, Distilled-Llama, and InternLM, confirming the generalizability of the approach.

The authors apply entropy control methods to PPO training and observe consistent performance gains across multiple datasets. Results show that both Clip_B and Clip_V outperform Vanilla PPO, with Clip_V achieving the highest scores on AIME24 and DAPO500. These improvements suggest that regulating token-level entropy during training enhances model exploration and overall effectiveness.
