
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
Published: 6/3/2025
Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
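
The core mechanism described in the abstract (compute per-token entropy of the policy distribution, keep only the top ~20% highest-entropy "forking" tokens, and restrict the policy-gradient loss to those positions) can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation; the 0.2 keep ratio and the term "forking tokens" come from the abstract, while the function name, tensor shapes, and advantage convention are illustrative assumptions.

```python
# Minimal sketch, assuming a REINFORCE/PPO-style per-token surrogate loss.
import torch
import torch.nn.functional as F

def forking_token_pg_loss(logits, actions, advantages, mask, keep_ratio=0.2):
    """Policy-gradient loss restricted to high-entropy minority tokens.

    logits:     (B, T, V) policy logits over the vocabulary
    actions:    (B, T)    sampled token ids
    advantages: (B, T)    per-token advantages (hypothetical, e.g. from GRPO)
    mask:       (B, T)    1 for valid response tokens, 0 for padding
    """
    log_probs = F.log_softmax(logits, dim=-1)           # (B, T, V)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)          # (B, T) token entropy

    # Keep only the top `keep_ratio` highest-entropy valid tokens in the batch.
    flat_entropy = entropy[mask.bool()]
    k = max(1, int(keep_ratio * flat_entropy.numel()))
    threshold = torch.topk(flat_entropy, k).values.min()
    forking_mask = ((entropy >= threshold) & mask.bool()).float()

    # Standard per-token policy-gradient surrogate.
    token_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * token_log_probs)

    # Gradients flow only through the high-entropy "forking" tokens.
    return (pg_loss * forking_mask).sum() / forking_mask.sum().clamp(min=1.0)
```

Masking the loss rather than the sampling means rollouts are generated as usual; only the update is restricted, which matches the abstract's claim that the remaining ~80% low-entropy tokens contribute little to (and can even hurt) RLVR performance.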