DSDR: Dual-Scale Diversity Regularization for Exploring Reasoning in LLMs
Abstract
Reinforcement learning with verifiers (RLVR) is a central paradigm for improving reasoning in large language models (LLMs), but existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and stop deep exploration early, while conventional entropy regularization injects only local randomness and does not encourage useful trajectory-level diversity, yielding weak and unstable learning signals in group-based policy optimization. We propose DSDR, a reinforcement learning framework based on dual-scale diversity regularization, which decomposes diversity in LLM reasoning into global and local components. Globally, DSDR promotes diversity among correct reasoning trajectories, aiming to explore distinct solution patterns. Locally, it applies length-invariant token-level entropy regularization restricted to correct trajectories only, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for the most distinctive correct trajectories. We provide theoretical support showing that DSDR preserves correctness optimality under the regularization constraints, maintains informative learning signals in group-based optimization, and yields a principled rule for the global-to-local coupling. Experiments on several reasoning benchmarks show consistent improvements in accuracy and pass@k, underscoring the importance of dual-scale diversity for deep exploration in RLVR. Code is available at: https://github.com/SUSTechBruce/DSDR.
One-sentence Summary
Zhongwei Wan, Yun Shen, and colleagues from multiple institutions propose DSDR, a dual-scale diversity regularization method enhancing exploration in LLM reasoning by balancing local and global diversity, outperforming prior techniques in complex reasoning tasks without requiring architectural changes.
Key Contributions
- DSDR introduces a dual-scale exploration framework for LLM reasoning that separates global diversity (across distinct correct solution paths) from local diversity (within each path), addressing the collapse into homogeneous reasoning patterns that plagues existing RLVR methods.
- The method couples these scales via a global-to-local allocation mechanism that applies length-invariant, correctness-restricted token-level entropy only to more distinctive correct trajectories, preserving solution quality while preventing intra-mode entropy collapse.
- Theoretical analysis confirms DSDR maintains optimal correctness and sustains informative learning signals in group-based RLVR, with empirical validation across reasoning benchmarks showing consistent gains in accuracy and pass@k.
Introduction
The authors leverage reinforcement learning with verifiers (RLVR) to enhance LLM reasoning but observe that existing methods suffer from poor exploration, collapsing into a few homogeneous reasoning patterns that hurt pass@k performance and generalization. Prior entropy-based or diversity-focused techniques either inject only local randomness or treat global and local diversity in isolation, failing to sustain meaningful, correctness-aligned exploration across scales. Their main contribution is DSDR, a dual-scale framework that jointly optimizes global diversity among correct trajectories and local token-level entropy regularization—restricted to correct paths and allocated adaptively based on trajectory distinctiveness—thereby preserving correctness while strengthening learning signals and enabling deeper, more stable exploration.
Method
The authors leverage a dual-scale diversity regularization framework, DSDR, built atop Group Relative Policy Optimization (GRPO), to enhance exploration and prevent collapse in reinforcement learning with verifiable rewards (RLVR) for reasoning tasks. The core innovation lies in coupling global trajectory-level diversity with local token-level entropy regularization, ensuring that exploration is both broad across distinct solution modes and fine-grained within high-value paths.
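The group-based setup that DSDR builds on can be sketched minimally. With binary verifier rewards, a GRPO-style group-relative advantage degenerates whenever all rollouts in a group agree — exactly the failure mode DSDR's diversity-shaped rewards are designed to avoid. This is a simplified sketch of the standard normalization, not the paper's implementation:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: each rollout's reward is
    normalized against the mean and std of its own sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Mixed group: informative signal.
print(group_relative_advantages([1, 1, 0, 0]))   # [ 1.  1. -1. -1.]

# All-correct group with purely binary rewards: advantages collapse to
# zero, so no learning signal remains -- DSDR's diversity bonus on
# correct rollouts keeps these advantages non-degenerate.
print(group_relative_advantages([1, 1, 1, 1]))
```

Shaping the rewards of correct rollouts before this normalization is what keeps the advantage vector informative even when every sample verifies.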
The overall training protocol follows a group-based sampling scheme: for each problem input, the policy model generates a group of candidate rollouts, which are then evaluated by a verifier to yield binary rewards. DSDR augments this process with two complementary diversity signals. At the global scale, it computes a bounded diversity score for each rollout by combining semantic dissimilarity—derived from frozen text embeddings—and formula-level uniqueness, which captures distinct symbolic manipulations. This global diversity score is then used to shape rewards exclusively for correct rollouts, preventing reward hacking and preserving the correctness objective while introducing controlled dispersion among valid solutions. This mechanism ensures that even when all sampled trajectories are correct, the group-relative advantages remain non-degenerate, maintaining a meaningful learning signal.
Refer to the framework diagram, which illustrates how global diversity scores are computed from multiple rollout trajectories and then fed into a global-to-local coupling module. This module allocates local regularization strength via a diversity-weighted softmax over correct rollouts. The temperature parameter in the softmax controls the concentration of exploration: higher values focus regularization on the most globally distinctive correct solutions, while lower values distribute it more uniformly. This coupling ensures that local entropy expansion is applied where it is most beneficial—around trajectories that are already rare within the group—thereby populating underexplored regions of the correct solution manifold.
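The allocation step can be sketched as a masked softmax. Following the description above, a larger temperature concentrates weight on the most distinctive correct rollouts (so the sketch multiplies scores by the temperature rather than dividing, matching that convention); the specific parameterization is an assumption, not the paper's formula:

```python
import numpy as np

def allocate_local_weights(div_scores, correct_mask, tau=1.0):
    """Sketch of global-to-local coupling: a softmax over the diversity
    scores of CORRECT rollouts produces per-rollout weights for the local
    entropy regularizer. Incorrect rollouts receive weight 0. Higher tau
    focuses regularization on the most distinctive correct solutions;
    lower tau spreads it more uniformly (assumed parameterization)."""
    d = np.asarray(div_scores, dtype=float)
    m = np.asarray(correct_mask, dtype=bool)
    w = np.zeros_like(d)
    if m.any():
        z = tau * d[m]
        z -= z.max()                 # subtract max for numerical stability
        e = np.exp(z)
        w[m] = e / e.sum()           # weights sum to 1 over correct rollouts
    return w
```

Because the weights are normalized over correct rollouts only, the total local-regularization budget is fixed; the temperature only redistributes it toward rarer solution modes.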
As shown in the figure below, the local regularization component operates at the token level, encouraging controlled entropy along positive trajectories. It uses a time-averaged conditional entropy objective, re-expressed via importance sampling to remain differentiable and computable from the same rollouts used for policy update. This formulation avoids length bias and restricts entropy promotion to correct paths only, ensuring that exploration refines rather than corrupts valid reasoning. The final DSDR objective combines the GRPO policy loss, computed with augmented rewards, and the local entropy regularizer, weighted by the coupling coefficients derived from global diversity.
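The length-invariance property of the local term can be illustrated with a small sketch: per-token entropies along each correct trajectory are time-averaged (mean, not sum) so long rollouts earn no extra entropy credit, then combined with coupling weights (e.g., a softmax over the diversity scores of correct rollouts). The importance-sampling re-expression used for gradients is omitted; this only shows the value being regularized, under assumed names:

```python
import numpy as np

def length_invariant_entropy(token_entropies_per_rollout, weights):
    """Time-averaged conditional entropy per correct trajectory, combined
    with global-to-local coupling weights. Dividing by trajectory length
    (via the mean) removes length bias: a long rollout cannot inflate the
    regularizer simply by emitting more tokens."""
    per_traj = np.array([np.mean(h) for h in token_entropies_per_rollout])
    # Final DSDR-style objective would be roughly:
    #   loss = grpo_policy_loss - lam_l * length_invariant_entropy(...)
    # with lam_l a small coefficient (the paper reports 0.001 as stable).
    return float(np.dot(weights, per_traj))
```

A 10-token and a 2-token trajectory with identical per-token entropy contribute identically, which is exactly the invariance the text describes.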

The architecture’s dual-scale design is further visualized in the solution space comparison: while baseline exploration tends to collapse into a single correct mode, DSDR’s coupled regularization expands probability mass around multiple distinct correct trajectories, as indicated by the green arrows and broader basins in the right-hand plot. This structured exploration enables the model to discover diverse reasoning paths without sacrificing correctness, addressing a key limitation of standard RLVR methods.

Experiment
- DSDR consistently outperforms baseline methods (Backbone, GRPO, DAPO) across multiple math reasoning benchmarks and model scales, showing scalable gains in both Pass@1 and Avg@16, indicating improved solution stability and quality.
- The method excels on challenging benchmarks like AIME2024/2025 and Olympiad problems, where multiple valid reasoning paths exist, by preserving informative learning signals and mitigating reward-variance collapse.
- DSDR expands the diversity of correct reasoning trajectories rather than sharpening a single solution, leading to stronger and more reliable Pass@k performance across k values, especially on benchmarks with sparse correct solutions.
- Ablation studies confirm that both global diversity (GD) and global-to-local coupling (GC) are essential: removing either degrades performance, particularly on complex benchmarks, showing their complementary roles in promoting targeted exploration.
- Training dynamics reveal that DSDR maintains balanced exploration—higher entropy without instability—by combining correct-only global diversity with local regularization, preventing mode collapse while sustaining semantic and formula-level diversity.
- Diversity analysis shows DSDR generates responses with broader reasoning strategies without sacrificing correctness, validated by higher diversity scores and pass@32 performance compared to DAPO.
- Hyperparameter sensitivity tests indicate DSDR is stable within moderate regularization ranges, with λℓ = 0.001 and λd = 0.001 yielding optimal and consistent performance across benchmarks.
- Case studies demonstrate DSDR’s ability to produce multiple distinct correct solutions per problem, while DAPO tends toward limited or erroneous reasoning patterns, confirming DSDR’s effectiveness in controlled, correctness-preserving exploration.
The authors use DSDR to enhance mathematical reasoning in large language models by promoting diversity among correct solution trajectories. Results show consistent improvements over baseline methods across multiple benchmarks and model sizes, with gains becoming more pronounced as model capacity increases. Ablation studies confirm that both global diversity and global-to-local coupling are essential for maintaining stable exploration and preventing reward collapse.
