HyperAIHyperAI

Command Palette

Search for a command to run...

DSDR: LLM 추론에서 탐색을 위한 이중 스케일 다각도 정규화

초록

검증자를 활용한 강화학습(RLVR)은 대규모 언어모델(LLM)의 추론 능력을 향상시키는 핵심적 개념이지만, 기존 방법들은 탐색 범위가 제한적이라는 문제를 겪는다. 정책(policy)은 특정 추론 패턴에 집중되어 깊이 있는 탐색을 조기에 중단하는 경향이 있으며, 전통적인 엔트로피 정규화는 지역적인 확률적 성격만을 제공할 뿐 의미 있는 경로 수준의 다양성을 유도하지 못해 그룹 기반 정책 최적화에서 약하고 불안정한 학습 신호를 초래한다. 본 연구에서는 LLM 추론의 다양성을 전역적(global) 및 결합적(coupling) 구성 요소로 분해하는 이중 규모 다양성 정규화(DSDR) 프레임워크를 제안한다. 전역적으로 DSDR은 정답 추론 경로 간의 다양성을 증진시켜 서로 다른 해결 방식을 탐색하도록 유도한다. 지역적으로는 정답 경로에 제한된 길이 불변의 토큰 수준 엔트로피 정규화를 적용하여 각 해결 모드 내에서 엔트로피 붕괴를 방지하면서도 정확성을 유지한다. 두 규모는 전역에서 지역으로의 정규화 자원 할당 메커니즘을 통해 결합되며, 더 두드러진 정답 경로에 대해 지역적 정규화를 강조한다. 이론적으로 DSDR이 제한된 정규화 하에서도 최적의 정확도를 유지하고, 그룹 기반 최적화에서 정보가 풍부한 학습 신호를 유지하며, 원칙적인 전역-지역 결합 규칙을 도출함을 보여준다. 다양한 추론 벤치마크에서의 실험 결과는 정확도와 pass@k에서 일관된 성능 향상을 보이며, RLVR에서 깊이 있는 탐색을 위해 이중 규모의 다양성이 필수적임을 입증한다. 코드는 https://github.com/SUSTechBruce/DSDR 에서 공개되어 있다.

One-sentence Summary

Zhongwei Wan, Yun Shen, and colleagues from multiple institutions propose DSDR, a dual-scale diversity regularization method enhancing exploration in LLM reasoning by balancing local and global diversity, outperforming prior techniques in complex reasoning tasks without requiring architectural changes.

Key Contributions

  • DSDR introduces a dual-scale exploration framework for LLM reasoning that separates global diversity (across distinct correct solution paths) from local diversity (within each path), addressing the collapse into homogeneous reasoning patterns that plagues existing RLVR methods.
  • The method couples these scales via a global-to-local allocation mechanism that applies length-invariant, correctness-restricted token-level entropy only to more distinctive correct trajectories, preserving solution quality while preventing intra-mode entropy collapse.
  • Theoretical analysis confirms DSDR maintains optimal correctness and sustains informative learning signals in group-based RLVR, with empirical validation across reasoning benchmarks showing consistent gains in accuracy and pass@k.

Introduction

The authors leverage reinforcement learning with verifiers (RLVR) to enhance LLM reasoning but observe that existing methods suffer from poor exploration, collapsing into a few homogeneous reasoning patterns that hurt pass@k performance and generalization. Prior entropy-based or diversity-focused techniques either inject only local randomness or treat global and local diversity in isolation, failing to sustain meaningful, correctness-aligned exploration across scales. Their main contribution is DSDR, a dual-scale framework that jointly optimizes global diversity among correct trajectories and local token-level entropy regularization—restricted to correct paths and allocated adaptively based on trajectory distinctiveness—thereby preserving correctness while strengthening learning signals and enabling deeper, more stable exploration.

Method

The authors leverage a dual-scale diversity regularization framework, DSDR, built atop Group Relative Policy Optimization (GRPO), to enhance exploration and prevent collapse in reinforcement learning with verifiable rewards (RLVR) for reasoning tasks. The core innovation lies in coupling global trajectory-level diversity with local token-level entropy regularization, ensuring that exploration is both broad across distinct solution modes and fine-grained within high-value paths.

The overall training protocol follows a group-based sampling scheme: for each problem input, the policy model generates a group of candidate rollouts, which are then evaluated by a verifier to yield binary rewards. DSDR augments this process with two complementary diversity signals. At the global scale, it computes a bounded diversity score for each rollout by combining semantic dissimilarity—derived from frozen text embeddings—and formula-level uniqueness, which captures distinct symbolic manipulations. This global diversity score is then used to shape rewards exclusively for correct rollouts, preventing reward hacking and preserving the correctness objective while introducing controlled dispersion among valid solutions. This mechanism ensures that even when all sampled trajectories are correct, the group-relative advantages remain non-degenerate, maintaining a meaningful learning signal.

Refer to the framework diagram, which illustrates how global diversity scores are computed from multiple rollout trajectories and then fed into a global-to-local coupling module. This module allocates local regularization strength via a diversity-weighted softmax over correct rollouts. The temperature parameter in the softmax controls the concentration of exploration: higher values focus regularization on the most globally distinctive correct solutions, while lower values distribute it more uniformly. This coupling ensures that local entropy expansion is applied where it is most beneficial—around trajectories that are already rare within the group—thereby populating underexplored regions of the correct solution manifold.

As shown in the figure below, the local regularization component operates at the token level, encouraging controlled entropy along positive trajectories. It uses a time-averaged conditional entropy objective, re-expressed via importance sampling to remain differentiable and computable from the same rollouts used for policy update. This formulation avoids length bias and restricts entropy promotion to correct paths only, ensuring that exploration refines rather than corrupts valid reasoning. The final DSDR objective combines the GRPO policy loss, computed with augmented rewards, and the local entropy regularizer, weighted by the coupling coefficients derived from global diversity.

The architecture’s dual-scale design is further visualized in the solution space comparison: while baseline exploration tends to collapse into a single correct mode, DSDR’s coupled regularization expands probability mass around multiple distinct correct trajectories, as indicated by the green arrows and broader basins in the right-hand plot. This structured exploration enables the model to discover diverse reasoning paths without sacrificing correctness, addressing a key limitation of standard RLVR methods.

Experiment

  • DSDR consistently outperforms baseline methods (Backbone, GRPO, DAPO) across multiple math reasoning benchmarks and model scales, showing scalable gains in both Pass@1 and Avg@16, indicating improved solution stability and quality.
  • The method excels on challenging benchmarks like AIME2024/2025 and Olympiad problems, where multiple valid reasoning paths exist, by preserving informative learning signals and mitigating reward-variance collapse.
  • DSDR expands the diversity of correct reasoning trajectories rather than sharpening a single solution, leading to stronger and more reliable Pass@k performance across k values, especially on benchmarks with sparse correct solutions.
  • Ablation studies confirm that both global diversity (GD) and global-to-local coupling (GC) are essential: removing either degrades performance, particularly on complex benchmarks, showing their complementary roles in promoting targeted exploration.
  • Training dynamics reveal that DSDR maintains balanced exploration—higher entropy without instability—by combining correct-only global diversity with local regularization, preventing mode collapse while sustaining semantic and formula-level diversity.
  • Diversity analysis shows DSDR generates responses with broader reasoning strategies without sacrificing correctness, validated by higher diversity scores and pass@32 performance compared to DAPO.
  • Hyperparameter sensitivity tests indicate DSDR is stable within moderate regularization ranges, with λℓ = 0.001 and λd = 0.001 yielding optimal and consistent performance across benchmarks.
  • Case studies demonstrate DSDR’s ability to produce multiple distinct correct solutions per problem, while DAPO tends toward limited or erroneous reasoning patterns, confirming DSDR’s effectiveness in controlled, correctness-preserving exploration.

The authors use DSDR to enhance mathematical reasoning in large language models by promoting diversity among correct solution trajectories. Results show consistent improvements over baseline methods across multiple benchmarks and model sizes, with gains becoming more pronounced as model capacity increases. Ablation studies confirm that both global diversity and global-to-local coupling are essential for maintaining stable exploration and preventing reward collapse.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
DSDR: LLM 추론에서 탐색을 위한 이중 스케일 다각도 정규화 | 문서 | HyperAI초신경