Rewarding the Rare: Uniqueness-Aware Reinforcement Learning for Creative Problem Solving in LLMs
Zhiyuan Hu Yucheng Wang Yufei He Jiaying Wu Yilun Zhao See-Kiong Ng Cynthia Breazeal Anh Tuan Luu Hae Won Park Bryan Hooi
Abstract
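Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: the policy prematurely concentrates on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions exhibiting rare high-level strategies. The method clusters rollouts generated for the same problem according to their high-level reasoning strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct solutions that use novel strategies receive higher rewards than those that repeat existing ones. In experiments on mathematics, physics, and medical reasoning benchmarks, the method consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1. It also sustains exploration and uncovers more diverse solution strategies at scale.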
One-sentence Summary
The authors, affiliated with MIT, NUS, Yale, and NTU, propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that rewards rare high-level reasoning strategies via LLM-based clustering and inverse cluster-size reweighting, improving solution diversity and pass@k performance across mathematics, physics, and medical reasoning benchmarks without sacrificing pass@1.
Key Contributions
- Reinforcement learning for large language models often suffers from exploration collapse, where policies converge to a narrow set of dominant reasoning patterns, limiting solution diversity despite improvements in pass@1; this work identifies the root cause as the misalignment between token-level regularization and the need for strategy-level diversity in complex reasoning tasks.
- The authors propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that uses an LLM-based judge to cluster solutions by high-level strategy, reweighting policy advantages inversely with cluster size to reward rare, correct solution paths while downweighting redundant ones.
- Evaluated on mathematics, physics, and medical reasoning benchmarks, the method consistently improves pass@k across large sampling budgets and increases AUC@K without sacrificing pass@1, demonstrating enhanced exploration and broader coverage of human-annotated solution strategies.
Introduction
Reinforcement learning (RL) for post-training large language models (LLMs) is critical for enhancing complex reasoning, but it often suffers from exploration collapse, in which policies converge to a narrow set of dominant reasoning patterns: pass@1 improves, but pass@k does not, because the sampled solution strategies lack diversity. Prior methods attempt to address this through token-level diversity signals such as entropy bonuses or embedding distances, but these fail to capture high-level strategic differences and treat superficial variations as meaningful diversity. The authors introduce Uniqueness-Aware Reinforcement Learning, a rollout-level objective that uses an LLM-based judge to cluster multiple solution attempts by their high-level strategies, then reweights policy advantages inversely with cluster size. This rewards correct, rare strategies while downweighting common ones, promoting genuine strategy-level diversity. Evaluated across mathematics, physics, and medical reasoning benchmarks, the method consistently improves pass@k and AUC@K without sacrificing pass@1, enabling sustained exploration even at large sampling budgets.
Dataset
- The dataset comprises domain-specific reasoning problems from three disciplines: mathematics, physics, and medicine, curated for reinforcement learning (RL) training.
- For mathematics, the authors use a difficulty-filtered subset of MATH (Hendrycks et al., 2021), selecting 8,523 problems from Levels 3–5, the harder end of the dataset's difficulty scale.
- In physics, the dataset is derived from the textbook reasoning split of MegaScience (Fan et al., 2025), with 7,000 examples randomly sampled from a pool of 1.25 million textbook-based items to ensure broad coverage of conceptual reasoning.
- For medicine, 3,000 examples are randomly selected from MedCaseReasoning (Wu et al., 2025), which contains 13.1k total cases, focusing on clinical reasoning scenarios relevant to diagnostic decision-making.
- The training process uses these subsets with a fixed mixture ratio across domains, combining them into a unified training mix to support multi-domain reasoning.
- Each training example is processed with standardized formatting, including structured prompts and ground truth answers, and is used in rollout-based RL training with 8 rollouts per prompt.
- The authors cap generation length at 4096 new tokens for Qwen-2.5 and 20480 new tokens for Qwen-3 and OLMo-3 models to keep training efficient.
- Metadata for each example includes domain, problem type, and difficulty level, constructed from source annotations and used to guide training and evaluation.
- The models are trained using AdamW with a learning rate of 5×10⁻⁷, KL regularization (λ_KL = 0.001), and temperature T = 1.0 during generation.
- Evaluation is conducted on held-out benchmarks: AIME 2024 and 2025 and HLE (math), OlympiadBench (physics), and the MedCaseReasoning test set (medicine), all using text-only questions.
- Performance is measured using pass@k and AUC@K, with AUC@K computed via the trapezoidal rule to summarize overall performance across different inference budgets.
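The two metrics in the last bullet can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the combinatorial pass@k estimator is the commonly used unbiased form and is an assumption here, as are the linear budget axis, the normalization by the budget range, and the function names `pass_at_k` and `auc_at_k`.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must lie between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def auc_at_k(ks, values, normalize=True):
    """Area under the pass@k curve via the trapezoidal rule.

    Assumption: budgets ks are used on a linear axis, and the area is
    divided by the budget range so the result stays in [0, 1].
    """
    if len(ks) != len(values) or len(ks) < 2:
        raise ValueError("need at least two (k, pass@k) points")
    area = 0.0
    for i in range(1, len(ks)):
        width = ks[i] - ks[i - 1]
        area += 0.5 * (values[i] + values[i - 1]) * width
    if normalize:
        area /= ks[-1] - ks[0]
    return area


# Toy usage: 4 samples, 1 correct, budgets k = 1, 2, 4.
budgets = [1, 2, 4]
curve = [pass_at_k(4, 1, k) for k in budgets]
score = auc_at_k(budgets, curve)
```

A log-scaled budget axis would be an equally plausible choice; the source only states that the trapezoidal rule is used.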
Method
The authors leverage a group-based reinforcement learning framework, extending Group Relative Policy Optimization (GRPO), to enhance the diversity of solution strategies in large language models. The overall method operates by reweighting policy update advantages to favor correct but rare solution strategies, thereby mitigating exploration collapse. As shown in the framework diagram, the process begins with a problem input, from which multiple reasoning traces (rollouts) are generated by the language model. These rollouts are then processed through an LLM-based classifier that groups them into strategy clusters based on high-level solution ideas, such as geometric packing or finite differences, rather than surface-level variations. The classification step is critical for identifying solution strategy uniqueness.
The core of the method lies in the advantage calculation, which combines both solution quality and strategy uniqueness. For each problem, the policy generates $K$ rollouts, each receiving a scalar reward from a task-specific verifier. In vanilla GRPO, the group-normalized advantage for a rollout $p_{m,k}$ is computed as $z_{m,k} = (r_{m,k} - \mu_m)/(\sigma_m + \varepsilon)$, where $\mu_m$ and $\sigma_m$ are the mean and standard deviation of rewards within the group. The authors modify this by introducing a uniqueness weight $w_{m,k} = 1/f_{m,k}^{\alpha}$, where $f_{m,k}$ is the size of the strategy cluster to which rollout $p_{m,k}$ belongs, and $\alpha$ is a hyperparameter controlling the reweighting strength. This weight ensures that rollouts belonging to small, rare clusters (e.g., a unique but correct approach) receive a larger effective advantage, while rollouts in large, common clusters are downweighted. The final advantage used for policy updates is the product of the quality-normalized term and the uniqueness weight: $\mathrm{advantage}_{m,k} = w_{m,k}\, z_{m,k}$.
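The advantage computation described above can be sketched in a few lines of Python for a single problem. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `uniqueness_aware_advantages` is hypothetical, the cluster labels are assumed to come from the LLM-based judge, and the population standard deviation is assumed for the group normalization.

```python
from collections import Counter
from statistics import fmean, pstdev


def uniqueness_aware_advantages(rewards, cluster_ids, alpha=1.0, eps=1e-6):
    """Return w * z for each rollout of one problem.

    rewards:     verifier rewards, one per rollout.
    cluster_ids: strategy-cluster label per rollout (assumed to be
                 produced by the LLM judge; any hashable label works).
    alpha:       reweighting strength; alpha = 0 recovers the plain
                 group-normalized advantage.
    """
    if len(rewards) != len(cluster_ids):
        raise ValueError("rewards and cluster_ids must have equal length")
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population std within the group
    sizes = Counter(cluster_ids)  # cluster label -> cluster size f
    advantages = []
    for reward, label in zip(rewards, cluster_ids):
        z = (reward - mu) / (sigma + eps)  # quality term
        w = 1.0 / sizes[label] ** alpha    # uniqueness weight
        advantages.append(w * z)
    return advantages


# Toy group: three correct rollouts, two of which share strategy "a".
adv = uniqueness_aware_advantages([1.0, 1.0, 1.0, 0.0], ["a", "a", "b", "c"])
```

In this toy group the correct rollout in the singleton cluster "b" receives twice the advantage of each correct rollout in the two-member cluster "a", which is the intended effect of the reweighting at $\alpha = 1$.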
The training objective remains consistent with GRPO, using the modified advantage term in a policy-gradient objective. The policy parameters are updated to maximize the expected advantage-weighted log-likelihood of the rollouts. This approach effectively encourages the policy to explore and exploit a broader range of high-level solution strategies for each problem, rather than converging to a single dominant mode. The method is designed as a drop-in replacement for the standard GRPO advantage, making it straightforward to integrate into existing reinforcement learning pipelines for language models.
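One way to write the advantage-weighted log-likelihood described above is given below. It is a simplified sketch, not the exact objective from the paper: it omits the clipped importance ratio and the KL penalty that GRPO-style training typically includes, and the symbol $x_m$ for the prompt of problem $m$ is notation assumed here. The weight $w_{m,k}$ and the normalized term $z_{m,k}$ are treated as constants with respect to $\theta$.

```latex
J(\theta) \;=\; \mathbb{E}_{m}\!\left[ \frac{1}{K} \sum_{k=1}^{K}
    w_{m,k}\, z_{m,k}\, \log \pi_\theta\!\left(p_{m,k} \mid x_m\right) \right],
\qquad
w_{m,k} = \frac{1}{f_{m,k}^{\alpha}},
\quad
z_{m,k} = \frac{r_{m,k} - \mu_m}{\sigma_m + \varepsilon}
```

Setting $\alpha = 0$ gives $w_{m,k} = 1$ for every rollout, so the expression reduces to the unweighted group-normalized form.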
Experiment
- Evaluated pass@k performance across math (AIME 2024/2025, HLE), physics (OlympiadBench-Physics), and medicine (MedCaseReasoning) domains using Qwen2.5-7B; the uniqueness-aware RL method (OURS) consistently outperforms both the Instruct baseline and the GRPO-only SimpleRL baseline, especially at medium-to-large budgets (k ≥ 32), with higher asymptotic accuracy and an improved pass@k slope.
- Achieved highest AUC@K across all domains and budgets (K=64/128/256) on Qwen2.5-7B, with significant gains on challenging AIME and HLE settings (e.g., +0.058 at K=128 on AIME), indicating superior accuracy–coverage trade-off and reduced mode collapse.
- On OLMo-3-7B and Qwen-3-8B, OURS again achieved the best AUC@K on HLE/Physics, outperforming alternative exploration methods (DAPO, Forking Token), demonstrating generalization and complementary benefits from uniqueness-aware training.
- Demonstrated sustained exploration via entropy dynamics: OURS maintains higher and more stable policy entropy during training compared to SimpleRL, which exhibits decreasing entropy, indicating preserved diversity in solution trajectories.
- Introduced cover@n to measure human solution coverage; on 20 challenging AIME problems, OURS achieved higher cover@32 than the instruct baseline on 4 complex problems, including full coverage (100%) on a geometry problem by recovering rare strategies like Symmedian Similarity, and 75% coverage on a combinatorics problem by adding Trail/Flow Viewpoint.
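The cover@n metric in the last bullet can be illustrated with a short sketch. The exact definition is not given in this summary, so the version below is an assumption: the fraction of human-annotated strategies that appear among the strategy labels of the first n rollouts. The function name `cover_at_n` and the placeholder labels are hypothetical.

```python
def cover_at_n(rollout_strategies, human_strategies, n):
    """Fraction of human-annotated strategies seen in the first n rollouts.

    Assumption: each rollout already carries one strategy label, and a
    human strategy counts as covered if any of the n rollouts matches it.
    """
    reference = set(human_strategies)
    if not reference:
        raise ValueError("human_strategies must not be empty")
    if n < 1:
        raise ValueError("n must be at least 1")
    seen = set(rollout_strategies[:n])
    return len(reference & seen) / len(reference)


# Placeholder labels; "X" is a strategy outside the human-annotated set.
human = ["A", "B", "C", "D"]
rollouts = ["A", "A", "B", "X", "C"]
coverage_small = cover_at_n(rollouts, human, 3)
coverage_large = cover_at_n(rollouts, human, 5)
```

Under this reading, covering three of four annotated strategies yields 75%, matching the scale of the percentages reported in the bullet.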
The authors use AUC@K to evaluate the accuracy-coverage trade-off across different models and domains, with higher values indicating better performance. Results show that the proposed method (OURS) consistently achieves the highest AUC@K across all budgets (64, 128, 256) and domains (AIME, HLE, Physics, Medicine) on the Qwen2.5-7B model, outperforming both the Instruct baseline and the SimpleRL baseline, with the largest gains observed in the more challenging AIME and HLE settings.

The authors use AUC@K to evaluate the accuracy-coverage trade-off across different models and methods. Results show that their uniqueness-aware RL method (OURS) achieves the highest AUC@64 and AUC@128 on both HLE and Physics benchmarks for both OLMo-3-7B and Qwen-3-8B models, outperforming both the Instruct baseline and the SimpleRL baseline, with the largest gains observed on the more challenging HLE setting.
