HyperAIHyperAI

Command Palette

Search for a command to run...

POPE: 정책 탐색을 통해 어려운 문제에서 추론하는 방법 학습하기

Yuxiao Qu Amrith Setlur Virginia Smith Ruslan Salakhutdinov Aviral Kumar

초록

강화학습(RL)은 대규모 언어모델(LLM)의 추론 능력을 향상시켜 왔지만, 최첨단 기법들 역시 많은 학습 문제에 대해 학습하지 못하는 문제가 존재한다. 어려운 문제에서는 온폴리시(On-policy) RL이 단 하나의 정답 롤아웃조차 탐색하지 못하는 경우가 흔하며, 이로 인해 보상이 0이 되고 개선을 이끄는 학습 신호가 전혀 발생하지 않는다. 우리는 고전적 강화학습에서 탐색 문제를 해결하기 위한 자연스러운 접근 방식, 예를 들어 엔트로피 보너스, 중요도 비율의 더 완화된 클리핑, 또는 pass@k 목적함수의 직접 최적화 등이 이 문제를 해결하지 못하고 오히려 최적화를 불안정하게 만들며 해결 가능성 향상에도 기여하지 않는다는 점을 발견했다. 자연스러운 대안으로는 쉬운 문제로부터의 전이를 활용하는 것이다. 그러나 우리는 RL 학습 과정에서 쉬운 문제와 어려운 문제를 혼합할 경우, ‘레이 간섭(Ray Interference)’ 현상으로 인해 오히려 비효율적이 된다는 점을 보여준다. 이 현상은 최적화가 이미 해결 가능한 문제에 집중하게 되어, 더 어려운 문제에 대한 진전을 능동적으로 저해한다. 이러한 도전에 대응하기 위해 우리는 ‘특권적 온폴리시 탐색(Privileged On-Policy Exploration, POPE)’을 제안한다. POPE는 다른 방법들이 오라클 솔루션을 학습 대상으로 사용하는 것과 달리(예: 오프폴리시 RL 또는 SFT에서의 웜스타팅), 인간 또는 기타 오라클의 솔루션을 특권 정보(Privileged Information)로 활용하여 어려운 문제에서의 탐색을 안내한다. POPE는 어려운 문제에 오라클 솔루션의 접두사(Prefix)를 추가함으로써, 지도된 롤아웃 중에도 0이 아닌 보상을 얻을 수 있도록 한다. 핵심적으로, 이로 인한 행동은 지시어 수행 능력과 추론 능력 간의 상호작용을 통해 원래의 지도되지 않은 문제로 전이된다. 실험적으로 POPE는 해결 가능한 문제의 범위를 확장하고, 도전적인 추론 벤치마크에서 성능을 크게 향상시킴을 입증하였다.

One-sentence Summary

Researchers from Carnegie Mellon University propose POPE, a method that uses partial oracle solutions to guide on-policy RL exploration for hard reasoning problems, enabling models to learn from unguided tasks by leveraging instruction-following and backtracking behaviors, without destabilizing optimization or requiring oracle data as training targets.

Key Contributions

  • POPE addresses the failure of on-policy RL on hard reasoning problems by using oracle solution prefixes to guide exploration, enabling non-zero reward rollouts without training directly on oracle data as targets.
  • Unlike entropy bonuses or mixed easy-hard training—which destabilize optimization or cause ray interference—POPE leverages instruction-following to steer models toward solvable paths, preserving stable training while improving exploration.
  • Evaluated on benchmarks like DAPO-MATH-17K, POPE significantly expands the set of solvable problems and boosts performance by enabling transfer from guided to unguided problem solving.

Introduction

The authors leverage reinforcement learning to improve large language models’ reasoning on hard problems, where standard on-policy RL fails because it rarely samples any correct rollout—leaving no learning signal. Prior fixes like entropy bonuses, pass@k optimization, or mixing easy/hard problems don’t help; they either destabilize training or worsen performance due to “ray interference,” where optimization focuses on already-solvable tasks. Their main contribution is Privileged On-Policy Exploration (POPE), which uses short prefixes of human- or oracle-written solutions to guide exploration during RL—without ever training the model to imitate those prefixes. This lets the model sample successful rollouts on hard problems, and the learned behaviors transfer back to unguided settings via instruction-following and backtracking, significantly expanding the set of solvable problems.

Dataset

  • The authors use a mix of human-written and model-generated math problem solutions, drawn from two key sources: Omni-MATH (human solutions) and DAPO (solutions generated by gemini-2.5-pro).
  • Omni-MATH provides structured problem-solution pairs with step-by-step reasoning, including algebraic expressions and verification steps; examples include problems requiring construction of sets with specific sum properties.
  • DAPO contributes model-generated solutions, such as finding the smallest natural number n where n²−n+11 has exactly four prime factors, enabling comparison between human and synthetic reasoning.
  • No explicit filtering rules or size metrics are given for either subset in the provided text; the focus is on solution structure and correctness rather than dataset scale.
  • The data is used as-is for evaluation or illustration; no training split, mixture ratios, cropping, or metadata construction are mentioned in the excerpt.
  • Processing appears minimal: solutions are presented in raw LaTeX and prose form, preserving original formatting and logical flow for direct analysis or comparison.

Method

The authors leverage a novel exploration strategy called Privileged On-Policy Exploration (POPE) to address the fundamental limitation of standard on-policy reinforcement learning (RL) on hard problems, where reward signals vanish and training stalls. Rather than treating oracle solutions as direct training targets—which risks memorization or destabilizing the policy—POPE uses them solely as contextual guidance to steer the model’s own on-policy rollouts into regions of the solution space where non-zero reward becomes attainable.

The core mechanism involves augmenting each hard problem with a short, carefully selected prefix from a human-written solution, accompanied by a system instruction that directs the model to build upon this prefix. This guidance does not replace the model’s generative process; instead, it conditions the policy to begin its rollout from a more favorable internal state. The authors identify the minimal prefix length i(x)i^*(\mathbf{x})i(x) for each problem x\mathbf{x}x by evaluating which prefix enables the base model to produce at least one successful rollout. If no such prefix exists, a short random segment (less than 1/4 of the full solution) is used. The guided dataset is then constructed as:

Dhardguided:={concat(x,z0:i(x),I)xDhard}.\mathcal{D}_{\mathrm{hard}}^{\mathrm{guided}} := \left\{ \operatorname{concat}(\mathbf{x}, \mathbf{z}^{0:i^*(\mathbf{x})}, I) \mid \mathbf{x} \in \mathcal{D}_{\mathrm{hard}} \right\}.Dhardguided:={concat(x,z0:i(x),I)xDhard}.

Training proceeds on a 1:1 mixture of unguided hard problems and their guided counterparts, ensuring the model learns to generalize from guided to unguided settings. This approach remains fully on-policy: all rollouts are generated by the current policy, and no off-policy data is used for gradient updates.

Refer to the framework diagram, which contrasts POPE with standard RL and direct oracle training. Standard RL fails on hard problems due to zero advantages when all rollouts are incorrect. Training directly on oracle solutions via SFT or mid-training injection introduces optimization pathologies. POPE, in contrast, provides in-context guidance to enable reward signal without distorting the policy’s generative behavior.

The efficacy of POPE is grounded in a mental model of exploration in an MDP. Hard problems are characterized by sparse reward, where the agent must reach a subset of states SgoodS_{\text{good}}Sgood from which reward is reliably attainable. Guidance acts as a roll-in policy that steers the model into SgoodS_{\text{good}}Sgood, enabling on-policy RL to learn effective continuations. Once these continuations are learned, the model can succeed from SgoodS_{\text{good}}Sgood without guidance, reducing the remaining challenge to reaching such states from the initial problem.

As shown in the figure below, reasoning traces in LLMs often involve self-verification and backtracking, which amplify coverage over states near the initial problem. When these behaviors occur under guidance, they induce overlap between guided and unguided state spaces. This overlap allows the learned policy to generalize from guided successes to unguided prefixes, effectively reducing the exploration problem to reaching any nearby state rather than reproducing the full guidance string.

The training process is implemented using the Pipeline-RL framework with GRPO as the underlying optimizer. Actor workers generate up to 8 rollouts per prompt, which are buffered and processed by preprocessing workers before being used by learner workers for policy updates. The RL loss follows the clipped surrogate objective of GRPO, with advantages normalized by the batch mean to reduce variance. The authors do not include entropy or KL regularization terms by default, focusing instead on the impact of guidance on exploration dynamics.

The authors also explore pass@kkk optimization as an alternative to standard reward maximization, motivated by the observation that optimizing pass@1 can lead to ray interference—overfitting to easy problems at the expense of hard ones. The pass@kkk objective maximizes the probability that at least one of kkk independent rollouts succeeds, estimated via the unbiased estimator:

ρ(n,c,k)=1(nck)(nk),\rho(n, c, k) = 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}},ρ(n,c,k)=1(kn)(knc),

where nnn is the number of rollouts sampled per prompt, and ccc is the number of correct rollouts. The gradient is computed as a weighted policy gradient, with weights depending on whether a rollout is correct or incorrect, allowing the model to learn from both successes and near-misses.

In summary, POPE transforms sparse-reward exploration into a two-stage problem: first, reach a state from which reward is attainable (via guidance), then learn to exploit that state (via on-policy RL). The structure of reasoning traces in LLMs, particularly self-correction and backtracking, enables transfer from guided to unguided problems by inducing overlap in the latent state space, making POPE a scalable and effective method for improving performance on hard problems without compromising the model’s generative capabilities.

Experiment

  • Token-level exploration methods (entropy bonuses, higher clip ratios) fail on hard problems, causing entropy explosion and no meaningful solvability gains.
  • Mixing easy problems with hard ones during training induces ray interference, stalling progress on hard problems despite early gains.
  • Direct optimization of pass@k metrics does not aid exploration on hard problems; it reduces reward signals and slows convergence.
  • POPE, using guided prefixes, avoids ray interference and enables steady improvement on hard problems, even when mixed with easy problems.
  • POPE’s effectiveness relies on overlap between guided and unguided reasoning paths; restricting revisiting guidance weakens transfer to unguided settings.
  • POPE outperforms supervised fine-tuning on oracle solutions, which collapse entropy and degrade performance on both hard problems and benchmarks.
  • POPE improves downstream performance on standardized benchmarks (AIME2025, HMMT2025), especially on harder ones, while maintaining robustness across heterogeneous training data.

The authors find that standard reinforcement learning and token-level exploration methods fail to solve hard problems due to entropy explosion and ray interference, while supervised fine-tuning on oracle solutions degrades performance by collapsing token entropy. In contrast, POPE—a method using guided prefixes—consistently improves solvability on hard problems, mitigates interference from easy problems, and enhances performance on standardized benchmarks even in mixed training settings.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
POPE: 정책 탐색을 통해 어려운 문제에서 추론하는 방법 학습하기 | 문서 | HyperAI초신경