Command Palette
Search for a command to run...
자신을 가르치는 모델을 가르치기: 학습 가능 경계에서의 추론
자신을 가르치는 모델을 가르치기: 학습 가능 경계에서의 추론
Shobhita Sundaram John Quan Ariel Kwiatkowski Kartik Ahuja Yann Ollivier Julia Kempe
초록
모델은 스스로의 학습 정체기를 벗어날 수 있는가? 대규모 추론 모델을 미세조정하는 강화학습 기법은 초기 성공률이 낮은 데이터셋에서 학습 신호가 부족해 학습이 정체되는 문제가 발생한다. 본 연구는 근본적인 질문을 탐구한다: 사전 학습된 대규모 언어 모델(LLM)은 해결할 수 없는 문제에 대해 자동으로 교육 과정(curriculum)을 생성할 수 있는 잠재적 지식을 활용할 수 있는가? 이를 탐색하기 위해 우리는 메타-강화학습(meta-RL)을 통해 이러한 교육적 신호를 드러내는 자기 개선 프레임워크 SOAR를 설계하였다. SOAR는 모델의 교사 복제본이 학습자 복제본을 위한 합성 문제를 제안하고, 어려운 문제의 소량의 하위 집합에서의 성과 향상에 따라 보상을 받는 구조를 취한다. 중요한 점은 SOAR가 내재적 대체 보상(reward)이 아닌, 학습자의 실제 진보를 기반으로 교육 과정을 정의한다는 것이다. 수학 벤치마크에서 가장 어려운 하위 집합(성공률 0/128)에 대한 실험을 통해 세 가지 핵심 발견을 도출하였다. 첫째, 사전 학습된 모델이 유용한 단계적 보조 문제(stepping stones)를 생성하는 잠재적 능력을 강화함으로써, 희박하고 이진형의 보상 조건 하에서도 이중 수준(bi-level)의 메타-강화학습을 실현할 수 있음을 입증하였다. 둘째, 내재적 보상 체계를 사용한 이전의 LLM 자가대결(self-play) 기법과 달리, 측정된 학습자 진보 기반의 보상이 안정성과 다양성 붕괴 현상을 효과적으로 피하며, 더욱 신뢰할 수 있는 성능을 보였다. 셋째, 생성된 문제를 분석한 결과, 해답의 정확성보다 문제의 구조적 품질과 적절한 정의 여부가 학습 진전에 더 중요한 영향을 미친다는 점을 확인하였다. 본 연구 결과는 어려운 문제를 실제로 해결할 수 있는 능력이 사전에 존재하지 않더라도, 유용한 단계적 보조 문제를 생성할 수 있음을 시사하며, 추가적인 정제된 데이터 없이 추론 정체기를 극복할 수 있는 체계적인 길을 제시한다.
One-sentence Summary
MIT and Meta FAIR researchers propose SOAR, a meta-RL framework enabling LLMs to generate self-curated curricula for unsolvable problems by grounding teacher rewards in student progress, not intrinsic signals, thereby escaping reasoning plateaus without curated data.
Key Contributions
- SOAR introduces a bi-level meta-RL framework that enables pretrained LLMs to generate synthetic curricula for hard problems they initially cannot solve, using grounded rewards based on measurable student improvement rather than intrinsic proxies.
- On math benchmarks with near-zero initial success (0/128), SOAR reliably avoids the instability and collapse seen in prior self-play methods by anchoring teacher rewards to real performance gains, not self-consistency or solution quality.
- Analysis shows that structural quality and well-posedness of generated problems drive learning more than solution correctness, demonstrating that useful stepping stones can emerge without prior solving ability.
Introduction
The authors tackle a core limitation in fine-tuning large language models for reasoning: when initial success rates are near zero, reinforcement learning with verifiable rewards (RLVR) fails due to sparse signals. Prior self-play and curriculum methods rely on intrinsic or proxy rewards—like self-consistency or gradient norms—which often collapse into degenerate or unlearnable tasks, especially in symbolic domains with binary correctness. The authors introduce SOAR, a meta-RL framework where a teacher model generates synthetic problems for a student model, and is rewarded only when the student improves on a small set of real hard problems. This grounded, bilevel approach avoids reward hacking and enables learning even when the model cannot initially solve the target problems, revealing that latent knowledge can be surfaced through self-generated stepping stones without human curation.
Dataset
-
The authors use three math reasoning benchmarks—MATH, HARP, and OlympiadBench—to study sparse binary rewards in settings without automatic verification. These cover problems from AMC, AIME, USA(J)MO, and international Olympiads.
-
For each dataset, they apply a “fail@128” filter: they sample 128 solutions per problem using Llama-3.2-3B-Instruct (with 1024 token budget and temperature 1.0) and retain only problems with 0/128 success rate. This creates a challenging subset where direct training yields minimal gains.
-
OlympiadBench: Uses only the 674 English, text-only, automatically verifiable problems. A random 50-50 train/test split is created since the original was a test set.
-
HARP: Uses the full dataset, and also applies a random 50-50 train/test split for the same reason.
-
MATH: To avoid memorization bias (since Llama-3.2-3B-Instruct shows higher accuracy on MATH’s official train split), they draw the initial problem pool from the 5000-problem official test split. After applying fail@128 filtering, they create their own 50-50 train/test split. All training and evaluation use only this internal split.
-
Dataset sizes: Final train/test splits are reported in Table 2. The test sets are made larger (50% of filtered data) to reliably measure performance gains over stochastic variance.
-
No cropping or metadata construction is mentioned beyond the filtering and splitting. All synthetic data and student-teacher training uses only the internal training splits; final results are evaluated solely on the held-out test splits.
Method
The authors leverage a teacher-student meta-RL framework, termed SOAR, to enable a pretrained language model to generate its own stepping-stone curriculum for solving difficult problems. The framework operates as an asymmetric self-play system where two models, initialized from the same base model, are trained in a nested loop structure. The teacher model, denoted as πϕT, generates synthetic question-answer pairs (q,a)synthetic, which are used to train the student model, πθS. The student's training occurs in an inner loop, where it learns to answer the teacher-generated problems using reinforcement learning. The performance of the student on a set of hard, real-world problems from the target dataset serves as the reward signal for the teacher in the outer loop. This creates a feedback mechanism where the teacher is incentivized to produce synthetic problems that lead to measurable improvement in the student's ability to solve the hard problems, without the teacher ever directly observing these hard problems.
The core of the framework is a bilevel optimization problem, where the objective is to generate a synthetic dataset X that, when used to train the student, maximizes the student's performance on the target domain. To make this computationally feasible, the authors instantiate this objective as a nested meta-RL loop. The outer loop trains the teacher using RLOO (Reinforcement Learning with Out-of-Loop Optimization) to generate synthetic question-answer pairs. The inner loop trains the student using standard RLVR (Reinforcement Learning with Value-based Reward) on the teacher-generated dataset. The key innovation is the grounding of the teacher's reward signal in the student's actual performance on the hard problems, rather than using intrinsic rewards. This black-box reward signal ensures that the teacher is penalized for generating degenerate or unhelpful problems and is rewarded only when the synthetic curriculum leads to genuine learning progress.
In the outer loop, the teacher generates a batch of g⋅n synthetic question-answer pairs, which are partitioned into g datasets of size n. For each dataset Xk, the inner loop is executed. This involves training the student for a fixed number of steps (10) on Xk and then evaluating the resulting student policy on a subsampled set of hard questions QR from the original training set. The reward for the dataset Xk is the average improvement in the student's success rate on QR compared to a baseline student. To mitigate noise, this reward is averaged over r parallel student trainings. The teacher is then updated using the RLOO algorithm based on these dataset-level rewards.
The inner loop involves training the student on a synthetic dataset Xk using RLOO. The student is trained for a small number of steps to induce measurable policy movement while keeping the computational cost low. After each inner loop, the student reverts to the baseline policy for the next iteration. To address the challenge of the teacher adapting to an improving student, a promotion mechanism is introduced. A moving average of the teacher rewards is tracked, and when it exceeds a fixed threshold τ, the student baseline is reset to the best-performing student policy from the previous iteration. This accumulated dataset, which led to student promotion, is stored as the "Promotion Questions" (PQ) for evaluation.
Experiment
- SOAR trains teacher-student pairs on MATH and HARP (held-out OlympiadBench for OOD testing), using Llama-3.2-3B-Instruct, with 200 outer-loop steps and n=64 samples per iteration; promotes student if 3-step moving average reward exceeds τ=0.01.
- Promoted Student (PS) achieves +8.5% pass@32 on MATH and +3.6% on HARP over Hard-Only; Promotion Questions (PQ) yield +9.3% on MATH and +4.2% on HARP, confirming synthetic questions—not training trajectory—drive gains.
- PQ transfers to OlympiadBench (+6% MATH-PQ, +3% HARP-PQ over Hard-Only), showing cross-dataset generalization despite no OOD optimization.
- PQ recovers 75% of performance gain from full MATH training (6750 problems) and 50% from HARP; HARP-PQ outperforms 128 real HARP questions and matches 128 real MATH questions.
- Grounded-T teachers outperform Intrinsic-T and Base-T, with stable student trajectories and higher diversity (Vendi Score 34.91 vs. 10.82 for Intrinsic-T); Intrinsic-T exhibits high variance and collapse in 1/3 seeds.
- Synthetic questions need not be correct—only 32.8% of PQ problems are fully correct—but structural coherence and diversity matter more; meta-RL reduces ambiguity errors vs. Base-T.
- Hard-Only with 4× compute (group size 128) gains only +2.8% pass@32 on MATH, far below PQ gains.
- Teacher policy itself improves via meta-RL: Grounded-T questions match PQ performance and stabilize student learning curves; promotion mechanism is essential.
- Multi-turn question generation underperforms single-turn; larger sampled datasets (128 vs. 64) from Grounded-T reduce variance without sacrificing mean performance.
- Teacher training follows search-exploitation cycles; grounded rewards preserve diversity while intrinsic rewards collapse it during convergence.
The authors use SOAR to generate synthetic questions that improve student performance on hard problem sets, with results showing that SOAR-PQ and SOAR-PS methods significantly outperform Hard-Only and Intrinsic-T baselines across all pass@k metrics. The best-performing methods, SOAR-PQ (MATH) and SOAR-PS (HARP), achieve pass@32 scores of 12.0 ± 3.0 and 11.7 ± 1.6 respectively, demonstrating that grounded meta-RL effectively discovers useful curricula that enable learning beyond performance plateaus.

The authors use SOAR to generate synthetic questions that improve student performance on hard problem sets, with results showing that SOAR-PQ and SOAR-PS significantly outperform Hard-Only and Intrinsic-T baselines across all pass@k metrics. The best-performing method, SOAR-PQ, achieves a 19.1% pass@32 accuracy, surpassing even the full MATH train set in some cases, while also demonstrating strong cross-dataset generalization to OlympiadBench.

The authors use the table to show the effect of dataset size and number of samples on student performance in the MATH dataset. Results show that increasing the number of samples from 32 to 64 improves performance across all pass@k metrics, while increasing the dataset size from 32 to 128 questions generally leads to lower performance, especially at higher k values.

The authors use a table to compare the correctness and error types of synthetic questions generated by different teacher models. Results show that grounded rewards produce questions with higher well-posedness and correctness rates compared to intrinsic and base models, while intrinsic rewards yield the highest correctness but with significantly higher ambiguity and logic errors. The data indicates that question structure and coherence are more important than answer correctness for improving student performance.

The authors use a meta-RL framework to train a teacher model that generates synthetic problems to improve student performance on hard datasets. Results show that the teacher-generated questions significantly outperform baselines, with the best-performing methods achieving higher pass@k accuracy on MATH and HARP compared to direct training on real data.
