
Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

Shobhita Sundaram John Quan Ariel Kwiatkowski Kartik Ahuja Yann Ollivier Julia Kempe

Abstract

Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR, a self-improvement framework that surfaces these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy, and is rewarded according to the student's improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity-collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well-posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to actually solve the hard problems, paving a principled path to escaping reasoning plateaus without additional curated data.

One-sentence Summary

MIT and Meta FAIR researchers propose SOAR, a meta-RL framework enabling LLMs to generate self-curated curricula for problems they initially cannot solve by grounding teacher rewards in student progress, not intrinsic signals, thereby escaping reasoning plateaus without curated data.

Key Contributions

  • SOAR introduces a bi-level meta-RL framework that enables pretrained LLMs to generate synthetic curricula for hard problems they initially cannot solve, using grounded rewards based on measurable student improvement rather than intrinsic proxies.
  • On math benchmarks with near-zero initial success (0/128), SOAR reliably avoids the instability and collapse seen in prior self-play methods by anchoring teacher rewards to real performance gains, not self-consistency or solution quality.
  • Analysis shows that structural quality and well-posedness of generated problems drive learning more than solution correctness, demonstrating that useful stepping stones can emerge without prior solving ability.

Introduction

The authors tackle a core limitation in fine-tuning large language models for reasoning: when initial success rates are near zero, reinforcement learning with verifiable rewards (RLVR) fails due to sparse signals. Prior self-play and curriculum methods rely on intrinsic or proxy rewards—like self-consistency or gradient norms—which often collapse into degenerate or unlearnable tasks, especially in symbolic domains with binary correctness. The authors introduce SOAR, a meta-RL framework where a teacher model generates synthetic problems for a student model, and is rewarded only when the student improves on a small set of real hard problems. This grounded, bilevel approach avoids reward hacking and enables learning even when the model cannot initially solve the target problems, revealing that latent knowledge can be surfaced through self-generated stepping stones without human curation.

Dataset

  • The authors use three math reasoning benchmarks—MATH, HARP, and OlympiadBench—to study learning under sparse, binary rewards from automatic answer verification. These cover problems from AMC, AIME, USA(J)MO, and international Olympiads.

  • For each dataset, they apply a “fail@128” filter: they sample 128 solutions per problem using Llama-3.2-3B-Instruct (with 1024 token budget and temperature 1.0) and retain only problems with 0/128 success rate. This creates a challenging subset where direct training yields minimal gains (a minimal sketch of this filter appears after this list).

  • OlympiadBench: Uses only the 674 English, text-only, automatically verifiable problems. A random 50-50 train/test split is created since the original was a test set.

  • HARP: Uses the full dataset, and also applies a random 50-50 train/test split for the same reason.

  • MATH: To avoid memorization bias (since Llama-3.2-3B-Instruct shows higher accuracy on MATH’s official train split), they draw the initial problem pool from the 5000-problem official test split. After applying fail@128 filtering, they create their own 50-50 train/test split. All training and evaluation use only this internal split.

  • Dataset sizes: Final train/test splits are reported in Table 2. The test sets are kept large (50% of the filtered data) so that performance gains can be measured reliably above stochastic variance.

  • No cropping or metadata construction is mentioned beyond the filtering and splitting. All synthetic data generation and student-teacher training use only the internal training splits; final results are evaluated solely on the held-out test splits.
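
For concreteness, here is a minimal sketch of the fail@128 filtering step described above. The `generate` and `is_correct` callables are illustrative placeholders for the actual sampling and answer-verification utilities, which the text does not specify.

```python
from typing import Callable

def fail_at_k_filter(
    problems: list[dict],
    generate: Callable[[str, int], list[str]],   # prompt, num_samples -> completions
    is_correct: Callable[[str, str], bool],      # completion, gold answer -> bool
    k: int = 128,
) -> list[dict]:
    """Keep only problems the base model fails on in all k sampled attempts."""
    hard = []
    for prob in problems:
        # e.g. temperature 1.0, <=1024-token budget per completion
        completions = generate(prob["question"], k)
        successes = sum(is_correct(c, prob["answer"]) for c in completions)
        if successes == 0:   # 0/k success rate -> retained as a "hard" problem
            hard.append(prob)
    return hard
```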

Method

The authors leverage a teacher-student meta-RL framework, termed SOAR, to enable a pretrained language model to generate its own stepping-stone curriculum for solving difficult problems. The framework operates as an asymmetric self-play system where two models, initialized from the same base model, are trained in a nested loop structure. The teacher model, denoted $\pi_\phi^T$, generates synthetic question-answer pairs $(q, a)_{\text{synthetic}}$, which are used to train the student model, $\pi_\theta^S$. The student's training occurs in an inner loop, where it learns to answer the teacher-generated problems using reinforcement learning. The performance of the student on a set of hard, real-world problems from the target dataset serves as the reward signal for the teacher in the outer loop. This creates a feedback mechanism where the teacher is incentivized to produce synthetic problems that lead to measurable improvement in the student's ability to solve the hard problems, without the teacher ever directly observing these hard problems.
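
The nested structure can be sketched as below; `teacher_generate`, `train_student`, `evaluate`, and `update_teacher` are hypothetical stand-ins for the components described above, passed in as callables so the skeleton stays self-contained.

```python
from typing import Callable

def soar(
    teacher,                            # teacher policy pi_phi^T
    student_baseline,                   # student policy pi_theta^S (baseline copy)
    hard_problems: list[dict],          # real hard problems (never shown to the teacher)
    teacher_generate: Callable,         # teacher -> synthetic (q, a) pairs
    train_student: Callable,            # (student, synthetic data) -> trained student
    evaluate: Callable,                 # (student, problems) -> success rate
    update_teacher: Callable,           # (teacher, data, reward) -> updated teacher
    outer_steps: int = 200,
):
    """Schematic skeleton of SOAR's asymmetric self-play loop (not the paper's code)."""
    for _ in range(outer_steps):
        synthetic_qa = teacher_generate(teacher)                  # outer loop: propose problems
        student = train_student(student_baseline, synthetic_qa)   # inner loop: short RL run
        # Grounded reward: improvement on the hard set relative to the baseline student.
        reward = evaluate(student, hard_problems) - evaluate(student_baseline, hard_problems)
        teacher = update_teacher(teacher, synthetic_qa, reward)   # e.g. an RLOO update
    return teacher
```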

The core of the framework is a bilevel optimization problem, where the objective is to generate a synthetic dataset $\mathcal{X}$ that, when used to train the student, maximizes the student's performance on the target domain. To make this computationally feasible, the authors instantiate this objective as a nested meta-RL loop. The outer loop trains the teacher using RLOO (REINFORCE Leave-One-Out) to generate synthetic question-answer pairs. The inner loop trains the student using standard RLVR (reinforcement learning with verifiable rewards) on the teacher-generated dataset. The key innovation is the grounding of the teacher's reward signal in the student's actual performance on the hard problems, rather than using intrinsic rewards. This black-box reward signal ensures that the teacher is penalized for generating degenerate or unhelpful problems and is rewarded only when the synthetic curriculum leads to genuine learning progress.
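
Written schematically, paraphrasing the description above rather than quoting the paper's notation, the bilevel objective is:

```latex
% Schematic bilevel objective: the teacher \pi_\phi^T proposes a synthetic dataset
% \mathcal{X}; the student \pi_\theta^S is trained on \mathcal{X} in the inner loop;
% the outer loop maximizes the trained student's accuracy on the hard set \mathcal{Q}_R.
\[
\begin{aligned}
\max_{\phi}\quad & \mathbb{E}_{\mathcal{X} \sim \pi_\phi^T}
  \Big[ \operatorname{Acc}\!\big(\pi^S_{\theta^\ast(\mathcal{X})},\, \mathcal{Q}_R\big) \Big] \\
\text{s.t.}\quad & \theta^\ast(\mathcal{X}) \in \arg\max_{\theta}\;
  \mathbb{E}_{(q,a) \in \mathcal{X}}
  \Big[ R_{\mathrm{verify}}\big(\pi^S_\theta(q),\, a\big) \Big]
\end{aligned}
\]
```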

In the outer loop, the teacher generates a batch of $g \cdot n$ synthetic question-answer pairs, which are partitioned into $g$ datasets of size $n$. For each dataset $\mathcal{X}_k$, the inner loop is executed. This involves training the student for a fixed number of steps (10) on $\mathcal{X}_k$ and then evaluating the resulting student policy on a subsampled set of hard questions $\mathcal{Q}_R$ from the original training set. The reward for the dataset $\mathcal{X}_k$ is the average improvement in the student's success rate on $\mathcal{Q}_R$ compared to a baseline student. To mitigate noise, this reward is averaged over $r$ parallel student trainings. The teacher is then updated using the RLOO algorithm based on these dataset-level rewards.
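
As a concrete illustration of the dataset-level RLOO update: the leave-one-out baseline for each of the $g$ datasets is the mean reward of the other $g - 1$ datasets. The reward values below are illustrative numbers only, not results from the paper.

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """REINFORCE Leave-One-Out advantages over g dataset-level rewards."""
    g = rewards.shape[0]
    leave_one_out_mean = (rewards.sum() - rewards) / (g - 1)  # baseline for each dataset
    return rewards - leave_one_out_mean

# Example: improvements averaged over r parallel student trainings, for g = 4 datasets.
rewards = np.array([0.02, 0.00, 0.05, 0.01])
print(rloo_advantages(rewards))  # positive advantage -> dataset helped more than its peers
```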

The inner loop involves training the student on a synthetic dataset $\mathcal{X}_k$ using RLOO. The student is trained for a small number of steps to induce measurable policy movement while keeping the computational cost low. After each inner loop, the student reverts to the baseline policy for the next iteration. So that the curriculum can adapt to an improving student, a promotion mechanism is introduced. A moving average of the teacher rewards is tracked, and when it exceeds a fixed threshold $\tau$, the student baseline is reset to the best-performing student policy from the previous iteration. The synthetic questions accumulated across promotions are stored as the "Promotion Questions" (PQ) dataset for evaluation.
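
The promotion logic can be sketched as follows, assuming the 3-step moving-average window and threshold $\tau = 0.01$ reported in the experiment section; the class and method names are hypothetical.

```python
from collections import deque

class PromotionTracker:
    """Sketch of the promotion mechanism: promote the student baseline when the
    moving average of teacher rewards exceeds a fixed threshold tau."""

    def __init__(self, window: int = 3, tau: float = 0.01):
        self.recent_rewards: deque[float] = deque(maxlen=window)
        self.tau = tau
        self.promotion_questions: list[dict] = []  # accumulated "PQ" dataset

    def update(self, reward: float, best_dataset: list[dict]) -> bool:
        """Record the latest teacher reward; return True when the student is promoted."""
        self.recent_rewards.append(reward)
        moving_avg = sum(self.recent_rewards) / len(self.recent_rewards)
        if len(self.recent_rewards) == self.recent_rewards.maxlen and moving_avg > self.tau:
            # Promote: the caller resets the student baseline to the best student policy;
            # here we store the synthetic questions that led to the promotion.
            self.promotion_questions.extend(best_dataset)
            self.recent_rewards.clear()
            return True
        return False
```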

Experiment

  • SOAR trains teacher-student pairs on MATH and HARP (held-out OlympiadBench for OOD testing), using Llama-3.2-3B-Instruct, with 200 outer-loop steps and n=64 samples per iteration; promotes student if 3-step moving average reward exceeds τ=0.01.
  • Promoted Student (PS) achieves +8.5% pass@32 on MATH and +3.6% on HARP over Hard-Only; Promotion Questions (PQ) yield +9.3% on MATH and +4.2% on HARP, confirming synthetic questions—not training trajectory—drive gains.
  • PQ transfers to OlympiadBench (+6% MATH-PQ, +3% HARP-PQ over Hard-Only), showing cross-dataset generalization despite no OOD optimization.
  • PQ recovers 75% of performance gain from full MATH training (6750 problems) and 50% from HARP; HARP-PQ outperforms 128 real HARP questions and matches 128 real MATH questions.
  • Grounded-T teachers outperform Intrinsic-T and Base-T, with stable student trajectories and higher diversity (Vendi Score 34.91 vs. 10.82 for Intrinsic-T; the metric is sketched after this list); Intrinsic-T exhibits high variance and collapse in 1 of 3 seeds.
  • Synthetic questions need not be correct—only 32.8% of PQ problems are fully correct—but structural coherence and diversity matter more; meta-RL reduces ambiguity errors vs. Base-T.
  • Hard-Only with 4× compute (group size 128) gains only +2.8% pass@32 on MATH, far below PQ gains.
  • Teacher policy itself improves via meta-RL: Grounded-T questions match PQ performance and stabilize student learning curves; promotion mechanism is essential.
  • Multi-turn question generation underperforms single-turn; larger sampled datasets (128 vs. 64) from Grounded-T reduce variance without sacrificing mean performance.
  • Teacher training follows search-exploitation cycles; grounded rewards preserve diversity while intrinsic rewards collapse it during convergence.
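
For reference, the Vendi Score cited in the diversity comparison above is the exponential of the entropy of the eigenvalues of a normalized similarity matrix. The sketch below assumes a precomputed kernel K with unit diagonal (e.g. cosine similarities of question embeddings); how the paper builds that kernel is not specified here.

```python
import numpy as np

def vendi_score(K: np.ndarray) -> float:
    """Effective number of distinct items: exp of the entropy of the eigenvalues of K/n."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

# Identical items give a score near 1; fully distinct items give a score near n.
print(vendi_score(np.ones((4, 4))))  # ~1.0
print(vendi_score(np.eye(4)))        # ~4.0
```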

The authors use SOAR to generate synthetic questions that improve student performance on hard problem sets, with results showing that SOAR-PQ and SOAR-PS methods significantly outperform Hard-Only and Intrinsic-T baselines across all pass@k metrics. The best-performing methods, SOAR-PQ (MATH) and SOAR-PS (HARP), achieve pass@32 scores of 12.0 ± 3.0 and 11.7 ± 1.6 respectively, demonstrating that grounded meta-RL effectively discovers useful curricula that enable learning beyond performance plateaus.
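
The pass@k numbers throughout this section measure whether any of k sampled solutions is correct. The standard unbiased estimator (Chen et al., 2021) is sketched below for context; the paper's exact evaluation protocol may differ.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n (with c correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 128 sampled solutions, 5 correct, evaluated at k = 32.
print(round(pass_at_k(n=128, c=5, k=32), 3))
```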

A further results table shows SOAR-PQ achieving 19.1% pass@32 accuracy, in some cases surpassing training on the full MATH train set, while also demonstrating strong cross-dataset generalization to OlympiadBench.

The authors use the table to show the effect of dataset size and number of samples on student performance in the MATH dataset. Results show that increasing the number of samples from 32 to 64 improves performance across all pass@k metrics, while increasing the dataset size from 32 to 128 questions generally leads to lower performance, especially at higher k values.

The authors use a table to compare the correctness and error types of synthetic questions generated by different teacher models. Results show that grounded rewards produce better-posed questions than intrinsic and base models, while intrinsic rewards yield the highest correctness rate but with significantly more ambiguity and logic errors. The data indicates that question structure and coherence are more important than answer correctness for improving student performance.

Overall, the meta-RL-trained teacher's synthetic questions significantly outperform the baselines, with the best-performing methods achieving higher pass@k accuracy on MATH and HARP than direct training on the real data.

