One month ago

TEMPO: Scaling Test-time Training for Large Reasoning Models

Qingyang Zhang Xinke Kong Haitao Wu Qinghua Hu Minghao Wu Baosong Yang Yu Cheng Yun Luo Ganqu Cui Changqing Zhang

Abstract

Test-time training (TTT) adapts model parameters to unlabeled test instances during inference, continually extending the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and fail to benefit from additional test-time compute. Without external calibration, the self-generated reward signal drifts progressively as the policy model evolves, causing both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we show that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.

One-sentence Summary

By formalizing an alternating procedure of policy refinement and periodic critic recalibration through the Expectation-Maximization algorithm, the TEMPO framework prevents reward drift and diversity collapse to scale test-time training for large reasoning models, improving OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%.

Key Contributions

  • The paper introduces TEMPO, a test-time training framework that utilizes an alternating actor-critic optimization to prevent reward drift and diversity collapse in large reasoning models.
  • This work formalizes the test-time training process through the Expectation-Maximization (EM) algorithm, identifying that prior methods fail because they omit the crucial E-step of periodic critic recalibration on labeled data.
  • Experimental results demonstrate that TEMPO enables sustained performance improvements across diverse model families, such as increasing OLMO3-7B accuracy from 33.0% to 51.1% on the AIME 2024 benchmark while maintaining high output diversity.

Introduction

Large reasoning models (LRMs) often rely on static parameters that cannot incorporate new knowledge acquired during inference. Test-time training (TTT) attempts to solve this by adapting model parameters on unlabeled test data to extend reasoning capabilities. However, existing TTT methods rely on heuristic, self-generated reward signals that cause performance to plateau and output diversity to collapse as the model's internal rewards drift from true correctness. The authors leverage an Expectation-Maximization (EM) framework to propose TEMPO, a TTT method that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By reintroducing this crucial recalibration step, TEMPO provides a stable training signal that enables sustained performance gains and maintains high output diversity across various reasoning tasks.

Method

The authors propose Test-time Expectation-Maximization Policy Optimization (TEMPO), a framework that lets large reasoning models (LRMs) keep improving during the test phase by alternating between two core modules, critic calibration and policy refinement, inspired by the Expectation-Maximization (EM) algorithm. The overall architecture operates in an iterative loop in which the model uses both labeled and unlabeled data to refine its behavior.
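For readers less familiar with EM, the textbook decomposition below shows why dropping one of the two steps matters. It uses generic notation that is an assumption here, not the paper's: $o$ is the observed variable, $z$ the latent variable, and $q$ an auxiliary distribution. How TEMPO maps questions, responses, and correctness onto these symbols is not spelled out in this summary.

```latex
\log p_\theta(o)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log p_\theta(o, z) - \log q(z)\right]}_{\mathrm{ELBO}(q,\,\theta)}
  + \mathrm{KL}\!\left(q(z)\,\middle\|\,p_\theta(z \mid o)\right)

% E-step: with \theta fixed, move q toward p_\theta(z \mid o); the KL term shrinks and the bound tightens.
% M-step: with q fixed, choose \theta to maximize \mathrm{ELBO}(q, \theta).
```

Because the KL term is non-negative, the ELBO never exceeds the log-likelihood. If only M-steps are run while $\theta$ keeps changing, $q$ becomes stale and the bound loosens, which matches the summary's description of prior methods as variants that omit the recalibration step.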

The process begins by initializing both the policy and the critic with a reinforcement learning with verifiable rewards (RLVR) procedure on the labeled dataset $D_L$. The framework then alternates between two distinct steps. In the E-step, referred to as critic calibration, the critic model is updated so that its reward predictions remain grounded in external supervision. This is achieved by training the critic $V_{\phi}(x, y_t)$ on the labeled data $D_L$, where it learns to predict the correctness of generated responses at the token level. The critic is optimized by minimizing the mean squared error (MSE) between its predictions and the ground-truth binary correctness indicators, so that it provides a reliable and calibrated measure of response quality. The resulting critic serves as a surrogate for the posterior distribution over correct responses, enabling the model to reweight its own generations.
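The critic calibration objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `critic_mse_loss` is hypothetical, and it assumes every token of a response is regressed toward that response's single binary correctness label, which is one plausible reading of "token level".

```python
from typing import List


def critic_mse_loss(token_values: List[List[float]], correct: List[int]) -> float:
    """Mean squared error between per-token critic values and binary labels.

    token_values[i][t] is the critic's prediction V_phi(x_i, y_t) for token t
    of response i; correct[i] is 1 if response i is correct, else 0.
    Assumption: each token shares the response-level label as its target.
    """
    total, count = 0.0, 0
    for values, label in zip(token_values, correct):
        for v in values:
            total += (v - float(label)) ** 2
            count += 1
    # Avoid division by zero when no tokens are supplied.
    return total / count if count else 0.0


# Two labeled responses: the first is correct, the second is not.
values = [[0.5, 0.75, 1.0], [0.5, 0.25]]
labels = [1, 0]
loss = critic_mse_loss(values, labels)
print(loss)
```

In a real training loop this scalar would be minimized with respect to the critic parameters $\phi$ by gradient descent; the sketch only computes the loss value.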

In the M-step, or policy refinement, the model uses the calibrated critic to guide its own self-improvement on unlabeled test data. The policy parameters $\theta$ are updated by maximizing a weighted maximum likelihood objective, where the weights are derived from the critic's prediction at the final token of the response, $V_{\phi}(x, y_T)$. This objective is implemented via a policy gradient framework, in which the critic's final value stands in for the ground-truth reward $R$ of the entire response trajectory. To stabilize training, the critic's intermediate value predictions on each prefix $y_{1:t}$ are used as a baseline $b_t$, and the advantage $A_t$ for each token is computed as the difference between the final reward and the baseline, $A_t = R - V_{\phi}(x, y_{1:t})$. This advantage signal is then used to update the policy, reinforcing actions that contribute to high-quality outputs.
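The advantage computation above is simple enough to work through by hand. The sketch below follows the formula literally, taking $R = V_{\phi}(x, y_T)$ and $b_t = V_{\phi}(x, y_{1:t})$; the function name `token_advantages` is hypothetical, and any clipping, normalization, or discounting the authors may apply is not covered by this summary and is left out.

```python
from typing import List


def token_advantages(prefix_values: List[float]) -> List[float]:
    """Per-token advantages A_t = R - V_phi(x, y_{1:t}).

    prefix_values[t] is the critic's value for the prefix ending at token t.
    The reward R is taken to be the critic's value at the final token,
    as described in the text; no ground-truth label is used here.
    """
    if not prefix_values:
        return []
    reward = prefix_values[-1]  # R = V_phi(x, y_T)
    return [reward - v for v in prefix_values]


# The critic grows more confident as the response unfolds.
advantages = token_advantages([0.25, 0.5, 0.75])
print(advantages)  # [0.5, 0.25, 0.0]
```

Under this literal reading the last token always receives zero advantage, and earlier tokens are credited by how much the critic's estimate rose after them.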

The data flow of the system is structured to support this alternating process. Unlabeled data $x^u$ is fed into the policy model to generate responses $y^u$, which are then evaluated by the critic model. Labeled data $x^l$ is used to provide rewards $r$ that directly inform the critic calibration process. The optimization flow, indicated by dashed lines, shows that the critic is periodically updated based on the labeled data, and the policy is refined based on the critic's evaluations of its own outputs. This continuous loop of calibration and refinement allows the model to achieve sustained self-improvement on open reasoning problems.
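The alternating schedule can be sketched as a small driver that takes the two steps as callables. Everything here is illustrative: `alternate`, `m_steps_per_round`, and the callable signatures are hypothetical names, and the summary does not state how many policy updates run between critic recalibrations.

```python
from typing import Callable, List


def alternate(
    num_rounds: int,
    m_steps_per_round: int,
    e_step: Callable[[int], None],
    m_step: Callable[[int], None],
) -> List[str]:
    """Interleave critic calibration (E) with policy refinement (M).

    Each round runs one E-step on labeled data, then several M-steps on
    unlabeled data. Returns the order in which the steps were executed.
    """
    trace: List[str] = []
    for round_idx in range(num_rounds):
        e_step(round_idx)  # recalibrate critic on labeled data D_L
        trace.append("E")
        for _ in range(m_steps_per_round):
            m_step(round_idx)  # refine policy on unlabeled questions
            trace.append("M")
    return trace


# Placeholder steps that do nothing, just to show the ordering.
trace = alternate(2, 2, lambda r: None, lambda r: None)
print(trace)  # ['E', 'M', 'M', 'E', 'M', 'M']
```

A frozen-critic baseline corresponds to running the E-step once and then only M-steps, which is the setting the experiments report as stagnating over time.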

Experiment

TEMPO is evaluated across mathematical and general domain reasoning tasks using various base models and benchmarks to validate its effectiveness. The experiments demonstrate that TEMPO achieves sustained scalability beyond standard reinforcement learning ceilings and maintains high output diversity without the reasoning collapse seen in baseline methods. Furthermore, the results confirm the framework's versatility across different reasoning domains and validate that its alternating training design is essential for preventing critic misalignment during self-improvement.

The authors evaluate TEMPO across multiple models and benchmarks, demonstrating that it consistently outperforms baselines in both mathematical and general reasoning tasks. Results show that TEMPO achieves significant improvements over zero-RL baselines, maintains high output diversity, and sustains performance gains through test-time training without plateauing. The method's effectiveness is attributed to its alternating training design, which prevents reward signal drift and enables continuous self-improvement.

  • TEMPO consistently outperforms baselines across model scales and benchmarks, achieving substantial gains in accuracy and pass@k metrics.
  • TEMPO preserves output diversity during test-time training, avoiding the collapse seen in other methods that converge to narrow reasoning patterns.
  • The alternating training design in TEMPO is essential for sustained improvement, as a frozen critic leads to performance stagnation over time.

The authors evaluate TEMPO on general reasoning tasks beyond mathematical reasoning, comparing it against baselines such as PPO, TTRL, and EMPO. Results show that TEMPO achieves significant improvements across multiple benchmarks and models, particularly on complex domains like GPQA-Diamond, while maintaining high diversity in outputs. The method demonstrates robust performance gains even when starting from a converged model, indicating its effectiveness in leveraging test-time data for sustained capability improvement.

  • TEMPO achieves substantial improvements across diverse reasoning tasks, including BigBenchHard, AGI Eval, ZebraLogic, and GPQA-Diamond, outperforming baselines like TTRL and EMPO.
  • The method maintains high output diversity, avoiding the collapse seen in other self-training approaches, which leads to consistent gains in pass@k metrics.
  • TEMPO continues to improve beyond convergence points, demonstrating that test-time training on novel data enables performance gains beyond the limits of standard RLVR.
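Since the results above lean on pass@k, a short sketch of how that metric is commonly computed may help. The code uses the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them are correct. Whether the authors use exactly this estimator is an assumption; the summary does not describe their evaluation protocol.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 4 samples, 1 correct: drawing 2 of them hits the correct one half the time.
score = pass_at_k(n=4, c=1, k=2)
print(score)  # 0.5
```

The metric rewards having at least one correct answer among several attempts, so a policy that has collapsed onto a single reasoning pattern tends to gain little as k grows. This is why pass@k is a natural companion to the diversity claims above.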

TEMPO is evaluated across various model scales and reasoning benchmarks, including mathematical and general reasoning tasks, to compare its performance against standard reinforcement learning and test-time training baselines. The experiments demonstrate that TEMPO consistently improves reasoning capabilities and maintains high output diversity without the performance collapse or stagnation seen in other methods. These results suggest that the alternating training design effectively prevents reward signal drift and enables sustained self-improvement through test-time training.

