Command Palette
Search for a command to run...
LLMs가 LLMs를 개선하다: 테스트 시간 확장을 위한 Agentic 발견
LLMs가 LLMs를 개선하다: 테스트 시간 확장을 위한 Agentic 발견
초록
테스트 타임 스케일링(Test-time scaling, TTS)은 추론 단계에서 추가적인 연산을 할당하여 대규모 언어 모델(Large Language Model, LLM)의 성능을 향상시키는 효과적인 방법으로 자리 잡았습니다. 그러나 기존 TTS 전략은 대부분 수작업으로 설계된 방식에 의존해 왔습니다. 즉, 연구자들은 직관에 기반하여 추론 패턴을 수동적으로 설계하고 휴리스틱(heuristic)을 튜닝해 왔으며, 이로 인해 연산 할당 영역의 상당 부분이 미탐구 상태로 남아 있었습니다.우리는 연구자가 설계하는 대상을 개별 TTS 휴리스틱에서 TTS 전략이 자동 발견될 수 있는 환경으로 전환하는 환경 기반 프레임워크인 AutoTTS를 제안합니다. AutoTTS의 핵심은 환경 구성에 있으며, 발견(discovery) 환경은 제어 공간(control space)을 다룰 수 있게 tractable하게 만들고 TTS 탐색을 위해 저렴하고 빈번한 피드백을 제공해야 합니다. 구체적으로, 우리는 폭-폭신 TTS(width-depth TTS)를 사전 수집된 추론 궤적과 신호(probe signals)에 대한 제어기 합성(controller synthesis) 문제로 형식화합니다. 여기서 제어기는 분기, 계속, 신호 probing, 가지치기(prune), 또는 종료 시점을 결정하며, 반복적인 LLM 호출 없이 저렴하게 평가할 수 있습니다.또한, 탐색을 tractable하게 만들고 세분화된 실행 추적 실행(trace) 피드백을 도입하여 TTS 프로그램이 실패하는 원인을 진단함으로써 발견 효율성을 높였습니다. 수학 추론 벤치마크에 대한 실험 결과, 발견된 전략들은 강력한 수동 설계 베이스라인보다 전반적인 정확도-비용 균형(accuracy-cost tradeoff)을 개선했습니다. 발견된 전략은 unseen 벤치마크와 모델 규모로 일반화되었으며, 전체 발견 과정의 비용은 단 39.9달러와 160분에 불과했습니다. 본 연구의 데이터와 코드는 https://github.com/zhengkid/AutoTTS에서 오픈소스로 공개될 예정입니다.
One-sentence Summary
The authors propose AutoTTS, an environment-driven framework that automatically discovers test-time scaling strategies by formulating width–depth scaling as controller synthesis over pre-collected reasoning trajectories and probe signals, leveraging beta parameterization and fine-grained execution trace feedback to enable tractable search and cheap evaluation without repeated LLM calls.
Key Contributions
- This work introduces AutoTTS, an environment-driven framework that reframes test-time scaling strategy design from manual heuristic tuning to automatic controller synthesis over a structured control space.
- The approach constructs a replay Markov decision process from pre-collected reasoning trajectories and probe signals to enable cheap controller evaluation without repeated model calls. Beta parameterization and fine-grained execution trace feedback further streamline the search process by ensuring tractable control and targeted failure diagnosis.
- Empirical evaluations demonstrate that the automatically discovered controllers generalize across held-out benchmarks and multiple model scales. These results confirm that environment-driven discovery provides a computationally efficient alternative to hand-crafted inference strategies.
Introduction
Test-time scaling enhances large language model performance by dynamically allocating additional computation during inference, making efficient resource distribution essential for balancing accuracy and operational cost. Existing approaches depend on manual heuristics for branching, pruning, and stopping reasoning trajectories, which forces researchers to tune thresholds by intuition and leaves much of the computation-allocation space unexplored. The authors leverage an environment-driven paradigm called AutoTTS to automate this discovery process. By formulating strategy design as controller synthesis over collected reasoning trajectories, they enable an agent to evaluate and refine allocation policies without repeated LLM calls. The framework introduces beta parameterization to keep the search tractable and provides detailed execution trace feedback to help the agent diagnose failures, ultimately discovering strategies that outperform manual baselines at a fraction of the computational cost.
Dataset
- Dataset composition and sources: The provided text does not describe a dataset, data sources, or subset compositions.
- Key details for each subset: No subset sizes, filtering rules, or source metadata are included.
- How the paper uses the data: The snippet implements a control flow mechanism rather than a training or inference pipeline.
- Processing details: The authors use a boolean evaluation to verify that all branches are either finished or abandoned, then break the execution loop when that condition is met.
Method
The authors frame test-time scaling (TTS) as an algorithmic search problem, where the goal is to discover an optimal policy for allocating a finite inference budget across multiple reasoning branches. This policy, or controller, dynamically manages the creation, extension, probing, pruning, and aggregation of branches to maximize accuracy while respecting the computational constraint. The overall framework, as illustrated in the figure below, consists of a discovery loop that iteratively evaluates candidate controllers against pre-collected offline data. This loop operates in a two-phase process: an offline replay environment enables fast and deterministic evaluation of any candidate controller without invoking the base language model (LLM), and a feedback mechanism provides detailed execution traces to guide the discovery agent. The agent, an LLM acting as a controller designer, analyzes the accumulated history of prior proposals, their performance, and their execution behaviors to propose improved controller implementations. This process continues over multiple rounds, with the best-performing controller and its associated hyperparameter selection ultimately chosen.
The core of the method is the controller's state representation and action space. The state st at decision step t encapsulates the entire history of the inference process for a given question q. It includes the number of instantiated branches mt, the set of currently active branches It, the depth (measured in fixed-length token intervals) of each branch ℓt, and the set of probe feedback Ωt that has been revealed so far. The controller's actions are defined by a set of atomic operations: BRANCH to create a new reasoning path, CONTINUE(i) to extend an active branch i, PROBE(i) to read the intermediate answer from branch i, PRUNE(i) to remove a branch, and ANSWER to terminate and produce a final answer. The controller's policy π(⋅∣s,β) maps the current state s and a single hyperparameter β to a distribution over these admissible actions. The parameter β acts as a high-level knob that controls the overall aggressiveness of the budget allocation strategy.
A key design choice is the use of a single hyperparameter β to control all internal decision-making thresholds, a concept known as beta parameterization. This simplifies the search space from a high-dimensional one to a one-dimensional sweep, preventing the agent from discovering brittle, overfit solutions. The controller's internal hyperparameters, such as the number of initial branches, pruning patience, and confidence thresholds, are all determined by smooth, monotonic functions of β. This ensures that as β increases, the controller becomes more aggressive, spending more budget on exploration and deeper reasoning, while maintaining a coherent and interpretable behavior. The discovered controller, termed the Confidence Momentum Controller (CMC), leverages this framework to implement four key mechanisms: a momentum-aware stopping gate that uses an Exponential Moving Average (EMA) of confidence to avoid premature termination on transient spikes; coupled width-depth control that links the decision to spawn new branches with the trend of confidence gain; alignment-aware depth allocation that prioritizes probing branches whose answers align with the current consensus; and a conservative branch abandonment policy that only discards branches after persistent deviation, ensuring at least two active branches are always preserved.
Experiment
The experiments evaluate a discovered reasoning controller across multiple Qwen3 models using offline replay environments, benchmarking it against established handcrafted scaling methods on both in-distribution and held-out mathematical and non-mathematical reasoning tasks. Main results and scaling analyses validate that the controller dynamically allocates computation to productive branches rather than merely cutting costs, while generalization tests confirm its robustness across different model families and task domains. Ablation studies further verify that parameterized budget control and detailed execution traces are critical for preventing search overfitting, ultimately demonstrating that the automated framework uncovers complex, coordinated decision-making strategies that consistently outperform manual design in balancing inference efficiency and accuracy.
The authors evaluate the performance of discovered controllers against handcrafted baselines on held-out benchmarks, demonstrating that the discovered controllers achieve a better accuracy-cost tradeoff across multiple models and tasks. The results show that the discovered controllers generalize well beyond the training set and outperform handcrafted methods in most settings, with significant reductions in token usage while maintaining competitive accuracy. The discovered controllers achieve a better accuracy-cost tradeoff compared to handcrafted baselines on held-out benchmarks. The discovered controllers generalize well to different models and non-math tasks, maintaining competitive performance with reduced token usage. The discovered controllers outperform handcrafted methods in most settings, with notable reductions in token consumption while preserving accuracy.
The authors evaluate a discovered controller against handcrafted baselines on multiple benchmarks, demonstrating that the discovered controller achieves a better accuracy-cost tradeoff across various models and datasets. The results show that the controller generalizes well to held-out environments and outperforms baselines in most settings, while also being robust to changes in model and task domains. Ablation studies indicate that key design choices, such as beta parameterization and execution traces, are critical for effective discovery and generalization. The discovered controller achieves a better accuracy-cost tradeoff compared to handcrafted baselines across multiple benchmarks and models. The controller generalizes well to held-out benchmarks and different model families, indicating robustness beyond the training setup. Beta parameterization and execution traces are essential for effective discovery, with their removal leading to significant performance degradation.
The authors evaluate a discovered controller against handcrafted baselines across multiple models and benchmarks, showing that the discovered controller achieves a better accuracy-cost tradeoff in most settings. The controller generalizes well to held-out benchmarks and outperforms handcrafted methods in terms of both accuracy and inference efficiency, particularly on smaller models. The results indicate that the discovered controller can effectively balance computation allocation to improve performance without excessive token usage. The discovered controller achieves superior accuracy-cost tradeoffs compared to handcrafted baselines across multiple models and benchmarks. The controller generalizes well to held-out datasets, outperforming baselines in most settings and maintaining strong performance on larger models. The discovered controller reduces token consumption significantly while preserving or improving accuracy, indicating efficient computation allocation.
The experiments evaluate discovered controllers against handcrafted baselines across diverse models and held-out benchmarks to validate their generalization capabilities and computational efficiency. Qualitative results demonstrate that the discovered controllers consistently achieve a superior accuracy-cost tradeoff by significantly reducing token consumption while maintaining or improving performance across various domains. Ablation studies further confirm that specific design choices, particularly beta parameterization and execution traces, are critical for driving effective discovery and robust cross-task adaptation. Overall, the findings establish that automated controller discovery provides a more efficient and adaptable alternative to traditional handcrafted methods.