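Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Abstract

Modern large language models achieve impressive reasoning capabilities through long chains of thought (CoT), but they incur substantial computational cost at inference time, which has motivated techniques for improving the performance-to-cost ratio. Among these, speculative decoding accelerates inference by having a fast but less accurate draft model propose tokens autoregressively, which a more capable target model then verifies in parallel. However, traditional token-level speculative decoding struggles on reasoning tasks, because token mismatches within semantically equivalent steps trigger unnecessary rejections. Recent work has shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps. Yet existing step-level methods still regenerate many rejected steps for little improvement, wasting valuable target-model compute. To address this, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically according to the relative advantage between the draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage trains a lightweight router to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always selects the higher-quality step, achieving a near-optimal efficiency-accuracy trade-off. Across multiple mathematical reasoning benchmarks, Arbitrage consistently outperforms prior step-level speculative decoding baselines, reducing inference latency by up to about 2× at matched accuracy.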

One-sentence Summary

UC Berkeley, Apple, ICSI, and LBNL propose ARBITRAGE, a step-level speculative decoding framework that uses a lightweight router to dynamically select higher-quality reasoning steps between draft and target models, improving efficiency by reducing wasteful regeneration and cutting inference latency by up to 2× on mathematical reasoning tasks compared to prior methods.

Key Contributions

  • Existing step-level speculative decoding methods often waste computation by regenerating reasoning steps that yield little quality improvement, as their fixed acceptance thresholds do not account for the relative performance of draft and target models.
  • ARBITRAGE introduces a dynamic routing mechanism that uses a lightweight router to predict when the target model is likely to produce a meaningfully better reasoning step, enabling more efficient use of target model compute.
  • Evaluated on multiple mathematical reasoning benchmarks, ARBITRAGE reduces inference latency by up to ~2× compared to prior step-level methods while maintaining or improving output accuracy.

Introduction

Large language models (LLMs) achieve strong performance on complex reasoning tasks using long chain-of-thought (CoT) generation, but the auto-regressive nature of token decoding creates a memory-bound inference bottleneck, especially for lengthy reasoning sequences. Speculative Decoding (SD) addresses this by using a fast draft model to propose tokens or steps, which a more capable target model verifies in parallel, thereby improving throughput. While step-level SD—verifying entire reasoning steps instead of individual tokens—improves acceptance rates and robustness, existing methods like Reward-guided SD (RSD) rely on absolute quality thresholds to decide when to regenerate with the target model, leading to frequent and often unnecessary regenerations that waste compute without meaningful quality gains.

The authors leverage a key insight: routing decisions should depend not on the draft’s absolute quality, but on the expected advantage of the target model over the draft for a given step. They propose ARBITRAGE, a step-level speculative generation framework that introduces a lightweight router trained to predict when the target model is likely to produce a meaningfully better reasoning step than the draft. This router approximates an ideal ARBITRAGE ORACLE that always selects the higher-quality step, enabling dynamic, advantage-based routing. By avoiding costly target regenerations when gains are marginal, ARBITRAGE reduces redundant computation and improves the efficiency-accuracy trade-off. Experiments show up to ~2× latency reduction over prior step-level SD methods at matched accuracy across mathematical reasoning benchmarks.

Dataset

  • The authors use a step-level dataset constructed from the NuminaMath-CoT dataset, from which 30,000 questions are selected via stratified sampling to serve as the seed for fine-tuning.

  • For each question context x, the draft and target models are decoded from the same prefix to generate paired reasoning steps (z_d, z_t). The authors compute PRM scores s_d and s_t using a fixed PRM model, then calculate the step-level advantage Δ = s_t - s_d and derive the oracle label y = I[Δ > 0], indicating whether using the target model improves output quality.

  • To reduce variance in the oracle signal, multiple target samples may be drawn per context; their PRM scores are averaged to produce a mean target score and a mean advantage, from which the final oracle label y is computed.

  • The resulting training tuples (x, z_d, z_t, s_d, s_t, Δ, y) form a supervised dataset for training the router model.

  • Due to class imbalance—where most draft steps are acceptable (y = 0)—the authors apply random downsampling to the majority class (y = 0) to balance the dataset and mitigate bias toward accepting draft steps.

  • Additional preprocessing includes annotating each step with the model that generated it, normalizing sequence lengths, and standardizing the step separator token to \n\n to ensure consistency between PRM scoring and router inputs.
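The labeling and balancing steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the toy PRM scores, and the choice to downsample to an exact 1:1 class ratio are assumptions made for the example.

```python
import random
from statistics import mean

def oracle_label(s_d, target_scores):
    """Return (advantage, label) for one context.

    s_d: PRM score of the draft step.
    target_scores: PRM scores of one or more target samples from the same prefix.
    """
    s_t_bar = mean(target_scores)   # averaged target score
    delta = s_t_bar - s_d           # step-level advantage
    return delta, int(delta > 0)    # label 1: the target step scores higher

def downsample_majority(rows, seed=0):
    """Randomly drop y=0 rows until both classes have equal size (assumed 1:1)."""
    rng = random.Random(seed)
    pos = [r for r in rows if r["y"] == 1]
    neg = [r for r in rows if r["y"] == 0]
    if len(neg) > len(pos):
        neg = rng.sample(neg, len(pos))
    balanced = pos + neg
    rng.shuffle(balanced)
    return balanced

# Toy (draft score, target scores) pairs, invented for illustration only.
raw = [(0.9, [0.85, 0.80]), (0.4, [0.7, 0.9]), (0.8, [0.8]), (0.95, [0.9]), (0.3, [0.6])]
rows = []
for s_d, s_ts in raw:
    delta, y = oracle_label(s_d, s_ts)
    rows.append({"s_d": s_d, "delta": delta, "y": y})

balanced = downsample_majority(rows)
print([r["y"] for r in rows], len(balanced))
```

In this toy set three of the five draft steps are acceptable (y = 0), so one of them is dropped to leave two examples per class.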

Method

The authors propose ARBITRAGE, a step-level speculative decoding framework that dynamically routes between a lightweight draft model and a more powerful target model based on predicted quality advantage. Unlike reward-guided speculative decoding (RSD), which relies on absolute reward thresholds from a Process Reward Model (PRM), ARBITRAGE introduces a lightweight router that estimates whether regenerating a step with the target model will yield a higher PRM score than the draft’s output, thereby avoiding wasteful target invocations.

At each reasoning step, the draft model generates a candidate step $z_d$ conditioned on the current context $x$, terminating upon emitting a separator token. This step is then evaluated by the ARBITRAGE ROUTER, which outputs a scalar $\hat{y} = h_{\theta_{\text{router}}}(x, z_d)$ representing the predicted likelihood that the target model’s step $z_t$ would outperform $z_d$ under the PRM. The decision to accept or escalate is governed by a tunable threshold $\tau$: if $\hat{y} \leq \tau$, the draft step is accepted; otherwise, the target model regenerates the step from the same prefix.
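The per-step routing rule can be sketched as a short loop. This is only a sketch under stated assumptions: `arbitrage_generate`, the stub models, the `<eos>` stop marker, and the step cap are hypothetical stand-ins, and a real system would call actual draft, target, and router models.

```python
from typing import Callable, List, Tuple

def arbitrage_generate(
    prompt: str,
    draft_step: Callable[[str], str],     # hypothetical: returns one draft step
    target_step: Callable[[str], str],    # hypothetical: returns one target step
    router: Callable[[str, str], float],  # hypothetical: returns y_hat in [0, 1]
    tau: float,
    max_steps: int = 8,
    stop: str = "<eos>",
) -> Tuple[str, List[str]]:
    context = prompt
    trace = []  # which model produced each accepted step
    for _ in range(max_steps):
        z_d = draft_step(context)
        y_hat = router(context, z_d)
        if y_hat <= tau:
            step, source = z_d, "draft"                    # accept the draft step
        else:
            step, source = target_step(context), "target"  # regenerate from the same prefix
        trace.append(source)
        context += step + "\n\n"                           # "\n\n" is the step separator
        if stop in step:
            break
    return context, trace

# Toy stubs: the router flags only the second step for escalation.
def n_steps(ctx: str) -> int:
    return ctx.count("\n\n")

def draft(ctx: str) -> str:
    return f"draft step {n_steps(ctx)}" + (" <eos>" if n_steps(ctx) == 2 else "")

def target(ctx: str) -> str:
    return f"target step {n_steps(ctx)}"

def toy_router(ctx: str, z_d: str) -> float:
    return 0.9 if n_steps(ctx) == 1 else 0.1

text, trace = arbitrage_generate("Q: 1+1?\n", draft, target, toy_router, tau=0.5)
print(trace)
```

Note that the draft step is always generated first; the target model runs only when the router's score exceeds the threshold.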

Refer to the framework diagram, which illustrates the two possible execution paths: in Step 1, the router predicts no advantage from escalation ($\hat{y} < \tau$), so the draft step is accepted and the system proceeds; in Step 2, the router predicts a meaningful advantage ($\hat{y} > \tau$), triggering target regeneration before proceeding.

The router is trained offline to approximate the ARBITRAGE ORACLE, a theoretically optimal but computationally infeasible policy that compares the counterfactual PRM scores $s_d$ and $s_t$ for the same context. The oracle selects the step with higher reward: $z^* = \arg\max_{z \in \{z_d, z_t\}} h_{\theta_{\text{PRM}}}(x, z)$. The advantage $\Delta = s_t - s_d$ quantifies the target’s potential gain over the draft. The oracle’s optimal routing policy is $a^*_\tau = \mathbb{I}\{\Delta > \tau\}$, which maximizes expected quality under a fixed escalation budget.

The paper's comparison figure shows how ARBITRAGE avoids the “wasted regeneration” problem inherent in RSD: under RSD, Step 3 and Step 4 are both regenerated despite yielding lower or equal PRM scores than their draft counterparts, whereas ARBITRAGE regenerates only Step 3, where the predicted advantage is positive, and accepts Step 4 without regeneration, since the router predicts no improvement.
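To make the contrast concrete, the sketch below counts wasted regenerations on toy PRM scores. It is a simplified illustration: `rsd_escalate` models an absolute-threshold rule as "escalate when the draft score falls below a cutoff", which is an assumption standing in for RSD's actual criterion, and all scores are invented.

```python
def rsd_escalate(s_d, threshold):
    """Absolute-threshold rule (simplified stand-in): escalate on a low draft score."""
    return s_d < threshold

def oracle_escalate(s_d, s_t, tau=0.0):
    """Advantage rule: escalate only when Delta = s_t - s_d exceeds tau."""
    return (s_t - s_d) > tau

def wasted(pairs, decisions):
    """Count escalations where the target step did not beat the draft step."""
    return sum(1 for (s_d, s_t), esc in zip(pairs, decisions) if esc and s_t <= s_d)

# Toy (s_d, s_t) pairs, invented for illustration only.
pairs = [(0.9, 0.95), (0.8, 0.85), (0.5, 0.9), (0.4, 0.35), (0.45, 0.45)]

rsd = [rsd_escalate(s_d, 0.6) for s_d, _ in pairs]
oracle = [oracle_escalate(s_d, s_t, tau=0.1) for s_d, s_t in pairs]

print("RSD wasted:", wasted(pairs, rsd))
print("Oracle wasted:", wasted(pairs, oracle))
```

The last two pairs are cases where draft and target both score low: the absolute rule escalates them with nothing to gain, while the advantage rule leaves them with the draft.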

The router’s prediction enables fine-grained control over the compute-quality trade-off via $\tau$, while introducing only a single forward pass per step. This design preserves the efficiency of speculative decoding while significantly reducing redundant computation, as the router approximates the oracle’s advantage-aware decisions without executing the target model during inference.
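As a rough illustration of how the threshold trades compute for quality, the sketch below sweeps the threshold over a handful of toy router scores and reports the fraction of steps escalated to the target model. The scores and the helper name `escalation_rate` are assumptions for the example.

```python
def escalation_rate(router_scores, tau):
    """Fraction of steps whose router score exceeds tau and are sent to the target."""
    return sum(s > tau for s in router_scores) / len(router_scores)

# Toy router outputs for five steps, invented for illustration only.
scores = [0.05, 0.2, 0.35, 0.6, 0.9]

rates = {tau: escalation_rate(scores, tau) for tau in (0.1, 0.5, 0.8)}
for tau, rate in rates.items():
    print(f"tau={tau}: {rate:.0%} of steps escalated")
```

Raising the threshold sends fewer steps to the target model, lowering cost at the risk of keeping weaker draft steps.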

Experiment

  • Empirical analysis shows that at a 70% deferral rate, up to 40% of RSD's target model calls are wasted: they bring no quality gain, either because the regenerated draft step was low-scoring but correct or because the draft and target models share the same failure mode.
  • ARBITRAGE ROUTER is trained using a 1.5B PRM checkpoint with step-level annotations and class-balanced downsampling, achieving higher Spearman correlation (ρ = 0.1673) and balanced accuracy by addressing label imbalance.
  • On MATH500 and OlympiadBench, ARBITRAGE ROUTER outperforms RSD across LLaMA3 (1B/8B), LLaMA3 (8B/70B), and Qwen2.5-Math (3bit-7B/7B), achieving higher accuracy at comparable acceptance rates, closely tracking the oracle.
  • Ablations confirm binary classification with step annotations yields best performance: it improves Spearman correlation and label-1 accuracy over non-annotated and multi-class variants.
  • ARBITRAGE achieves up to 1.62× lower latency on MATH500 and up to 1.97× speedup on OlympiadBench at matched accuracy, demonstrating superior compute-quality trade-off by reducing unnecessary target model invocations.

The authors evaluate the impact of incorporating historical routing context into the ARBITRAGE router, finding that annotated inputs—which include prior model choices—improve both Spearman correlation and accuracy for escalation decisions. Results show that the annotated variant achieves higher correlation (0.1508 vs. 0.1305) and better label-1 accuracy (72.96% vs. 69.58%), indicating that routing history enhances the model’s ability to identify steps where target escalation is beneficial.

The authors evaluate different label granularities for the ARBITRAGE router and find that the 2-class classification variant achieves the highest Spearman correlation with oracle advantage scores and balanced accuracy across both classes. Increasing the number of classes to 4 or 10 reduces overall correlation and introduces label skew, indicating that binary routing provides the most robust trade-off for practical deployment.

The authors evaluate the impact of class-balanced downsampling on router training, finding that balancing the dataset improves Spearman correlation and label-1 accuracy while reducing bias toward the majority accept class. Without downsampling, the router becomes overconfident in accepting draft steps, leading to under-escalation and worse overall routing quality.
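The router metrics discussed in these ablations can be computed as sketched below. This is an illustrative implementation in plain Python: the toy scores are invented, and "label-1 accuracy" is interpreted here as accuracy restricted to examples whose oracle label is 1, which is an assumption about the paper's definition.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5  # undefined if either input is constant

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))

def per_label_accuracy(y_true, y_pred, label):
    """Accuracy restricted to examples whose true label equals `label`."""
    idx = [i for i, y in enumerate(y_true) if y == label]
    return sum(y_pred[i] == label for i in idx) / len(idx)

# Toy router scores and oracle advantages, invented for illustration only.
router_scores = [0.2, 0.8, 0.4, 0.9, 0.1]
deltas = [-0.1, 0.3, 0.05, 0.2, -0.2]

y_true = [int(d > 0) for d in deltas]         # oracle labels
y_pred = [int(s > 0.5) for s in router_scores]

rho = spearman(router_scores, deltas)
acc1 = per_label_accuracy(y_true, y_pred, 1)
acc0 = per_label_accuracy(y_true, y_pred, 0)
balanced_acc = (acc0 + acc1) / 2
print(round(rho, 4), round(acc1, 4), round(balanced_acc, 4))
```

Reporting accuracy per label matters here because an imbalanced dataset lets a router score well overall by always accepting the draft while missing most of the steps that need escalation.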

