TAPS: Task-Aware Proposal Distributions for Speculative Sampling
Mohamad Zbib Mohamad Bazzi Ammar Mohanna Hasan Abed Al Kader Hammoud Bernard Ghanem
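Abstract
Speculative decoding accelerates autoregressive generation by having a lightweight draft model propose future tokens, which a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across all decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts, and merged-tree verification achieves the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence yields much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on the draft architecture but also on the match between draft training data and the downstream workload, and that specialized drafters are better combined at inference time than in weight space.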
One-sentence Summary
Researchers from KAUST and the American University of Beirut propose TAPS, showing that task-specific training of HASS and EAGLE-2 drafters yields longer speculative decoding acceptance lengths on matched workloads, and that combining specialized drafters at inference time via confidence-based routing or merged-tree verification outperforms naive weight averaging across domains such as math and conversation.
Key Contributions
- The paper introduces an empirical analysis showing that task-specific training of draft models yields clear specialization, where MathInstruct-trained drafters excel on reasoning benchmarks while ShareGPT-trained drafters perform best on MT-Bench.
- This work demonstrates that combining specialized drafters at inference time outperforms naive weight-space averaging: confidence-based routing improves over single-domain drafts, and merged-tree verification achieves the highest acceptance length across both HASS and EAGLE-2 backbones.
- Results indicate that confidence is a more effective routing signal than entropy for benchmark-level decisions: rejected tokens tend to have higher entropy, but confidence separates the benchmarks more clearly when selecting a drafter.
Introduction
Autoregressive generation in LLMs faces a significant inference bottleneck that speculative decoding addresses by using a lightweight drafter to propose tokens for parallel verification by a larger target model. While prior work focuses on improving draft architectures or verification procedures, most draft models are trained on broad generic corpora, leaving the impact of training data distribution on acceptance quality under-explored. The authors leverage task-specific training on datasets like MathInstruct and ShareGPT to demonstrate that specialized drafters significantly outperform generic ones on matched benchmarks. They further show that combining these specialists at inference time through confidence-based routing and merged-tree verification yields superior results compared to naive weight averaging or mixed-data training.
Method
The authors leverage a speculative decoding framework where a lightweight draft model proposes future tokens for verification by a larger target LLM. As shown in the framework diagram, the process begins with the Target LLM providing context to the Draft Model. The draft model operates in latent space to propose tokens, which are then passed through an LM Head and sampling layer to generate unverified tokens like $X_{t+1}$ and $X_{t+2}$.
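The draft-then-verify loop described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes greedy decoding, a single chain of drafted tokens rather than a tree, and models represented as plain functions from a token prefix to the next token. The names `speculative_step`, `toy_draft`, and `toy_target` are hypothetical.

```python
from typing import Callable, List, Sequence

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one greedy draft-then-verify step and return the tokens committed."""
    # 1) The drafter proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target checks every drafted position. A real system does this in
    #    one parallel forward pass; the loop here is only for readability.
    committed: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            committed.append(expected)
            return committed
        committed.append(tok)
        ctx.append(tok)

    # All drafts accepted: the target contributes one extra token.
    committed.append(target(ctx))
    return committed


def toy_target(ctx: Sequence[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[int]) -> int:
    """Toy drafter: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens = speculative_step([1], toy_draft, toy_target, k=4)
print(tokens, len(tokens))
```

Here the drafter agrees with the target for two tokens and diverges on the third, so the step commits three tokens: two accepted drafts plus the target's correction. Under one common convention, that count is the acceptance length for the step.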
To enhance the quality of these drafts, the authors explore composition strategies for specialized models. One baseline approach is checkpoint averaging. As illustrated in the figure below, parameters from distinct draft models, such as one trained on ShareGPT data and another on Math data, are combined via point-wise averaging to create a single merged draft model.
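As a minimal sketch of point-wise checkpoint averaging, the example below merges two parameter dictionaries entry by entry. It assumes both checkpoints share the same parameter names and shapes, and it stores each tensor as a flat list of floats to stay dependency-free. The function name `average_checkpoints`, the `weight_a` parameter, and the toy parameter names are hypothetical, not taken from the paper.

```python
from typing import Dict, List

# Assumption: each checkpoint maps a parameter name to a flat list of values.
StateDict = Dict[str, List[float]]


def average_checkpoints(a: StateDict, b: StateDict, weight_a: float = 0.5) -> StateDict:
    """Interpolate two checkpoints point-wise; weight_a=0.5 is a plain average."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    merged: StateDict = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            weight_a * x + (1.0 - weight_a) * y
            for x, y in zip(a[name], b[name])
        ]
    return merged


# Toy stand-ins for a ShareGPT-trained and a math-trained draft checkpoint.
sharegpt_ckpt = {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]}
math_ckpt = {"fc.weight": [3.0, 4.0], "fc.bias": [1.0]}

merged_ckpt = average_checkpoints(sharegpt_ckpt, math_ckpt)
print(merged_ckpt)
```

The merged model costs no more to run than a single drafter, but nothing guarantees that the midpoint between two specialists in weight space behaves like either one.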
Alternatively, the authors investigate inference-time composition strategies that maintain separate specialized checkpoints. In this setting, specialized models generate distinct candidate continuations with associated confidence scores, as depicted in the tree diagrams showing separate branches for different experts.
For inference-time selection, the authors propose confidence routing. This method generates separate draft trees from different checkpoints and selects the tree with the higher mean node confidence before verification, as shown in the routing diagram where the max confidence path is chosen.
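The routing rule above, which picks the draft tree with the higher mean node confidence, might look like the following sketch. It assumes each drafter has already produced a tree and that every node carries a scalar confidence score; the `DraftTree` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DraftTree:
    """Hypothetical container: one draft tree from one specialized checkpoint."""
    name: str
    node_confidences: List[float]  # one confidence score per drafted node


def mean_confidence(tree: DraftTree) -> float:
    """Average confidence over all nodes; an empty tree scores 0.0."""
    if not tree.node_confidences:
        return 0.0
    return sum(tree.node_confidences) / len(tree.node_confidences)


def route_by_confidence(trees: List[DraftTree]) -> DraftTree:
    """Select the tree with the highest mean node confidence before verification."""
    if not trees:
        raise ValueError("need at least one draft tree")
    return max(trees, key=mean_confidence)


math_tree = DraftTree("math", [0.9, 0.8, 0.7])
chat_tree = DraftTree("chat", [0.6, 0.5, 0.4])

chosen = route_by_confidence([math_tree, chat_tree])
print(chosen.name)
```

Only the selected tree is sent to the target model, so the verification cost matches that of a single drafter while the drafting cost grows with the number of specialists.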
A more comprehensive strategy is merged-tree verification. Instead of selecting a single tree, the method packs multiple draft trees under a shared root. This allows the verifier to evaluate candidates from all specialists in a single parallel pass. The flattened merged-tree input preserves ancestry through tree attention masks and depth-based position ids, enabling the verifier to process both specialized subtrees without cross-subtree attention.
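To illustrate how a flattened merged tree can preserve ancestry, the sketch below packs several draft trees under a shared root and derives a tree attention mask and depth-based positions. It assumes each tree is given as a parent-index list with parents listed before their children, and it reports depth relative to the root (a real system would add the prefix length). The function name `merge_draft_trees` and this input encoding are hypothetical.

```python
from typing import List, Tuple


def merge_draft_trees(
    trees: List[List[int]],
) -> Tuple[List[int], List[int], List[List[bool]]]:
    """Pack draft trees under one shared root.

    Each tree is a list of parent indices local to that tree; -1 means the
    node attaches directly to the shared root. Returns (parents, depths, mask)
    for the flattened sequence, where index 0 is the shared root.
    """
    parents = [-1]  # index 0: shared root
    for tree in trees:
        offset = len(parents)
        for p in tree:
            parents.append(0 if p == -1 else p + offset)

    n = len(parents)
    depths = [0] * n
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        mask[i][i] = True  # every node attends to itself
        p = parents[i]
        if p >= 0:
            depths[i] = depths[p] + 1
            # Inherit the parent's visible set, i.e. all ancestors.
            for j in range(n):
                if mask[p][j]:
                    mask[i][j] = True
    return parents, depths, mask


# Tree A: one node under the root with two children. Tree B: a chain of two.
tree_a = [-1, 0, 0]
tree_b = [-1, 0]

parents, depths, mask = merge_draft_trees([tree_a, tree_b])
print(parents)
print(depths)
```

In this example the last node of tree B sees only itself, its parent, and the shared root; its mask row has no entries for tree A's nodes, which is the absence of cross-subtree attention described above.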
Experiment
- Single-domain training validates that drafters achieve significantly higher acceptance lengths when their training distribution matches the target workload, with mathematical models excelling on reasoning tasks and conversational models on dialogue benchmarks.
- Mixed-data training demonstrates that combining domains improves cross-domain robustness, though the optimal mixture ratio depends on the decoding temperature and does not guarantee uniform generalization.
- Inference-time composition strategies, specifically confidence-based routing and merged-tree verification, substantially outperform weight-space averaging, indicating that keeping specialized models separate and combining them at runtime is more effective than merging parameters.
- Analysis of confidence, entropy, and speculative depth reveals that confidence is the superior signal for routing between specialists, while deeper speculative steps increasingly favor task-matched experts over broad-coverage models.
- The overall conclusion establishes that proposal quality in speculative decoding is a function of both architecture and training distribution, necessitating task-aware drafting and dynamic composition rather than static, averaged checkpoints.
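The two signals compared in the analysis above can be computed from a drafter's next-token distribution. This sketch assumes confidence means the maximum token probability and entropy means Shannon entropy in nats; the paper may define confidence differently, so treat these definitions and function names as illustrative.

```python
import math
from typing import Sequence


def confidence(probs: Sequence[float]) -> float:
    """Assumed definition: probability of the most likely token."""
    return max(probs)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


peaked = [0.97, 0.01, 0.01, 0.01]  # drafter is nearly certain
flat = [0.25, 0.25, 0.25, 0.25]    # drafter is maximally unsure over 4 tokens

print(confidence(peaked), entropy(peaked))
print(confidence(flat), entropy(flat))
```

A peaked distribution has high confidence and low entropy, while a flat one has the reverse, so the two signals are related but not interchangeable once distributions have more complicated shapes.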