HyperAIHyperAI

Command Palette

Search for a command to run...

DTS: 추론 트리 스케치를 통한 대규모 추론 모델의 성능 향상

Zicheng Xu Guanchu Wang Yu-Neng Chuang Guangyao Zheng Alexander S. Szalay Zirui Liu Vladimir Braverman

초록

대규모 추론 모델(LRMs)은 복잡한 추론 작업에서 뛰어난 성능을 보이지만, 종종 과도한 사고(overthinking) 문제를 겪으며, 추론 비용을 증가시키고 정확도를 저하시킬 수 있는 지나치게 긴 사고 체인(CoT) 추적을 생성한다. 우리의 분석 결과, 추론 길이와 정확도 사이에는 명확한 반비례 관계가 존재함을 확인하였다. 다양한 확률적 디코딩 실험에서 짧은 추론 경로는 일관되게 가장 높은 정확도를 기록하는 반면, 긴 경로는 오류와 반복이 누적되는 경향을 보였다. 이러한 최적의 짧은 추론 경로는 이상적으로 추론 공간의 전체 탐색을 통해 발견할 수 있으나, 트리 구조를 가진 추론 공간은 시퀀스 길이에 따라 지수적으로 증가하므로 포괄적인 탐색은 현실적으로 불가능하다. 이를 해결하기 위해 우리는 추가적인 학습이나 감독 없이도 효율성과 정확도를 동시에 향상시킬 수 있는 모델 독립형 디코딩 프레임워크인 DTS를 제안한다. DTS는 엔트로피가 높은 토큰에서만 선택적으로 분기하여 추론 공간을 요약하고, 조기 중단(early stopping) 기법을 적용하여 가장 짧은 완료된 추론 경로를 선정한다. 이 방법은 최적 해를 근사함으로써 효율적이고 정확한 추론을 가능하게 한다. DeepSeek-R1-Distill-Qwen-7B 및 1.5B 모델을 활용한 AIME2024 및 AIME2025 데이터셋에 대한 실험 결과, DTS는 정확도를 최대 8% 향상시키고, 평균 추론 길이를 23% 감소시키며, 반복 빈도를 12% 줄이는 성과를 보였다. 이는 DTS가 확장 가능하고 효율적인 대규모 추론 모델의 추론을 가능하게 한다는 점을 입증한다.

Summarization

Researchers from Rice University, University of Minnesota, and Johns Hopkins University propose DTS, a model-agnostic decoding framework that reduces overthinking in Large Reasoning Models by selectively exploring high-entropy decision points and prioritizing shorter reasoning paths. DTS improves accuracy by up to 8% and cuts reasoning length by 23% without retraining, enabling more efficient and accurate AI reasoning on complex tasks.

Key Contributions

  • DTS introduces a training-free, model-agnostic decoding framework that reduces overthinking in Large Reasoning Models by selectively exploring high-entropy decision points in the reasoning process.
  • It constructs a compact decoding tree using parallel auto-regressive generation and applies early stopping to identify the shortest complete and accurate reasoning path.
  • Experiments show DTS improves accuracy by up to 8%, reduces average reasoning length by 23%, and decreases repetition frequency by 12% on AIME2024 and AIME2025 benchmarks.

Introduction

Large Reasoning Models (LRMs) excel at complex tasks by generating step-by-step chain-of-thought (CoT) reasoning, but they often suffer from overthinking—producing long, redundant reasoning paths that increase inference cost and hurt accuracy. Prior work has attempted to address this through training-based methods like supervised fine-tuning or reinforcement learning on compressed or length-penalized data, or via adaptive pruning mechanisms. However, these approaches require additional labeled data and training, limiting scalability, while train-free methods often lack consistent performance gains.

The authors leverage the observation that shorter reasoning paths are empirically more accurate, forming a tree-structured reasoning space during autoregressive generation where optimal paths are short but buried in an exponentially large search space. To efficiently approximate the best path without training, they propose DTS (Decoding Tree Sketching), a model-agnostic decoding framework that dynamically constructs a compact reasoning tree at inference time.

  • Uses next-token entropy to selectively branch only at high-uncertainty tokens, reducing search complexity.
  • Applies early stopping to return the shortest completed reasoning path, aligning with the observed accuracy-length anti-correlation.
  • Operates entirely at decoding time with GPU parallelism, enabling training-free, plug-and-play deployment across models.

Method

The authors leverage a novel decoding strategy called Decoding Tree Sketching (DTS) to efficiently identify the shortest reasoning path in Large Reasoning Models (LRMs), capitalizing on the observed anti-correlation between reasoning length and accuracy. Rather than exhaustively exploring the exponentially growing space of all possible reasoning sequences, DTS constructs a pruned decoding tree that selectively expands branches only at high-uncertainty tokens, thereby approximating the optimal short path while maintaining computational feasibility.

The core mechanism of DTS hinges on an adaptive branch function F(x,ξ)F(x, \xi)F(x,ξ) that determines whether to generate a single token or spawn multiple branches at each decoding step. This decision is governed by the entropy H(v)H(v)H(v) of the next-token distribution P(v)=f(x,ξ)P(v) = f(x, \xi)P(v)=f(x,ξ), where fff denotes the LRM. When H(v)τH(v) \geq \tauH(v)τ, indicating high uncertainty, DTS selects the top-KKK most probable tokens to initiate new branches; otherwise, it samples a single token. Formally:

F(x,ξ)={{v1,,vKpv1,,pvKp~K}if H(v)τ,{v1}, v1P(v)if H(v)<τ,F(x, \xi) = \begin{cases} \{ v_1, \dots, v_K \mid p_{v_1}, \dots, p_{v_K} \geq \tilde{p}_K \} & \text{if } H(v) \geq \tau, \\ \{ v_1 \},\ v_1 \sim P(v) & \text{if } H(v) < \tau, \end{cases}F(x,ξ)={{v1,,vKpv1,,pvKp~K}{v1}, v1P(v)if H(v)τ,if H(v)<τ,

where p~K\tilde{p}_Kp~K is the KKK-th largest probability in P(v)P(v)P(v). This entropy-based gating allows DTS to focus computational resources on regions of the reasoning space where the model is uncertain, while proceeding deterministically in confident regions.

As shown in the figure below, the decoding tree grows in a breadth-first manner, with each node representing a token and edges denoting transitions. Branching occurs only at steps t1t_1t1 and t2t_2t2, where entropy exceeds the threshold τ\tauτ, and the top two tokens are selected for expansion. Low-entropy steps proceed linearly, preserving efficiency.

At each time step ttt, DTS maintains a set of active reasoning sequences Tt\mathcal{T}_tTt, initialized as T0=\mathcal{T}_0 = \varnothingT0=. For each sequence ξTt\xi \in \mathcal{T}_tξTt, the model applies F(x,ξ)F(x, \xi)F(x,ξ) to generate next tokens, which are appended to form new sequences. The set is then updated as:

Tt+1={ξviviF(x,ξ), ξTt}.\mathcal{T}_{t+1} = \{ \xi \oplus v_i \mid v_i \in F(x, \xi),\ \xi \in \mathcal{T}_t \}.Tt+1={ξviviF(x,ξ), ξTt}.

This process continues iteratively, with all branches generated in parallel to exploit GPU acceleration, ensuring scalability.

Early termination is triggered as soon as any branch emits the ending token e\langle e \ranglee, following the principle that shorter reasoning paths yield higher accuracy. Formally, DTS stops at step ttt if ξTt1[eξ]\bigvee_{\xi \in \mathcal{T}_t} \mathbb{1}[\langle e \rangle \in \xi]ξTt1[⟨eξ], and returns the first completed sequence as the final output.

An illustrative example is shown in the figure below, where DTS processes the prompt “What’s the area of a rectangle with length 12 and width 9?”. Branching occurs at steps t1t_1t1 and t2t_2t2, generating multiple reasoning paths. The purple branch terminates first with the correct answer “area= 12×9=108”, which is returned as the final output.

The algorithm follows a breadth-first search over the sketched tree, guaranteeing that the shortest valid reasoning path is identified. All active branches are expanded in parallel, enabling efficient and scalable inference without sacrificing the quality of the reasoning output.

Experiment

    Respond strictly in English.

The authors use 100 stochastic decodes per AIME24 problem to evaluate reasoning trajectories, finding that selecting the shortest response yields 76.67% accuracy with significantly fewer tokens than the longest or mean responses. Results show a strong anti-correlation between response length and accuracy, indicating that verbose reasoning degrades performance. This supports the motivation for DTS, which prioritizes shorter, more efficient reasoning paths to improve both accuracy and efficiency.

The authors use the DTS framework to improve reasoning performance and efficiency for DeepSeek-R1-Distill-Qwen models on AIME2024 and AIME2025. Results show DTS consistently increases accuracy by 4% to 8% while reducing response length by 17% to 29% compared to standard inference, with the 7B model achieving a 7.66% average accuracy gain and 22.96% length reduction. These improvements hold across both model sizes and datasets, demonstrating DTS’s effectiveness in balancing performance and efficiency without training.

The authors use DTS to reduce endless repetition in reasoning trajectories, showing that it lowers repetition rates across both AIME2024 and AIME2025 benchmarks for 7B and 1.5B models. Results show DTS cuts repetition from 6.7% to 1.3% on AIME2024 for the 7B model and from 26.7% to 6.0% on AIME2025 for the 1.5B model. This confirms DTS effectively prunes repetitive paths by favoring shorter, completed reasoning traces.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp