4 months ago

Liyan Xu Mo Yu Fandong Meng Jie Zhou

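Abstract

This work stems from prior complementary observations on the dynamics of Chain-of-Thought (CoT): Large Language Models (LLMs) have been shown to plan subsequent reasoning latently before CoT emerges, which diminishes the significance of explicit CoT, yet CoT remains critical for tasks that require multi-step reasoning. To deepen the understanding of the relationship between an LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs by applying our probing method, Tele-Lens, to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without global planning. Leveraging this characteristic, we propose a hypothesis for enhancing uncertainty estimation of CoT, and validate that a small subset of CoT positions can effectively represent the uncertainty of the entire path. We further underline the significance of exploiting CoT dynamics, and demonstrate that CoT bypass can be recognized automatically without performance degradation. Our code, data, and models are released at https://github.com/lxucs/tele-lens.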

One-sentence Summary

Liyan Xu et al. from Tsinghua University propose Tele-Lens to probe LLMs’ latent planning, revealing myopic reasoning; they show sparse CoT positions suffice for uncertainty estimation and enable CoT bypass without performance loss, advancing efficient reasoning in multi-step tasks.

Key Contributions

We introduce Tele-Lens, a probing method that analyzes LLM hidden states across 12 diverse datasets to reveal that models exhibit a myopic planning horizon, favoring local transitions over global reasoning plans, especially in complex multi-step tasks.
Leveraging this myopic behavior, we propose and validate the “Wooden Barrel” hypothesis: uncertainty in CoT reasoning is best captured by a small subset of pivot positions, achieving up to 6% improvement in uncertainty estimation without full-path computation.
We demonstrate that CoT bypass—automatically skipping unnecessary reasoning steps—can be reliably detected and applied without degrading performance, highlighting the practical value of modeling CoT dynamics for efficiency and calibration.

Introduction

The authors use probing techniques to investigate whether large language models (LLMs) internally plan entire reasoning chains before generating Chain-of-Thought (CoT) outputs, a key question given CoT’s role in enabling complex, multi-step reasoning. Prior work presents conflicting views: some studies suggest that early hidden states encode future reasoning paths, while others argue that CoT remains essential because of the architectural limits of Transformers. The authors introduce Tele-Lens, a low-rank adapter that probes hidden states across 12 diverse tasks, and find that LLMs exhibit a myopic planning horizon: hidden states primarily support local transitions rather than global plans, except in simple tasks where early states hint at a coarse gist of the answer. Building on this, they propose and validate two applications: a “Wooden Barrel” hypothesis for uncertainty estimation (focusing on pivot CoT positions improves accuracy by up to 6%) and a method that automatically bypasses CoT when it is unnecessary, achieving a 16.2% bypass rate with negligible performance loss.

Dataset

The authors use a diverse, multi-task dataset spanning 12 tasks grouped into three categories: Explicit Compositional, Implicit Compositional, and Knowledge and Semantic Tasks.

Explicit Compositional Tasks (3 tasks, synthetically generated):
- Parity: Random digit sequences (length 5–100) with a target digit from {1,2,7,8}; the label is the parity of the target digit's count.
- Cycle: Random edge lists (4–100 edges) that form a single cycle or two cycles; the label indicates whether a path exists between two randomly selected vertices.
- Subsum: Random integer lists (length 2–50, values 1–9); the label is the least significant digit of the max subsequence sum, computed via DP.
- All three tasks are fully controllable, with balanced label distributions.

Implicit Compositional Tasks (5 tasks, adapted from existing datasets):
- Math: GSM8K, MATH, AIME. These are originally free-form and are converted to multiple-choice by using GPT-4.1 to generate 4 distractors per problem.
- Logic: MuSR, Zebra. These are natural language reasoning tasks with soft or symbolic constraints.
- MATH uses the MATH-500 test split; AIME’25 contributes all 30 problems to the test set only.

Knowledge and Semantic Tasks (4 tasks, sampled from existing benchmarks):
- CSQA, MMLU, QuALITY, GPQA: these focus on knowledge retrieval and semantic understanding.
- QuALITY uses RAG-style snippets (max 2K context) for efficiency.
- All multiple-choice tasks have answer options shuffled to reduce positional bias.

Dataset Splits and Processing:
- Each task has up to 4K train / 100 dev / 500 test problems.
- Train/dev splits for non-synthetic tasks sample from original test sets; if insufficient, draw from train/dev sets.
- Final answer probing uses a fixed 20-token label set: {A–E, F, YES, NO, even, odd, 0–9}.

Model Use and Metadata:
- The collected data is used to train one Tele-Lens adapter per Transformer layer (rank 256) for ~5K steps with early stopping.
- Hidden states are collected from CoT rollouts (max length 16,384 for test; 5–10% sampled for train/dev to reduce storage).
- Dataset sizes per layer: Off-the-Shelf LLM has 2.4M train / 81K dev / 11M test hidden states; In-Domain LLM has 2.5M / 57K / 2.7M.
- Labels encode teleological dimensions: next token ID, final answer token, CoT length, etc.
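To make the synthetic tasks concrete, the following is a minimal sketch of how Parity and Cycle problems could be generated from the ranges listed above. It is not the authors' generator: the function names, the 50/50 choice between one and two cycles, and the rule that the label is YES exactly when both query vertices lie on the same cycle are assumptions made for illustration. The sketch also does not enforce the balanced label distribution the dataset description mentions.

```python
import random

def make_parity_example(rng, min_len=5, max_len=100, targets=(1, 2, 7, 8)):
    """Return (digits, target, label) for one Parity problem."""
    length = rng.randint(min_len, max_len)
    digits = [rng.randint(0, 9) for _ in range(length)]
    target = rng.choice(targets)
    # Label is the parity of how often the target digit occurs.
    label = "even" if digits.count(target) % 2 == 0 else "odd"
    return digits, target, label

def make_cycle_example(rng, min_edges=4, max_edges=100):
    """Return (edges, (u, v), label) for one Cycle problem.

    Assumption: the graph is one cycle or two disjoint cycles, and the
    label is YES when a path connects the two query vertices.
    """
    n = rng.randint(min_edges, max_edges)  # disjoint cycles: #edges == #vertices
    vertices = list(range(n))
    rng.shuffle(vertices)
    dual = n >= 6 and rng.random() < 0.5   # two cycles need >= 3 vertices each
    if dual:
        cut = rng.randint(3, n - 3)
        groups = [vertices[:cut], vertices[cut:]]
    else:
        groups = [vertices]
    edges, component = [], {}
    for gid, group in enumerate(groups):
        for i, v in enumerate(group):
            edges.append((v, group[(i + 1) % len(group)]))  # close the cycle
            component[v] = gid
    rng.shuffle(edges)
    u, v = rng.sample(vertices, 2)
    label = "YES" if component[u] == component[v] else "NO"
    return edges, (u, v), label

rng = random.Random(0)
digits, target, parity_label = make_parity_example(rng)
edges, query, cycle_label = make_cycle_example(rng)
print(len(digits), target, parity_label)
print(len(edges), query, cycle_label)
```

A balanced dataset could be obtained from such generators by rejection sampling until each label reaches its quota, which is one simple way to keep the tasks fully controllable.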

Method

The authors leverage a probing mechanism called Tele-Lens to extract teleological signals from intermediate hidden states of large language models during chain-of-thought (CoT) reasoning. This method extends the Logit Lens paradigm by introducing a low-rank adapter with nonlinearity to transform hidden states into vocabulary-scale predictions while minimizing computational overhead and overfitting. For each token $t_i$ in a reasoning trajectory $T = \{t_1, t_2, \ldots, t_n\}$, the hidden state $H_i^k$ at layer $k$ is transformed via a bottleneck adapter into $\widetilde{H}_i^k$, which is then projected through the frozen language model head to yield a probability distribution $\mathcal{P}_i^k$ over the vocabulary $\mathcal{V}$:

$$
\begin{aligned}
\widetilde{H}_i^k &= \mathrm{GeLU}\Big(\big(H_i^k + \mathrm{Emb}^k(\delta)\big)\, A^k\Big)\, B^k \\
\mathcal{P}_i^k\big(\mathcal{V} \mid t_i, A^k, B^k, \mathrm{Emb}^k, \delta\big) &= \mathrm{Softmax}\big(\widetilde{H}_i^k L\big)
\end{aligned}
$$

Here, $A^{k} \in \mathbb{R}^{d \times r}$ and $B^{k} \in \mathbb{R}^{r \times d}$ are low-rank adapter matrices, and $\mathrm{Emb}^{k}$ optionally encodes positional offsets $\delta$ to predict future tokens. The frozen LM head $L \in \mathbb{R}^{d \times |\mathcal{V}|}$ ensures alignment with the model’s native output space.
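The two equations can be traced in a few lines of code. The sketch below is a toy, dependency-free forward pass under stated assumptions: the dimensions, the random weights, and the helper names (`matmul`, `rand_matrix`, `tele_lens_probe`) are hypothetical, and the exact erf-based GeLU is used because the summary does not say which GeLU variant the authors apply. It is not the released implementation.

```python
import math
import random

def matmul(X, W):
    """Multiply an (n x m) matrix by an (m x p) matrix, both nested lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

def gelu(x):
    # Exact GeLU via the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def tele_lens_probe(H, A, B, L, offset_emb=None):
    """H: n x d hidden states; A: d x r; B: r x d; L: d x |V| frozen LM head.

    offset_emb is the length-d vector Emb(delta); pass None for final-answer
    probing, where the offset embedding is omitted.
    """
    if offset_emb is not None:
        H = [[h + e for h, e in zip(row, offset_emb)] for row in H]
    bottleneck = [[gelu(v) for v in row] for row in matmul(H, A)]  # n x r
    H_tilde = matmul(bottleneck, B)                                # n x d
    return [softmax(row) for row in matmul(H_tilde, L)]            # n x |V|

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumed): d=8, r=2, |V|=20, n=3 positions.
rng = random.Random(0)
d, r, vocab, n = 8, 2, 20, 3
H = rand_matrix(rng, n, d, 1.0)
A, B, L = rand_matrix(rng, d, r), rand_matrix(rng, r, d), rand_matrix(rng, d, vocab)
emb = [rng.gauss(0.0, 0.1) for _ in range(d)]
P = tele_lens_probe(H, A, B, L, offset_emb=emb)
print(len(P), len(P[0]), round(sum(P[0]), 6))
```

Only `A`, `B`, and the offset embedding would be trained in such a setup; keeping `L` frozen is what ties the probe's output to the model's native vocabulary space.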

Tele-Lens probes three distinct teleological dimensions per hidden state: subsequent token prediction (using offset-aware embeddings), reasoning length estimation (via a regression head on $\widetilde{H}_i^k$ ), and direct final answer prediction (omitting $\mathrm{Emb}^{k}$ ). This enables fine-grained analysis of how internal representations evolve toward task completion.

As shown in the figure below, the probability distribution across CoT positions reveals how Tele-Lens identifies critical reasoning steps — for instance, detecting the final count of digit '8' in a sequence and aligning high confidence with the concluding assertion. This illustrates how the method captures not just token-level predictions but also the teleological structure of reasoning.

The framework operates on two types of LLM backbones: off-the-shelf models like Qwen3-32B, which natively support CoT, and in-domain LLMs trained via GRPO reinforcement learning from Qwen2.5-7B-Instruct. The latter provides a controlled environment to study task-specific reasoning dynamics without confounding factors from general-purpose architectures.

Refer to the framework diagram above, which illustrates a CoT trace for a graph pathfinding problem. The model explicitly enumerates edge traversals from source to target vertex, culminating in a binary answer. Tele-Lens can probe each intermediate step — such as the state after step 31 — to predict the final answer or estimate remaining reasoning length, thereby exposing the internal planning structure embedded in the hidden states.

Experiment

LLMs exhibit a myopic planning horizon, with precise final-answer planning emerging only near the end of reasoning, not at the start, especially for compositional tasks like Parity and Cycle.
Early hidden states may show coarse signals hinting at the answer gist, particularly in semantic tasks like CSQA, but these signals reflect vague perception rather than structured planning and yield lower accuracy than direct answering or full CoT.
LLMs show limited foresight over subsequent reasoning steps, with prediction accuracy declining sharply beyond the next two tokens, except in structurally modular tasks where patterns are discernible.
Global reasoning length is poorly predicted early in CoT; apparent correlations in some tasks stem from observable input heuristics rather than genuine planning.
Training an in-domain LLM yields shorter, more decisive CoT trajectories and competitive performance despite smaller scale, validating effective induction of stable reasoning paths.
Uncertainty in CoT can be better estimated by focusing on a few critical “pivot” tokens rather than averaging across the full trajectory, improving calibration significantly.
Early CoT signals can identify when CoT is unnecessary, enabling safe bypass for simpler tasks without harming overall accuracy, reducing computational load.
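The bypass finding can be illustrated with a simple confidence gate. The summary does not specify the decision rule, so the sketch below assumes a hypothetical one: skip CoT when the early-position probe assigns at least `threshold` probability to a single answer label. The function names, the threshold value, and the toy probabilities are all invented for illustration.

```python
def should_bypass_cot(answer_probs, threshold=0.9):
    """answer_probs: dict mapping answer label -> probe probability.

    Returns (bypass, label): bypass is True when the most likely label
    reaches the (assumed) confidence threshold.
    """
    label, confidence = max(answer_probs.items(), key=lambda kv: kv[1])
    return confidence >= threshold, label

def bypass_rate(batch, threshold=0.9):
    """Fraction of problems for which CoT would be skipped."""
    decisions = [should_bypass_cot(p, threshold)[0] for p in batch]
    return sum(decisions) / len(decisions)

# Toy probe outputs (invented, not results from the paper).
batch = [
    {"A": 0.95, "B": 0.05},
    {"A": 0.50, "B": 0.50},
    {"YES": 0.20, "NO": 0.80},
    {"even": 0.91, "odd": 0.09},
]
print(bypass_rate(batch))
```

In practice the threshold would be tuned on a dev split so that accuracy on bypassed problems matches full-CoT accuracy, which is the trade-off behind the reported bypass rate.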

The authors use Tele-Lens to probe LLM hidden states for their ability to predict subsequent tokens along CoT trajectories, revealing that prediction accuracy declines sharply beyond the next one or two steps across most tasks. While structural tasks like Parity and Cycle show slightly more sustained foresight, the overall pattern indicates LLMs lack long-term planning capacity and primarily operate with a myopic horizon. This limited foresight holds across both in-domain and off-the-shelf models, suggesting it is a fundamental characteristic of current LLM reasoning dynamics.
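The foresight measurement described above amounts to grouping probe predictions by offset and computing accuracy per group. The following is a minimal sketch of that bookkeeping; the record format and function name are assumptions, and the toy records are invented solely to exercise the function, not to reproduce any reported numbers.

```python
from collections import defaultdict

def accuracy_by_offset(records):
    """records: iterable of (delta, predicted_token, actual_token).

    Returns {delta: accuracy}, where delta is how many positions ahead
    of the probed hidden state the target token lies.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for delta, predicted, actual in records:
        totals[delta] += 1
        hits[delta] += int(predicted == actual)
    return {delta: hits[delta] / totals[delta] for delta in sorted(totals)}

# Toy records (invented): predictions get worse as the offset grows.
records = [
    (1, "a", "a"), (1, "b", "b"),
    (2, "c", "c"), (2, "d", "x"),
    (3, "e", "y"), (3, "f", "z"),
]
print(accuracy_by_offset(records))
```

A myopic horizon would show up in such a table as high accuracy at the smallest offsets followed by a sharp drop.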

The authors use a top-k pivot selection strategy based on latent signals from CoT trajectories to improve uncertainty estimation, achieving up to 9% absolute AUROC gain over full-path baselines. Results show that focusing on a sparse subset of critical reasoning steps yields more reliable uncertainty metrics than aggregating signals across the entire chain. This approach consistently enhances performance across multiple metrics and model sizes, supporting the hypothesis that reasoning uncertainty is governed by key logical leaps rather than overall token confidence.

The authors use Tele-Lens to extract latent signals from critical positions in CoT trajectories, finding that focusing on a sparse subset of high-confidence tokens significantly improves uncertainty estimation. Results show that selecting just the top 5 pivot positions yields the best calibration, outperforming both full-trajectory averages and standard metrics like perplexity or entropy. This supports the hypothesis that reasoning uncertainty is governed by a few decisive steps rather than the entire chain.

The authors use Tele-Lens to probe latent planning in LLMs by analyzing final-answer prediction accuracy across early CoT positions and transformer layers. Results show that precise final-answer planning is largely absent at the start of reasoning, emerging only near completion for compositional tasks, while semantic tasks may show early but vague predictive signals that do not translate to better task performance. This indicates LLMs operate with a myopic planning horizon, relying on step-by-step exploration rather than global foresight.

The authors use a top-k pivot selection strategy based on internal token-level signals to estimate reasoning uncertainty, finding that focusing on a sparse subset of critical positions significantly improves calibration over full-trajectory averaging. Results show consistent gains across both Qwen3-8B and Qwen3-32B models, with up to 9% absolute improvement in AUROC when using just five pivot tokens. This supports the hypothesis that uncertainty in chain-of-thought reasoning is governed by a few decisive logical steps rather than the entire trajectory.
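The contrast between full-path averaging and top-k pivot selection can be sketched as follows. The summary does not state how pivots are chosen, so this sketch assumes, in the spirit of the barrel metaphor, that the pivots are the k lowest-confidence positions; the authors select pivots from Tele-Lens latent signals, which may differ. The function names and the toy trajectories are invented for illustration.

```python
def full_path_score(confidences):
    """Baseline: mean confidence over every CoT position."""
    return sum(confidences) / len(confidences)

def pivot_score(confidences, k=5):
    """Mean of the k lowest-confidence positions (assumed pivot rule)."""
    pivots = sorted(confidences)[:k]
    return sum(pivots) / len(pivots)

def auroc(scores, labels):
    """Probability that a correct trajectory (label 1) outscores an
    incorrect one (label 0); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy trajectories (invented): the incorrect one is confident almost
# everywhere but has a single very weak step.
correct = [0.9, 0.9, 0.8, 0.9]
incorrect = [0.99] * 9 + [0.2]
labels = [1, 0]

full = [full_path_score(c) for c in (correct, incorrect)]
pivot = [pivot_score(c, k=2) for c in (correct, incorrect)]
print(auroc(full, labels), auroc(pivot, labels))
```

In this toy case the long run of confident tokens masks the weak step under full-path averaging, while the pivot score exposes it, which is the intuition behind focusing on a sparse subset of positions.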

No Global Plan in Chain-of-Thought: Exploring the Latent Planning Horizon of LLMs