5時間前

Parshin Shojae Iman Mirzadeh Keivan Alizadeh Maxwell Horton Samy Bengio Mehrdad Farajtabar

概要

最先端の言語モデルの世代交代に伴い、回答を生成する前に詳細な思考過程を出力する「大規模推論モデル（LRM）」が登場している。これらのモデルは推論ベンチマークにおいて性能向上を示しているものの、その基本的な能力、スケーリング特性、および限界については依然として十分に理解されていない。現在の評価は主に確立された数学的およびコーディングのベンチマークに焦点を当て、最終的な解答の正確性を重視している。しかし、この評価パラダイムはデータ汚染（data contamination）の問題を抱えやすく、推論トレースの構造や質に関する洞察を提供するものではない。本研究では、論理構造を一貫させながら組合せ的な複雑さを精密に操作可能なパズル環境を用い、これらの課題を体系的に調査する。この手法により、最終的な解答だけでなく内部の推論トレースを分析可能となり、LRMがどのように「思考」するかについての洞察が得られる。多様なパズルを用いた広範な実験を通じて、最先端のLRMはある複雑さの閾値を超えると完全な精度の崩壊（accuracy collapse）を示すことを明らかにする。さらに、これらのモデルは直感に反するスケーリング限界を示す：問題の複雑さが増すにつれて推論に必要な努力は増加するが、一定の段階を過ぎると、十分なトークン予算が与えられているにもかかわらず、推論努力は減少する。

One-sentence Summary

This work systematically investigates Large Reasoning Models using controllable puzzle environments to analyze internal reasoning traces, revealing that unlike current evaluations emphasizing final answer accuracy, frontier LRMs face complete accuracy collapse beyond certain complexities and exhibit a counterintuitive scaling limit where reasoning effort declines despite an adequate token budget.

Key Contributions

The paper introduces controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of internal reasoning traces alongside final answers to offer insights into how Large Reasoning Models think.
Through extensive experimentation across diverse puzzles, the work shows that frontier Large Reasoning Models face a complete accuracy collapse beyond certain complexities. This evidence addresses gaps in evaluations that primarily focus on final answer accuracy without considering reasoning trace quality.
The study identifies a counter-intuitive scaling limit where reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. This observation clarifies the fundamental capabilities and scaling properties of Large Reasoning Models that remain insufficiently understood.

Introduction

Large Reasoning Models (LRMs) have emerged as powerful tools for complex problem solving by generating detailed thinking processes before answering, yet their fundamental capabilities and scaling properties remain insufficiently understood. Current evaluation paradigms rely on established math and coding benchmarks that often suffer from data contamination and fail to reveal the quality of internal reasoning traces. To address this, the authors leverage controllable puzzle environments that allow precise manipulation of problem complexity while maintaining consistent logical structures. This setup enables a systematic analysis of both final answers and reasoning traces, revealing that frontier LRMs experience complete accuracy collapse beyond specific complexity thresholds and exhibit a counter-intuitive scaling limit where reasoning effort decreases as problems become harder.

Dataset

The authors construct a procedural evaluation benchmark comprising four controllable puzzle environments to test reasoning capabilities. This dataset is not sourced from a static corpus but is generated dynamically based on specific complexity parameters.

Dataset Composition and Sources
- The benchmark includes Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World.
- Difficulty is controlled by adjusting parameters such as disk count, checker count, or block count.
Key Details for Each Subset
- Tower of Hanoi: Defined by $N$ disks. Constraints require moving only the top disk and never placing a larger disk on a smaller one.
- Checker Jumping: Defined by $2n$ checkers (red and blue). Valid moves include sliding into adjacent empty spaces or jumping over one opposite-colored checker without moving backward.
- River Crossing: Defined by $N$ actor-agent pairs. Boat capacity $k$ is set to 2 for $N \le 3$ pairs and 3 for larger sets. Safety rules prevent actors from being with foreign agents without their own agent.
- Blocks World: Defined by $N$ blocks. Initial states divide blocks alphabetically between two stacks. Goal states require interleaving blocks to force complete disassembly and reassembly.
Usage in the Model
- The data is used exclusively for evaluation to measure Pass@k performance.
- The authors compare thinking models against non-thinking counterparts across varying complexity levels.
- No training mixture ratios are applied as the focus is on inference-time reasoning.
Processing and Validation
- Custom simulators track state evolution and validate each move against puzzle constraints.
- Prompts include system instructions, rule definitions, and formatted examples for solution output.
- Validation processes check for peg boundaries, disk positions, move types, and final goal state achievement.

Method

The proposed framework evaluates reasoning capabilities by combining structured prompting with rigorous simulation-based verification. For logic puzzles such as the Tower of Hanoi, River Crossing, and Blocks World, the system employs specific system prompts that define the rules, initial states, and goal configurations. These prompts explicitly instruct the model to generate a sequence of moves, often requiring a structured format such as a list of tuples or specific tags. To ensure validity, custom simulators are integrated into the evaluation pipeline. For example, the River Crossing simulator enforces safety constraints regarding actors and agents, while the Blocks World simulator validates that only the topmost block is moved and checks stack boundaries.

The evaluation process involves parsing the model's generation to separate reasoning from the final result. Refer to the framework diagram for a visualization of this extraction pipeline.

The system distinguishes between the reasoning trace enclosed in <think> tags and the final answer in <answer> tags. Moves extracted from the thought process are used to reconstruct the state transitions, mapping the sequence from the initial state through middle states to the target state. This allows for a granular analysis of the reasoning path, distinct from the final accuracy measurement derived from the answer block. The associated plots demonstrate how reasoning effort and accuracy scale with problem complexity, indicating that deeper reasoning traces often correlate with correct solutions.

To further support complex reasoning, the method leverages algorithmic scratchpads. As shown in the figure below:

This approach allows the model to simulate the problem-solving process step-by-step within its generation, validating moves against specific rules before committing to a sequence. For instance, the framework encourages the use of recursive pseudocode with backtracking to explore potential solutions. This structured algorithmic guidance helps the model adhere to the puzzle constraints during the reasoning phase, ensuring that moves such as jumps or block transfers comply with the defined logic.

Experiment

The study evaluates frontier Large Reasoning Models against non-reasoning counterparts using controlled puzzle environments to systematically analyze performance across varying problem complexities. Results identify three distinct reasoning regimes where reasoning models excel at moderate complexity but ultimately collapse alongside standard models at high complexity, often reducing inference effort counterintuitively. Detailed trace analysis reveals inefficient overthinking on simple tasks and a fundamental inability to execute prescribed logical algorithms, indicating that current reasoning capabilities are limited by training data familiarity rather than pure computational complexity.

ソースPDF

AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助

すぐに使える GPU

最適な料金体系

開始する料金を見る

HyperAI Newsletters

最新情報を購読する

北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします

メール配信サービスは MailChimp によって提供されています

5時間前

Parshin Shojae Iman Mirzadeh Keivan Alizadeh Maxwell Horton Samy Bengio Mehrdad Farajtabar

概要

One-sentence Summary

Key Contributions

The paper introduces controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of internal reasoning traces alongside final answers to offer insights into how Large Reasoning Models think.
Through extensive experimentation across diverse puzzles, the work shows that frontier Large Reasoning Models face a complete accuracy collapse beyond certain complexities. This evidence addresses gaps in evaluations that primarily focus on final answer accuracy without considering reasoning trace quality.
The study identifies a counter-intuitive scaling limit where reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. This observation clarifies the fundamental capabilities and scaling properties of Large Reasoning Models that remain insufficiently understood.

Introduction

Dataset

Dataset Composition and Sources
- The benchmark includes Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World.
- Difficulty is controlled by adjusting parameters such as disk count, checker count, or block count.
Key Details for Each Subset
- Tower of Hanoi: Defined by $N$ disks. Constraints require moving only the top disk and never placing a larger disk on a smaller one.
- Checker Jumping: Defined by $2n$ checkers (red and blue). Valid moves include sliding into adjacent empty spaces or jumping over one opposite-colored checker without moving backward.
- River Crossing: Defined by $N$ actor-agent pairs. Boat capacity $k$ is set to 2 for $N \le 3$ pairs and 3 for larger sets. Safety rules prevent actors from being with foreign agents without their own agent.
- Blocks World: Defined by $N$ blocks. Initial states divide blocks alphabetically between two stacks. Goal states require interleaving blocks to force complete disassembly and reassembly.
Usage in the Model
- The data is used exclusively for evaluation to measure Pass@k performance.
- The authors compare thinking models against non-thinking counterparts across varying complexity levels.
- No training mixture ratios are applied as the focus is on inference-time reasoning.
Processing and Validation
- Custom simulators track state evolution and validate each move against puzzle constraints.
- Prompts include system instructions, rule definitions, and formatted examples for solution output.
- Validation processes check for peg boundaries, disk positions, move types, and final goal state achievement.

Method

The evaluation process involves parsing the model's generation to separate reasoning from the final result. Refer to the framework diagram for a visualization of this extraction pipeline.

To further support complex reasoning, the method leverages algorithmic scratchpads. As shown in the figure below:

Experiment

ソースPDF

AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助

すぐに使える GPU

最適な料金体系

開始する料金を見る

HyperAI Newsletters

最新情報を購読する

北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします

メール配信サービスは MailChimp によって提供されています

Command Palette

思考の幻想：問題複雑性の視点から推論モデルの強みと限界を理解する

Parshin Shojae Iman Mirzadeh Keivan Alizadeh Maxwell Horton Samy Bengio Mehrdad Farajtabar

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters

Command Palette

思考の幻想：問題複雑性の視点から推論モデルの強みと限界を理解する

Parshin Shojae Iman Mirzadeh Keivan Alizadeh Maxwell Horton Samy Bengio Mehrdad Farajtabar

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters

Command Palette

思考の幻想：問題複雑性の視点から推論モデルの強みと限界を理解する

Parshin Shojae Iman Mirzadeh Keivan Alizadeh Maxwell Horton Samy Bengio Mehrdad Farajtabar

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters