Recall by Thinking: How Reasoning in LLMs Unlocks Parametric Knowledge
Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart, Jonathan Herzig
Abstract
Reasoning in large language models (LLMs) plays a well-established role in mathematics, code generation, and multi-hop factual questions, yet its effect on simple, single-hop factual questions remains unclear. Since such questions require no step-by-step logical decomposition, the effectiveness of reasoning appears counterintuitive. Our study, however, reveals that enabling reasoning substantially expands the boundary of a model's parametric knowledge recall, unlocking correct answers that were previously effectively unreachable. Why does reasoning aid parametric knowledge recall in the absence of complex reasoning steps? To answer this question, we design a series of hypothesis-driven controlled experiments and identify two key driving mechanisms. The first is a "computational buffer effect," whereby the model uses the generated reasoning tokens to perform latent computation independent of their semantic content. The second is "factual priming," whereby generating topically related facts acts as a semantic bridge that facilitates recall of the correct answer. Crucially, this latter generative self-retrieval mechanism carries an inherent risk: we show that hallucinating intermediate facts during reasoning increases the probability of hallucination in the final answer. Finally, we leverage these findings to demonstrate that prioritizing reasoning trajectories containing hallucination-free factual statements directly improves model accuracy.
One-sentence Summary
Researchers from Google Research, Technion, and Tel Aviv University demonstrate that enabling reasoning in large language models expands parametric knowledge recall for simple questions through computational buffering and factual priming, while revealing that hallucinated intermediate facts significantly degrade final answer accuracy.
Key Contributions
- Enabling reasoning substantially expands the parametric knowledge recall boundary of large language models, unlocking correct answers for simple single-hop questions that are otherwise unreachable.
- Controlled experiments identify two driving mechanisms: a content-independent computational buffer effect and a content-dependent factual priming process where generating related facts acts as a semantic bridge for retrieval.
- A large-scale audit reveals that hallucinated intermediate facts increase the likelihood of final answer errors, while prioritizing hallucination-free reasoning trajectories at inference time significantly improves model accuracy.
Introduction
Reasoning in Large Language Models is well-established for complex tasks like math and coding, yet its value for simple, single-hop factual questions remains counterintuitive since these queries do not require logical decomposition. Prior research has largely focused on how reasoning aids multi-step problem solving or improves probability sharpening for already accessible answers, leaving a gap in understanding how it expands the model's fundamental parametric knowledge boundary. The authors demonstrate that enabling reasoning significantly unlocks correct answers that are otherwise unreachable by leveraging two distinct mechanisms: a content-independent computational buffer effect and a content-dependent factual priming process where the model generates related facts to bridge retrieval gaps. They further reveal that while this generative self-retrieval boosts accuracy, it introduces a risk where hallucinated intermediate facts increase the likelihood of final answer errors, a finding they use to propose inference strategies that prioritize hallucination-free reasoning trajectories.
Dataset
- Dataset Composition and Sources: The authors utilize a subset of the EntityQuestions dataset (Sciavolino et al., 2021), specifically focusing on 24 relations originally categorized by Gekhman et al. (2025).
- Subset Selection Criteria: From the original 24 relations, the team selected only 4 that meet two strict criteria: they must be "Hard to Guess" (where the answer space is large, such as person names) and "Well Defined" (where entity types and answer granularity are unambiguous).
- Data Structure and Processing: Each input sample consists of a question generated from a specific relation template paired with original facts provided as a summary.
- Model Usage: These curated relations serve as the foundation for evaluating the model's ability to handle complex, unambiguous entity queries rather than relying on common defaults.
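The two-criteria relation filter described above can be sketched as a simple selection over annotated relations. The relation names and their annotations below are hypothetical placeholders for illustration, not the actual EntityQuestions relations chosen by the authors:

```python
# Illustrative sketch of the relation-selection step. The relation
# entries below are made up; only the filtering logic mirrors the
# two criteria described in the text.

relations = [
    # large answer space (person names) and unambiguous entity type
    {"name": "author-of",       "hard_to_guess": True,  "well_defined": True},
    # small, guessable answer space -> fails "Hard to Guess"
    {"name": "capital-of",      "hard_to_guess": False, "well_defined": True},
    # vague answer granularity -> fails "Well Defined"
    {"name": "associated-with", "hard_to_guess": True,  "well_defined": False},
]

# Keep only relations satisfying both criteria.
selected = [r["name"] for r in relations
            if r["hard_to_guess"] and r["well_defined"]]
print(selected)
```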
Method
The proposed framework distinguishes between direct answer generation and reasoning-augmented generation. As illustrated in the first figure, the system operates in several modes: "OFF" (direct input to answer), "ON" (input to detailed thought process to answer), and variations involving "Dummy" thoughts which serve as placeholders or control conditions. In the "ON" mode, the model explicitly decomposes the query into steps such as identifying key entities, formulating search queries, and executing a search (simulated or actual) before stating the final answer.

The framework also incorporates scenarios where additional factual context is provided alongside the input question. The second figure demonstrates these variations, including "OFF Facts," where context is given but no thought process is generated, and "ON," where the model performs a detailed retrieval and counting process even when context is available. In the "ON" mode with facts, the reasoning trace includes steps for keyword optimization, information retrieval, and explicit counting or identification of entities (e.g., listing the 1st through 10th King of Nepal) to derive the answer.
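The toggle between these conditions can be sketched as a small prompt builder. The template wording, the filler used for the "Dummy" condition, and the `build_prompt` helper are all illustrative assumptions, not the authors' actual prompts:

```python
# Minimal sketch of the experimental conditions: reasoning OFF,
# reasoning ON, and a "Dummy" control that adds tokens without
# semantic content. Prompt wording is illustrative only.

def build_prompt(question: str, mode: str, facts: str = "") -> str:
    """Assemble the model input for one experimental condition."""
    context = f"Facts:\n{facts}\n\n" if facts else ""
    if mode == "OFF":
        # Direct answering: question -> answer, no thought tokens.
        return f"{context}Question: {question}\nAnswer:"
    if mode == "ON":
        # Reasoning enabled: the model emits a thought process first.
        return (f"{context}Question: {question}\n"
                "Think step by step: identify the key entities, "
                "formulate a search query, and retrieve the relevant "
                "fact before stating the final answer.\nThought:")
    if mode == "DUMMY":
        # Control: placeholder tokens that add compute but no content.
        filler = "... " * 50
        return f"{context}Question: {question}\nThought: {filler}\nAnswer:"
    raise ValueError(f"unknown mode: {mode}")

p_on = build_prompt("Who was the 10th King of Nepal?", "ON")
```

The "Dummy" condition is what isolates the computational buffer effect: it grants the model extra forward passes without injecting any topical content.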

Following the generation of these reasoning traces, a specific data processing module is employed to refine the extracted facts. Since reasoning traces often restate information already present in the question, the authors implement an LLM-based filtering step to remove such redundancies. This process utilizes a model (e.g., Gemini-2.5-Flash) to analyze the "Original Facts" against the input question. The filtering logic dictates that a fact is removed only if all the information it contains is explicitly stated in the question. Conversely, a fact is retained if it provides any new information, details, or context not found in the question, even if it partially repeats the question content.
Furthermore, specific rules are applied to prevent the model from simply memorizing the answer. A fact is removed if it states or implies that the target answer is the solution to the specific question. However, facts that mention the answer in an unrelated context or do not mention the answer at all are preserved. This ensures that the training data captures the reasoning path and external knowledge retrieval rather than just the final answer mapping.
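The filtering rules above can be sketched as follows. In practice the authors delegate these judgments to an LLM (e.g., Gemini-2.5-Flash); the word-overlap and substring predicates below are deterministic stand-ins for that judge, and the explicit `subject` parameter is an assumption added for illustration:

```python
# Sketch of the fact-filtering rules. The real system uses an LLM
# judge; the simple string checks here are illustrative stand-ins.
import string

def content_words(text: str) -> set:
    """Lower-cased tokens with surrounding punctuation stripped."""
    return {w.strip(string.punctuation) for w in text.lower().split()} - {""}

def fully_contained_in_question(fact: str, question: str) -> bool:
    """Remove rule: all information in the fact is already explicit
    in the question (word-overlap stand-in for the LLM judge)."""
    return content_words(fact) <= content_words(question)

def states_answer(fact: str, answer: str, subject: str) -> bool:
    """Remove rule: the fact ties the target answer to the question's
    subject. Facts mentioning the answer in an unrelated context pass."""
    f = fact.lower()
    return answer.lower() in f and subject.lower() in f

def filter_facts(facts, question, answer, subject):
    kept = []
    for fact in facts:
        if fully_contained_in_question(fact, question):
            continue  # restates the question, no new information
        if states_answer(fact, answer, subject):
            continue  # leaks the answer to this specific question
        kept.append(fact)  # adds new context -> retain
    return kept

result = filter_facts(
    ["The film Titanic.",                 # restatement -> dropped
     "James Cameron directed Titanic.",   # leaks the answer -> dropped
     "Titanic was released in 1997.",     # new information -> kept
     "James Cameron was born in Canada."],# answer, unrelated context -> kept
    question="Who directed the film Titanic?",
    answer="James Cameron",
    subject="Titanic",
)
```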
Experiment
- Experiments using hybrid models with reasoning toggled ON or OFF on closed-book QA benchmarks demonstrate that reasoning consistently expands the model's parametric knowledge boundary, unlocking correct answers that remain unreachable without it, particularly at higher sampling depths.
- Analysis reveals that these gains are not primarily driven by decomposing complex multi-hop questions, as reasoning effectiveness remains similar for simple and complex question types, indicating the mechanism facilitates direct factual recall rather than task decomposition.
- Controlled tests validate two complementary mechanisms: a computational buffer effect where generating extra tokens enables latent computation independent of semantic content, and factual priming where recalling related facts creates a semantic bridge to the correct answer.
- Investigations into reasoning traces show that hallucinated intermediate facts systematically reduce the likelihood of a correct final answer, whereas traces containing verified factual statements significantly improve accuracy.
- Practical application of these findings through test-time selection strategies that prioritize traces with factual content and avoid hallucinations yields measurable improvements in model accuracy.
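A rough sketch of such a test-time selection strategy follows: sample several reasoning traces, score each by the fraction of its intermediate facts that a verifier accepts, and answer from the cleanest trace. The `Trace` container and the `verify_fact` callback are hypothetical, and the scoring rule is a simplification, not the authors' actual procedure:

```python
# Sketch of test-time trace selection: prefer the reasoning trace
# whose intermediate facts are hallucination-free. Trace format and
# the verifier are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Trace:
    facts: list = field(default_factory=list)  # intermediate claims
    answer: str = ""                           # final answer of the trace

def score(trace: Trace, verify_fact) -> float:
    """Fraction of intermediate facts the verifier accepts;
    traces with no factual claims score 0."""
    if not trace.facts:
        return 0.0
    return sum(verify_fact(f) for f in trace.facts) / len(trace.facts)

def select_answer(traces, verify_fact) -> str:
    """Answer from the trace with the fewest hallucinated facts."""
    best = max(traces, key=lambda t: score(t, verify_fact))
    return best.answer

# Toy usage: a verifier backed by a small set of known-true facts.
known = {"Nepal abolished its monarchy in 2008."}
traces = [
    Trace(["Nepal became a republic in 1999."], "Birendra"),     # hallucinated date
    Trace(["Nepal abolished its monarchy in 2008."], "Gyanendra"),
]
chosen = select_answer(traces, lambda f: f in known)
```

In a real deployment the set-membership verifier would be replaced by a retrieval- or model-based fact checker; the selection logic itself is unchanged.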