Lost in the Noise: Why Reasoning Models Fail Due to Contextual Distractors
Seongyun Lee Yongrae Jo Minju Seo Moontae Lee Minjoon Seo
Abstract
Recent advances in reasoning models and agentic AI systems have increased their reliance on diverse external information. This shift introduces inherently noisy input contexts, a reality that current sanitized benchmarks fail to capture. In this work, we present NoisyBench, a comprehensive benchmark that systematically evaluates model robustness to diverse noise types, including random documents, irrelevant chat histories, and hard negative distractors, across 11 datasets spanning RAG (Retrieval-Augmented Generation), reasoning, alignment, and tool-use tasks. Our evaluation reveals that state-of-the-art models suffer sharp performance drops of up to 80% in the presence of contextual distractors. Critically, agentic workflows tend to amplify errors by placing excessive trust in noisy tool outputs, and noise can induce emergent misalignment even without adversarial intent. Existing approaches, namely prompting, context engineering, supervised fine-tuning (SFT), and outcome-reward-only reinforcement learning, prove insufficient to ensure robustness. In contrast, our proposed Rationale-Aware Reward (RARE) markedly strengthens model resilience by incentivizing the behavior of identifying helpful information within noise. We further uncover an inverse scaling trend in which performance degrades as test-time compute increases, and show through attention visualization that models over-attend to noise tokens. These findings provide essential insights for designing the next generation of robust, reasoning-capable agents.
One-sentence Summary
Seongyun Lee, Yongrae Jo, Minju Seo, Moontae Lee, and Minjoon Seo from KAIST AI, LG AI Research, and University of Illinois Chicago introduce NoisyBench, a comprehensive benchmark evaluating reasoning models under realistic noise, revealing up to 80% performance drops in state-of-the-art models due to contextual distractors. They propose Rationale-Aware Reward (RARE), a novel training method that incentivizes models to identify helpful information within noise, significantly improving robustness by grounding reasoning in relevant content. Unlike outcome-based rewards, RARE enhances resilience by directly supervising the reasoning process, mitigating error amplification in agentic workflows and addressing emergent misalignment. Their findings highlight the critical need for noise-aware evaluation and training in real-world agentic AI systems.
Key Contributions
- Current reasoning models and agentic AI systems perform well on clean, sanitized benchmarks but suffer catastrophic performance drops of up to 80% when exposed to realistic noise such as random documents, irrelevant chat history, and hard negative distractors, revealing a critical gap between benchmark evaluations and real-world deployment.
- The authors introduce NoisyBench, a comprehensive benchmark across 11 datasets spanning RAG, reasoning, alignment, and tool-use tasks, which systematically evaluates model robustness under diverse noise types and uncovers that agentic workflows often amplify errors by over-trusting noisy inputs, leading to emergent misalignment even without adversarial intent.
- To improve resilience, they propose Rationale-Aware Reward (RARE), a training objective that incentivizes models to identify and prioritize helpful information within noisy contexts. RARE significantly outperforms prompting, context engineering, SFT, and outcome-reward RL by increasing distractor filtering and improving final accuracy; attention analysis further reveals that models disproportionately focus on distractor tokens in incorrect predictions.
Introduction
The authors address the growing reliance of agentic AI systems on external information, which introduces real-world noise such as irrelevant documents, faulty tool outputs, and misleading chat histories—challenges not captured by existing clean benchmarks. Prior work has focused on idealized, sanitized environments, leading to an overestimation of model robustness; even non-adversarial noise causes catastrophic performance drops of up to 80% in state-of-the-art models, with agentic workflows amplifying errors by encouraging over-trust in noisy inputs. The authors introduce NoisyBench, a comprehensive benchmark evaluating robustness across 11 datasets in RAG, reasoning, alignment, and tool-use under diverse noise types. They find that standard mitigation strategies—prompting, context engineering, supervised fine-tuning, and outcome-based reinforcement learning—fail to ensure resilience. In contrast, their proposed Rationale-Aware Reward (RARE) significantly improves performance by explicitly rewarding models for identifying and filtering helpful information within noisy contexts, leading to better reasoning grounding and higher accuracy. Analysis reveals that noise induces inverse scaling (worse performance with more computation), increased uncertainty, and attention misdirection toward distractors, highlighting fundamental vulnerabilities in current models.
Dataset
- The dataset NoisyBench comprises eleven diverse datasets across four task categories: Retrieval-Augmented Generation (RAG), reasoning, alignment, and tool usage.
- RAG datasets include SealQA, MultihopRAG, and Musique; reasoning datasets are BBEH-Mini, AIME25, and GPQA-Diamond; alignment datasets are Model-Written-Evaluations (Self-Awareness and Survival-Instinct) and BBQ; tool usage is represented by TauBench v1 (Retail and Airline).
- Each dataset is evaluated under four settings: clean (no distractors), with a random document, with a random chat history, and with a task-specific hard negative distractor (a minimal assembly sketch follows this list).
- Random documents are sampled from RULER-HotPotQA, and random chat histories come from WildChat.
- Hard negative distractors are synthetically generated using an LLM (Gemini-2.5-Pro) by prompting with each question, ensuring they appear relevant but contain no useful or correct information.
- A two-stage filtering process removes 2.7% of samples: first, consistency with the original question is verified; second, any distractor implying or containing the gold answer is discarded.
- After filtering, 2,766 high-quality question-distractor pairs are constructed per setting, forming a balanced benchmark.
- The authors use this benchmark to evaluate model robustness, particularly in noisy retrieval and reasoning scenarios, and to train and fine-tune models using the NoisyInstruct dataset.
- NoisyInstruct is built from the NVIDIA Nemotron Nano 2 Post Training dataset and includes four data types: (A|Q), (A|Q,H), (A|Q,D), and (A|Q,D,H), where A denotes the answer, Q the question, H a hint, and D a distractor; hints and distractors are generated synthetically.
- Random documents for NoisyInstruct are drawn from Natural Questions, and random chat histories from the chat split of Nemotron Nano 2, ensuring no overlap with NoisyBench.
- Hints are generated similarly to hard negatives and filtered to avoid containing the gold answer, using Gemini-2.5-Pro as an LLM-as-a-judge.
- NoisyInstruct is available in four sizes: 4.5k, 45k, 450k, and 4.5m samples, enabling scalable training.
- The paper uses the NoisyInstruct dataset for supervised fine-tuning (SFT) and reinforcement learning (RL) with Group Relative Policy Optimization (GRPO), using gpt-oss-120b as a reward model.
- During training, models are exposed to all four data types, with mixture ratios tuned to balance robustness and performance.
- A cropping strategy is not explicitly applied; instead, the focus is on filtering and quality control to ensure distractors are non-informative and non-harmful.
- Metadata includes task category, distractor type, and similarity scores computed via sentence embeddings, normalized and binned for analysis.
- The authors emphasize ethical safeguards: harmful content is filtered using moderation APIs, and the benchmark is prohibited for training to prevent bias amplification.
- Qualitative examples in Appendix G illustrate failure cases, including hallucinations and biased responses, with warnings about potentially disturbing content.
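To make the four evaluation settings concrete, below is a minimal sketch of how a NoisyBench-style input could be assembled; the dataclass fields, the sampling helpers, and the prompt layout are illustrative assumptions rather than the authors' released code.

```python
import random
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    gold_answer: str
    hard_negative: str  # LLM-generated distractor: looks relevant, contains no useful answer

def build_context(example: Example, setting: str,
                  random_docs: list[str], random_chats: list[str]) -> str:
    """Assemble the model input for one of the four evaluation settings.

    'clean' -> question only
    'rd'    -> question + a random document (sampled from RULER-HotPotQA)
    'rc'    -> question + an unrelated chat history (sampled from WildChat)
    'hn'    -> question + a task-specific hard negative distractor
    """
    if setting == "clean":
        distractor = ""
    elif setting == "rd":
        distractor = random.choice(random_docs)
    elif setting == "rc":
        distractor = random.choice(random_chats)
    elif setting == "hn":
        distractor = example.hard_negative
    else:
        raise ValueError(f"unknown setting: {setting}")

    # Distractor-first ordering; the paper also ablates placing the question first.
    parts = [distractor, f"Question: {example.question}"]
    return "\n\n".join(p for p in parts if p)
```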
Method
The authors leverage a modular framework designed to evaluate and enhance the robustness of language models in reasoning tasks, particularly under noisy or adversarial conditions. The overall architecture consists of distinct components that simulate real-world challenges, including information overload, misleading context, and the presence of hard negatives. The process begins with a clean benchmark setup, where the model is presented with a well-structured input containing a question, a gold answer, and a relevant document. This clean context is processed by an agent that performs alignment and reasoning to produce a correct output. As shown in the figure below, the agent successfully identifies the correct answer in the clean setting, demonstrating baseline performance.

The framework then extends to a noisy environment, where the model is exposed to perturbed inputs. In the NoisyBench setup, the agent receives a question embedded in a random document or a biased chat history, which introduces irrelevant or misleading information. The model's output is evaluated for alignment and reasoning accuracy. The figure below illustrates how the model fails to maintain correct alignment when faced with noise, producing outputs that are either incorrect or inconsistent with the original question. This highlights the fragility of current models when confronted with real-world distractions.

To further stress the model, the authors introduce a hard negative generation process. A prompt is provided to generate a long, distracting document that is content-aligned with the problem but does not alter the correct answer. The generated distractor must increase cognitive load and plausible confusion while remaining faithful to the problem's context. The output is strictly limited to the distractor text, without additional formatting or explanations. This step ensures that the generated content is challenging yet logically orthogonal to the correct answer.
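A minimal sketch of how such a generation request could be issued is shown below; the prompt wording is paraphrased from the description above, and the `call_llm` helper is a hypothetical stand-in for the Gemini-2.5-Pro call used by the authors.

```python
HARD_NEGATIVE_PROMPT = """\
You are given a problem. Write a long, distracting document that is
content-aligned with the problem but must not change or reveal the correct
answer. The document should increase cognitive load and plausible confusion
while remaining faithful to the problem's context.
Output only the distractor text, with no additional formatting or explanation.

Problem:
{question}
"""

def generate_hard_negative(question: str, call_llm) -> str:
    # call_llm: any function mapping a prompt string to the model's text completion.
    return call_llm(HARD_NEGATIVE_PROMPT.format(question=question)).strip()
```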

The consistency of the generated hard negative is then evaluated using a separate prompt. An expert educational logic evaluator assesses whether the distractor belongs to the same category or logical type as the gold answer, maintains contextual validity, and does not alter or invalidate the original premise of the question. The evaluation is conducted step-by-step, with the final decision returned in a JSON format indicating whether the distractor is consistent. This ensures that the generated content is both plausible and logically sound, yet does not compromise the integrity of the original question.
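The consistency check could be implemented along the lines of the sketch below; the JSON schema, prompt wording, and `call_llm` interface are assumptions, since the paper's evaluator prompt is more detailed.

```python
import json

CONSISTENCY_PROMPT = """\
You are an expert educational logic evaluator. Given a question, its gold
answer, and a candidate distractor, reason step by step about whether the
distractor (1) belongs to the same category or logical type as the gold
answer, (2) remains contextually valid, and (3) neither alters nor
invalidates the original premise of the question, nor contains the gold answer.
End your response with a JSON verdict: {{"consistent": true}} or {{"consistent": false}}.

Question: {question}
Gold answer: {answer}
Distractor: {distractor}
"""

def is_consistent(question: str, answer: str, distractor: str, call_llm) -> bool:
    reply = call_llm(CONSISTENCY_PROMPT.format(
        question=question, answer=answer, distractor=distractor))
    # Keep only the trailing JSON object; the step-by-step reasoning precedes it.
    verdict = reply[reply.rfind("{"): reply.rfind("}") + 1]
    try:
        return bool(json.loads(verdict)["consistent"])
    except (ValueError, KeyError):
        return False  # treat unparseable verdicts as failures and discard the sample
```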

Finally, the model's output is analyzed in a reinforcement learning (RL) setting, where it is tasked with solving a problem involving Maxwell's equations. The model generates a solution that includes a detailed analysis and conclusion, but the final answer is incorrect. This example demonstrates the model's ability to produce elaborate reasoning that appears plausible but ultimately fails to arrive at the correct conclusion, highlighting the need for more robust evaluation methods.

Experiment
- NoisyBench benchmark evaluates model robustness under four distractor types: No Distractor (ND), Random Documents (RD), Random Chat (RC), and Hard Negative (HN), revealing that real-world noise significantly degrades performance across RAG, reasoning, alignment, and tool usage tasks.
- All models show substantial performance drops under distractors, with declines ranging from 9% to 80.6%, and even top-performing models like Gemini-2.5-Pro suffer large drops (38.3% on average), indicating clean accuracy does not guarantee robustness.
- Random and hard negative distractors severely impair alignment, with Gemini-2.5-Pro dropping from 94.0% to 60.5% on BBQ, demonstrating that noise alone can induce emergent misalignment.
- Agentic workflows, which use tools and multi-step planning, perform worse under noise than base reasoning models due to overreliance on distractors, error propagation, and corrupted tool routing.
- Prompting, supervised fine-tuning (SFT), and context engineering (GEPA, DC, ACE) fail to improve robustness; SFT even causes catastrophic forgetting, while CE methods often discard useful information or inherit noise.
- Reinforcement learning with outcome-based rewards (OR) improves robustness over baselines, but Rationale-Aware Reward (RARE), which rewards correct source identification during reasoning, consistently outperforms OR across all settings and models.
- RARE reduces distracted reasoning chains and increases final accuracy by enabling fine-grained supervision over the reasoning process, showing that rewarding reasoning steps, not just outcomes, enhances robustness (a minimal reward sketch follows this list).
- Higher distractor similarity to the question increases reasoning token usage and reduces accuracy, while more distractors raise output entropy and lower confidence, indicating growing uncertainty.
- Attention analysis shows incorrect predictions are strongly associated with excessive attention to distractor tokens, confirming that models are misled by irrelevant but similar content.
- Mixed distractor types (e.g., RD+RC+HN) cause greater performance degradation than single types, even when total length is controlled, highlighting that distractor composition, not input length, drives performance loss.
- Larger models show improved robustness with size, but gains plateau beyond 8B parameters, indicating scaling alone is insufficient for noise resilience.
- Training on NoisyInstruct and using RARE transfers well to clean settings, with models achieving higher accuracy even without distractors, suggesting improved generalization and context filtering.
- Surprisingly, distractors improve jailbreak detection, increasing refusal rates on safety benchmarks without harming performance on harmless questions, suggesting noise may amplify attack signals.
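To illustrate the difference between outcome-only rewards and RARE, the sketch below scores each rollout on both final-answer correctness and whether the reasoning trace grounds itself in the gold source rather than a distractor. The function names, string-matching heuristic, and 0.5/0.5 weighting are illustrative assumptions; the paper's actual RARE setup judges rationales with gpt-oss-120b as a reward model within GRPO training.

```python
def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches the gold answer, else 0.0."""
    return float(gold_answer.lower() in final_answer.lower())

def rationale_aware_reward(reasoning: str, final_answer: str, gold_answer: str,
                           gold_doc_id: str, distractor_ids: list[str]) -> float:
    """RARE-style reward sketch: also credit correct source identification.

    Rollouts whose reasoning cites the gold document score higher, while
    rollouts grounded in a distractor are penalized even if the final answer
    happens to be correct.
    """
    correct = outcome_reward(final_answer, gold_answer)
    cites_gold = float(gold_doc_id in reasoning)
    cites_distractor = float(any(d in reasoning for d in distractor_ids))
    grounding = cites_gold * (1.0 - cites_distractor)
    return 0.5 * correct + 0.5 * grounding  # weighting is an assumption

# In GRPO, these per-rollout rewards would be normalized within each sampled
# group to compute relative advantages for the policy update.
```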
The authors use a series of bar charts to compare the performance of base models and agentic workflows across four distractor settings: No Distractor (ND), Random Documents (RD), Random Chat (RC), and Hard Negative (HN). Results show that while agentic workflows improve performance in the clean ND setting, they consistently underperform the base models in all noisy conditions, indicating that agentic systems are more vulnerable to distractors due to overreliance on potentially misleading context.

The authors use a table to show the performance drop of three models under various distractor settings compared to the no-distractor baseline. Results show that all models experience significant accuracy declines when distractors are introduced, with the most severe drops occurring under hard negative and mixed distractor conditions. The DeepSeek-R1-Distill-Llama-8B model exhibits the largest degradation, particularly in the HN and RD+RC+HN settings, indicating high vulnerability to noisy inputs.

The authors use a series of bar charts to compare model performance under different distractor types and positioning, specifically examining the impact of placing the distractor before or after the question. Results show that performance varies significantly across models and distractor types, with some models like Qwen3-4B-Thinking-2507 performing better when the question appears first, while others like DeepSeek-R1-Distill-Llama-8B show little difference. The data indicates that distractor type and position influence model behavior, with hard negative distractors (HN) generally causing the most severe performance drops, and that placing the question after the distractor often leads to worse outcomes.

The authors use NoisyBench to evaluate model robustness under various distractor types, showing that all models experience significant performance drops when distractors are introduced, with declines ranging from 9% to 80%. Hard negative distractors cause the most severe degradation, and even random distractors substantially impair performance, indicating that robustness is not guaranteed by strong clean performance.

The authors use a series of bar charts to compare the performance of three models—Qwen3-4B-Thinking-2507, DeepSeek-R1-Distill-Llama-8B, and Qwen3-30B-A3B-Thinking-2507—under different context engineering methods (GEPA, DC, ACE) and distractor types (HN, RD, RC). Results show that context engineering methods provide limited improvements across all models and distractor settings, with performance generally remaining low or unchanged compared to the base model, indicating that these methods fail to enhance robustness in noisy environments.
