Lost in the Noise: How Reasoning Models Fail with Contextual Distractors

Seongyun Lee, Yongrae Jo, Minju Seo, Moontae Lee, Minjoon Seo

Abstract

Recent advances in reasoning models and agentic AI systems have led to increased reliance on diverse external information sources. However, this shift inherently introduces noisy input contexts that current sanitized benchmarks fail to capture. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets spanning retrieval-augmented generation (RAG), reasoning, alignment, and tool-use tasks, against diverse noise types including random documents, irrelevant chat histories, and hard negative distractors. Our evaluations reveal catastrophic performance drops of up to 80% in leading models when they face contextual distractors. Critically, we find that agentic workflows often amplify these errors through over-trust in noise-contaminated tool outputs, and that distractors can induce emergent misalignment even without adversarial intent. We further find that prompting techniques, context engineering, supervised fine-tuning (SFT), and outcome-reward-only RL are insufficient to guarantee robustness; in contrast, our proposed Rationale-Aware Reward (RARE) substantially improves model resilience by incentivizing the model to identify helpful information within the noise. Finally, we uncover an inverse scaling trend in which increased test-time compute leads to worse performance in noisy settings, and show through attention visualization that models attend disproportionately to distractor tokens, providing critical insights for building the next generation of robust reasoning agents.

One-sentence Summary

Seongyun Lee, Yongrae Jo, Minju Seo, Moontae Lee, and Minjoon Seo from KAIST AI, LG AI Research, and University of Illinois Chicago introduce NoisyBench, a comprehensive benchmark evaluating reasoning models under realistic noise, revealing up to 80% performance drops in state-of-the-art models due to contextual distractors. They propose Rationale-Aware Reward (RARE), a novel training method that incentivizes models to identify helpful information within noise, significantly improving robustness by grounding reasoning in relevant content. Unlike outcome-based rewards, RARE enhances resilience by directly supervising the reasoning process, mitigating error amplification in agentic workflows and addressing emergent misalignment. Their findings highlight the critical need for noise-aware evaluation and training in real-world agentic AI systems.

Key Contributions

  • Current reasoning models and agentic AI systems perform well in clean, sanitized benchmarks but suffer catastrophic performance drops—up to 80%—when exposed to realistic noise such as random documents, irrelevant chat history, and hard negative distractors, revealing a critical gap between benchmark evaluations and real-world deployment.

  • The authors introduce NoisyBench, a comprehensive benchmark across 11 datasets spanning RAG, reasoning, alignment, and tool-use tasks, which systematically evaluates model robustness under diverse noise types and uncovers that agentic workflows often amplify errors by over-trusting noisy inputs, leading to emergent misalignment even without adversarial intent.

  • To improve resilience, they propose Rationale-Aware Reward (RARE), a training objective that incentivizes models to identify and prioritize helpful information within noisy contexts, significantly outperforming prompting, context engineering, SFT, and outcome-reward RL by increasing distractor filtering and improving final accuracy, while attention analysis reveals models disproportionately focus on distractor tokens in incorrect predictions.

Introduction

The authors address the growing reliance of agentic AI systems on external information, which introduces real-world noise such as irrelevant documents, faulty tool outputs, and misleading chat histories—challenges not captured by existing clean benchmarks. Prior work has focused on idealized, sanitized environments, leading to an overestimation of model robustness; even non-adversarial noise causes catastrophic performance drops of up to 80% in state-of-the-art models, with agentic workflows amplifying errors by encouraging over-trust in noisy inputs. The authors introduce NoisyBench, a comprehensive benchmark evaluating robustness across 11 datasets in RAG, reasoning, alignment, and tool-use under diverse noise types. They find that standard mitigation strategies—prompting, context engineering, supervised fine-tuning, and outcome-based reinforcement learning—fail to ensure resilience. In contrast, their proposed Rationale-Aware Reward (RARE) significantly improves performance by explicitly rewarding models for identifying and filtering helpful information within noisy contexts, leading to better reasoning grounding and higher accuracy. Analysis reveals that noise induces inverse scaling (worse performance with more computation), increased uncertainty, and attention misdirection toward distractors, highlighting fundamental vulnerabilities in current models.

Dataset

  • The dataset NoisyBench comprises eleven diverse datasets across four task categories: Retrieval-Augmented Generation (RAG), reasoning, alignment, and tool usage.
  • RAG datasets include SealQA, MultihopRAG, and Musique; reasoning datasets are BBEH-Mini, AIME25, and GPQA-Diamond; alignment datasets are Model-Written-Evaluations (Self-Awareness and Survival-Instinct) and BBQ; tool usage is represented by TauBench v1 (Retail and Airline).
  • Each dataset is evaluated under four settings: clean (no distractors), with a random document, with a random chat history, and with a task-specific hard negative distractor (see the sketch after this list).
  • Random documents are sampled from RULER-HotPotQA, and random chat histories come from WildChat.
  • Hard negative distractors are synthetically generated using an LLM (Gemini-2.5-Pro) by prompting with each question, ensuring they appear relevant but contain no useful or correct information.
  • A two-stage filtering process removes 2.7% of samples: first, consistency with the original question is verified; second, any distractor implying or containing the gold answer is discarded.
  • After filtering, 2,766 high-quality question-distractor pairs are constructed per setting, forming a balanced benchmark.
  • The authors use this benchmark to evaluate model robustness, particularly in noisy retrieval and reasoning scenarios, and to train and fine-tune models using the NoisyInstruct dataset.
  • NoisyInstruct is built from the NVIDIA Nemotron Nano 2 Post Training dataset and includes four data types: (A|Q), (A|Q,H), (A|Q,D), and (A|Q,D,H), with hints and distractors generated synthetically.
  • Random documents for NoisyInstruct are drawn from Natural Questions, and random chat histories from the chat split of Nemotron Nano 2, ensuring no overlap with NoisyBench.
  • Hints are generated similarly to hard negatives and filtered to avoid containing the gold answer, using Gemini-2.5-Pro as an LLM-as-a-judge.
  • NoisyInstruct is available in four sizes: 4.5k, 45k, 450k, and 4.5M samples, enabling scalable training.
  • The paper uses the NoisyInstruct dataset for supervised fine-tuning (SFT) and reinforcement learning (RL) with Group Relative Policy Optimization (GRPO), using gpt-oss-120b as a reward model.
  • During training, models are exposed to all four data types, with mixture ratios tuned to balance robustness and performance.
  • No explicit truncation or cropping strategy is applied; instead, the focus is on filtering and quality control to ensure distractors are non-informative and non-harmful.
  • Metadata includes task category, distractor type, and similarity scores computed via sentence embeddings, normalized and binned for analysis.
  • The authors emphasize ethical safeguards: harmful content is filtered using moderation APIs, and the benchmark is designated for evaluation only (training on it is prohibited) to prevent bias amplification.
  • Qualitative examples in Appendix G illustrate failure cases, including hallucinations and biased responses, with warnings about potentially disturbing content.
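The composition of an evaluation example under the four settings above can be pictured with a short sketch. This is a hypothetical illustration, not the authors' released code; the helper name and prompt layout are assumptions, and only the setting names and corpora mirror the description in the list.

```python
# Hypothetical illustration of how one NoisyBench-style example could be composed
# under the four evaluation settings; names and layout are assumptions.
import random

def compose_example(question: str,
                    random_docs: list[str],
                    chat_histories: list[str],
                    hard_negative: str,
                    setting: str) -> str:
    """Return the model input for one of the four distractor settings."""
    if setting == "clean":
        return question
    if setting == "random_document":
        # Random documents come from an unrelated corpus (RULER-HotPotQA in the paper).
        return f"{random.choice(random_docs)}\n\n{question}"
    if setting == "random_chat":
        # Irrelevant multi-turn chat history (WildChat in the paper) precedes the question.
        return f"{random.choice(chat_histories)}\n\n{question}"
    if setting == "hard_negative":
        # A question-specific, LLM-generated distractor that looks relevant but
        # contains no useful or correct information.
        return f"{hard_negative}\n\n{question}"
    raise ValueError(f"unknown setting: {setting}")
```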

Method

The authors leverage a modular framework designed to evaluate and enhance the robustness of language models in reasoning tasks, particularly under noisy or adversarial conditions. The overall architecture consists of distinct components that simulate real-world challenges, including information overload, misleading context, and the presence of hard negatives. The process begins with a clean benchmark setup, where the model is presented with a well-structured input containing a question, a gold answer, and a relevant document. This clean context is processed by an agent that performs alignment and reasoning to produce a correct output. As shown in the figure below, the agent successfully identifies the correct answer in the clean setting, demonstrating baseline performance.

The framework then extends to a noisy environment, where the model is exposed to perturbed inputs. In the NoisyBench setup, the agent receives a question embedded in a random document or a biased chat history, which introduces irrelevant or misleading information. The model's output is evaluated for alignment and reasoning accuracy. The figure below illustrates how the model fails to maintain correct alignment when faced with noise, producing outputs that are either incorrect or inconsistent with the original question. This highlights the fragility of current models when confronted with real-world distractions.

To further stress the model, the authors introduce a hard negative generation process. A prompt is provided to generate a long, distracting document that is content-aligned with the problem but does not alter the correct answer. The generated distractor must increase cognitive load and plausible confusion while remaining faithful to the problem's context. The output is strictly limited to the distractor text, without additional formatting or explanations. This step ensures that the generated content is challenging yet logically orthogonal to the correct answer.
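A minimal sketch of this generation step is shown below, assuming a generic `generate(prompt) -> str` wrapper around the LLM (Gemini-2.5-Pro in the paper). The prompt wording paraphrases the description above; it is not the authors' actual prompt.

```python
# Hard-negative generation sketch; prompt text and helper names are assumptions.
HARD_NEGATIVE_PROMPT = """You are given a problem. Write a long, distracting document that is
content-aligned with the problem but must not change or reveal the correct answer.
The document should increase cognitive load and plausible confusion while remaining
faithful to the problem's context. Output only the distractor text, with no extra
formatting or explanation.

Problem:
{question}
"""

def generate_hard_negative(question: str, generate) -> str:
    """Produce one question-specific hard negative distractor."""
    return generate(HARD_NEGATIVE_PROMPT.format(question=question)).strip()
```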

The consistency of the generated hard negative is then evaluated using a separate prompt. An expert educational logic evaluator assesses whether the distractor belongs to the same category or logical type as the gold answer, maintains contextual validity, and does not alter or invalidate the original premise of the question. The evaluation is conducted step-by-step, with the final decision returned in a JSON format indicating whether the distractor is consistent. This ensures that the generated content is both plausible and logically sound, yet does not compromise the integrity of the original question.
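This check can likewise be sketched as an LLM-as-a-judge call followed by a gold-answer leakage filter. The JSON schema, prompt text, and the simple substring leakage test below are illustrative assumptions; the paper only specifies a step-by-step evaluation that ends in a JSON decision.

```python
# Two-stage filtering sketch, assuming a generic `generate(prompt) -> str` judge
# (Gemini-2.5-Pro as LLM-as-a-judge in the paper).
import json

CONSISTENCY_PROMPT = """You are an expert educational logic evaluator.
Assess step by step whether the distractor (1) belongs to the same category or
logical type as the gold answer, (2) remains contextually valid for the question,
and (3) does not alter or invalidate the question's original premise.
Return JSON only: {{"consistent": true or false, "reason": "..."}}

Question: {question}
Gold answer: {gold}
Distractor: {distractor}
"""

def keep_distractor(question: str, gold: str, distractor: str, generate) -> bool:
    """Keep a distractor only if the judge deems it consistent and it does not leak the answer."""
    # Assumes the judge returns the JSON object directly.
    verdict = json.loads(generate(CONSISTENCY_PROMPT.format(
        question=question, gold=gold, distractor=distractor)))
    leaks_answer = gold.lower() in distractor.lower()
    return bool(verdict.get("consistent")) and not leaks_answer
```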

Finally, the model's output is analyzed in a reinforcement learning (RL) setting, where it is tasked with solving a problem involving Maxwell's equations. The model generates a solution that includes a detailed analysis and conclusion, but the final answer is incorrect. This example demonstrates the model's ability to produce elaborate reasoning that appears plausible but ultimately fails to arrive at the correct conclusion, highlighting the need for more robust evaluation methods.

Experiment

  • NoisyBench benchmark evaluates model robustness under four distractor types: No Distractor (ND), Random Documents (RD), Random Chat (RC), and Hard Negative (HN), revealing that real-world noise significantly degrades performance across RAG, reasoning, alignment, and tool usage tasks.
  • All models show substantial performance drops under distractors, with declines ranging from 9% to 80.6%, and even top-performing models like Gemini-2.5-Pro suffer large drops (38.3% on average), indicating clean accuracy does not guarantee robustness.
  • Random and hard negative distractors severely impair alignment, with Gemini-2.5-Pro dropping from 94.0% to 60.5% on BBQ, demonstrating that noise alone can induce emergent misalignment.
  • Agentic workflows, which use tools and multi-step planning, perform worse under noise than base reasoning models due to overreliance on distractors, error propagation, and corrupted tool routing.
  • Prompting, supervised fine-tuning (SFT), and context engineering (GEPA, DC, ACE) fail to improve robustness; SFT even causes catastrophic forgetting, while CE methods often discard useful information or inherit noise.
  • Reinforcement learning with outcome-based rewards (OR) improves robustness over baselines, but Rationale-Aware Reward (RARE), which rewards correct source identification during reasoning, consistently outperforms OR across all settings and models.
  • RARE reduces distracted reasoning chains and increases final accuracy by enabling fine-grained supervision over the reasoning process, proving that rewarding reasoning steps—not just outcomes—enhances robustness (a rough sketch of this reward follows the list).
  • Higher distractor similarity to the question increases reasoning token usage and reduces accuracy, while more distractors raise output entropy and lower confidence, indicating growing uncertainty.
  • Attention analysis shows incorrect predictions are strongly associated with excessive attention to distractor tokens, confirming that models are misled by irrelevant but similar content.
  • Mixed distractor types (e.g., RD+RC+HN) cause greater performance degradation than single types, even when total length is controlled, highlighting that distractor composition, not input length, drives performance loss.
  • Larger models show improved robustness with size, but gains plateau beyond 8B parameters, indicating scaling alone is insufficient for noise resilience.
  • Training on NoisyInstruct and using RARE transfers well to clean settings, with models achieving higher accuracy even without distractors, suggesting improved generalization and context filtering.
  • Surprisingly, distractors improve jailbreak detection, increasing refusal rates on safety benchmarks without harming performance on harmless questions, suggesting noise may amplify attack signals.
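As a rough illustration of the reward referenced in the list above, a RARE-style objective can be read as an outcome reward plus a term that checks whether the reasoning trace grounds itself in the helpful content rather than the distractor. The sketch below uses naive substring checks and arbitrary weights purely for illustration; the paper instead scores rationales with a gpt-oss-120b reward model inside GRPO, so the function name, weights, and grading logic here are assumptions.

```python
def rare_reward(reasoning: str, answer: str, gold_answer: str,
                helpful_span: str, distractor_span: str,
                w_outcome: float = 1.0, w_rationale: float = 0.5) -> float:
    """Outcome reward plus a rationale-grounding bonus/penalty (illustrative only)."""
    outcome = 1.0 if answer.strip() == gold_answer.strip() else 0.0
    # Reward reasoning that cites the helpful source; penalize reasoning that
    # leans on the distractor. Substring matching stands in for a learned judge.
    cites_helpful = helpful_span.lower() in reasoning.lower()
    cites_distractor = distractor_span.lower() in reasoning.lower()
    rationale = (1.0 if cites_helpful else 0.0) - (1.0 if cites_distractor else 0.0)
    return w_outcome * outcome + w_rationale * rationale
```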

The authors use a series of bar charts to compare the performance of base models and agentic workflows across four distractor settings: No Distractor (ND), Random Documents (RD), Random Chat (RC), and Hard Negative (HN). Results show that while agentic workflows improve performance in the clean ND setting, they consistently underperform the base models in all noisy conditions, indicating that agentic systems are more vulnerable to distractors due to overreliance on potentially misleading context.

The authors use a table to show the performance drop of three models under various distractor settings compared to the no-distractor baseline. Results show that all models experience significant accuracy declines when distractors are introduced, with the most severe drops occurring under hard negative and mixed distractor conditions. The DeepSeek-R1-Distill-Llama-8B model exhibits the largest degradation, particularly in the HN and RD+RC+HN settings, indicating high vulnerability to noisy inputs.

The authors use a series of bar charts to compare model performance under different distractor types and positioning, specifically examining the impact of placing the distractor before or after the question. Results show that performance varies significantly across models and distractor types, with some models like Qwen3-4B-Thinking-2507 performing better when the question appears first, while others like DeepSeek-R1-Distill-Llama-8B show little difference. The data indicates that distractor type and position influence model behavior, with hard negative distractors (HN) generally causing the most severe performance drops, and that placing the question after the distractor often leads to worse outcomes.
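The ordering comparison can be made concrete with a tiny hypothetical helper; the function name and separator are assumptions, since the exact prompt template is not given here.

```python
def order_prompt(question: str, distractor: str, question_first: bool) -> str:
    """Place the distractor after or before the question, as in the positioning study."""
    return (f"{question}\n\n{distractor}" if question_first
            else f"{distractor}\n\n{question}")
```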

The authors use NoisyBench to evaluate model robustness under various distractor types, showing that all models experience significant performance drops when distractors are introduced, with declines ranging from 9% to 80%. Hard negative distractors cause the most severe degradation, and even random distractors substantially impair performance, indicating that robustness is not guaranteed by strong clean performance.

The authors use a series of bar charts to compare the performance of three models—Qwen3-4B-Thinking-2507, DeepSeek-R1-Distill-Llama-8B, and Qwen3-30B-A3B-Thinking-2507—under different context engineering methods (GEPA, DC, ACE) and distractor types (HN, RD, RC). Results show that context engineering methods provide limited improvements across all models and distractor settings, with performance generally remaining low or unchanged compared to the base model, indicating that these methods fail to enhance robustness in noisy environments.

