How Much Reasoning Do Retrieval-Augmented Models Add over Large Language Models? A Framework for Testing Multi-Hop Reasoning over Hybrid Knowledge
Junhong Lin Bing Zhang Song Wang Ziyan Liu Dan Gutfreund Julian Shun Yada Zhu
Abstract
Large language models (LLMs) continue to struggle with questions that require cutting-edge, up-to-date knowledge and multi-step reasoning. Augmenting LLMs with hybrid external knowledge, such as unstructured text and structured knowledge graphs, is a promising alternative to costly continual retraining. Reliable evaluation of retrieval and reasoning capabilities has therefore become essential. However, many existing benchmarks increasingly overlap with LLM pretraining data: the answers or supporting knowledge may already be encoded in the model's parameters, making it difficult to distinguish genuine retrieval and reasoning from mere parametric recall. We introduce HybridRAG-Bench, a framework for constructing benchmarks that evaluate retrieval-intensive, multi-hop reasoning over hybrid knowledge. HybridRAG-Bench automatically couples unstructured text with structured knowledge graphs derived from recent scientific literature on arXiv, and produces question-answer pairs grounded in explicit reasoning paths and cutting-edge knowledge. The framework supports flexible selection of domains and time frames, enabling contamination-aware evaluation that can be customized as models and knowledge evolve. Experiments across three domains (AI, governance and policy, and computational biology) show that HybridRAG-Bench rewards genuine retrieval and reasoning rather than parametric recall, providing a systematic testbed for evaluating hybrid knowledge-augmented reasoning systems. We release our code and data at github.com/junhongmit/HybridRAG-Bench.
One-sentence Summary
Researchers from MIT, IBM, and UCF introduce HYBRIDRAG-BENCH, a contamination-aware framework that evaluates LLMs’ multi-hop reasoning over hybrid knowledge by coupling arXiv-derived text and knowledge graphs, enabling domain-specific, time-bound assessments that distinguish true retrieval from parametric recall.
Key Contributions
- HYBRIDRAG-BENCH introduces a contamination-aware benchmark framework that constructs multi-hop reasoning tasks from recent arXiv literature, ensuring questions rely on external retrieval rather than parametric recall by using time-framed, evolving scientific corpora.
- The framework automatically generates hybrid knowledge environments combining unstructured text and structured knowledge graphs, producing diverse question types grounded in explicit reasoning paths to evaluate genuine retrieval and reasoning across domains like AI, policy, and bioinformatics.
- Experiments show HYBRIDRAG-BENCH effectively discriminates between models that perform true retrieval-based reasoning and those relying on pretraining memorization, offering a scalable, customizable testbed for evaluating hybrid knowledge-augmented systems.
Introduction
The authors leverage retrieval-augmented generation (RAG) and structured knowledge graphs to tackle knowledge-intensive, multi-hop reasoning tasks where large language models (LLMs) often fail due to outdated knowledge or insufficient reasoning. Prior benchmarks suffer from pretraining contamination—where models answer correctly by memorization rather than retrieval—making it hard to assess whether systems truly reason or just recall. To address this, they introduce HYBRIDRAG-BENCH, a framework that automatically constructs contamination-aware benchmarks from recent arXiv papers, coupling unstructured text with structured knowledge graphs and generating questions tied to explicit reasoning paths. Their method enables fair, scalable evaluation across domains and timeframes, distinguishing genuine retrieval and multi-hop reasoning from parametric recall.
Dataset

The authors use HYBRIDRAG-BENCH, a dynamically constructed, domain-specific benchmark for evaluating retrieval-augmented and knowledge-grounded LLMs. Key details:
Dataset Composition and Sources
- Built from arXiv papers collected per domain using subject categories (e.g., cs.AI) and optional keywords.
- Each domain has its own evolving knowledge graph derived from documents within a user-specified time window, postdating LLM pretraining cutoffs to prevent parametric memorization.
- No shared entities or relations across domains.
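The time-windowed selection described above can be sketched as a simple filter. This is an illustrative sketch, not the paper's implementation: the field names (`date`, `categories`) and the tuple-based window are assumptions.

```python
from datetime import date


def select_corpus(papers, categories, window, pretrain_cutoff):
    """Contamination-aware corpus selection sketch: keep papers that match a
    target arXiv category and whose dates fall in a time window that postdates
    the LLM pretraining cutoff."""
    start, end = window
    assert start > pretrain_cutoff, "window must postdate the pretraining cutoff"
    return [
        p for p in papers
        if start <= p["date"] <= end
        and any(c in p["categories"] for c in categories)
    ]
```

The assertion encodes the key design constraint: because the whole window lies after the pretraining cutoff, a correct answer cannot have been memorized.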
Key Subset Details
- Question Types: Single-hop, single-hop with conditions, multi-hop, difficult multi-hop (via high-degree entities), counterfactual, and open-ended.
- Size: Varies by domain; distribution per type shown in Table 2.
- Filtering: Questions must be answerable solely from hybrid context (KG path + supporting text), pass LLM-as-a-judge faithfulness checks, and avoid document-local references or ambiguity.
- Metadata: Includes paper title, authors, categories, and timestamps; text is segmented by section (abstract, methods, etc.).
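The filtering stage above can be sketched as a two-part check. This is a minimal illustration, not the paper's pipeline: the blocklist of document-local phrases and the injected `judge` callable (standing in for an LLM-as-a-judge call) are assumptions.

```python
def passes_filters(question, answer, context, judge):
    """Multi-stage QA filter sketch. A pair survives only if (1) the question
    avoids document-local references and (2) a judge callable deems the answer
    faithful to, and answerable from, the hybrid context alone."""
    local_refs = ("this paper", "the authors", "figure", "table", "section")
    if any(ref in question.lower() for ref in local_refs):
        return False  # question leaks document-local phrasing
    return judge(question, answer, context) == "faithful"
```

In practice the `judge` would be an LLM prompted with the question, candidate answer, and hybrid context; here any callable with that signature works.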
Data Usage in Model Training/Evaluation
- Questions are generated by conditioning an LLM on: (1) a sampled KG reasoning path, (2) associated textual evidence, and (3) in-context examples.
- Each QA pair is tied to a specific knowledge graph snapshot at time t_i, ensuring temporal isolation.
- Used exclusively for evaluation—not training—of LLMs to measure ability to integrate structured and unstructured evidence.
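The three conditioning inputs listed above can be assembled into a single generation prompt. The template wording below is illustrative, not the paper's actual prompt; the path and example formats are assumptions.

```python
def build_qa_prompt(reasoning_path, evidence_chunks, examples):
    """Assemble the conditioning prompt for an LLM question generator from
    (1) a sampled KG reasoning path, (2) associated textual evidence, and
    (3) in-context examples."""
    path_str = " -> ".join(
        f"({h}) -[{r}]-> ({t})" for h, r, t in reasoning_path
    )
    shots = "\n\n".join(
        f"Path: {ex['path']}\nQ: {ex['q']}\nA: {ex['a']}" for ex in examples
    )
    evidence = "\n".join(f"- {c}" for c in evidence_chunks)
    return (
        "Write a question whose answer requires following the full "
        "reasoning path, grounded only in the evidence below.\n\n"
        f"Examples:\n{shots}\n\n"
        f"Reasoning path: {path_str}\n"
        f"Evidence:\n{evidence}\n"
        "Q:"
    )
```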
Processing and Cropping Strategy
- Reasoning paths are sampled from the domain’s knowledge graph; textual spans are retrieved for each entity/relation.
- Questions are synthesized to obscure intermediate nodes (in multi-hop cases) or include counterfactual perturbations.
- Final QA pairs undergo normalization (lowercasing, punctuation) and multi-stage filtering for clarity, faithfulness, and independence.
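The normalization step above (lowercasing, punctuation handling) can be sketched as standard QA-style answer normalization; the exact rules used by the authors may differ.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace, mirroring
    the normalization step described above."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()
```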
Method
The authors leverage a four-stage automated pipeline to construct HYBRIDRAG-BENCH, a benchmark designed to evaluate retrieval-augmented reasoning over hybrid knowledge sources. The framework begins with user-specified parameters—such as time frame, topic area, and question types—which guide the collection of a time-framed arXiv corpus. This corpus serves as the foundational data source for both unstructured text chunks and structured knowledge graphs.
Refer to the framework diagram for an overview of the pipeline’s architecture. The first stage, Time-Framed Corpus Collection, ingests documents from arXiv based on user constraints. These documents are then processed in parallel to generate two complementary knowledge representations: unstructured text chunks and a structured knowledge graph. The knowledge graph is constructed using EvoKG, a document-driven framework that extracts entities and relations via large language models. Entity extraction is followed by context-aware alignment, which resolves lexical variation and semantic ambiguity by matching new mentions against existing nodes using joint embeddings of type, name, and description. If no sufficiently similar node exists, a new entity is created; otherwise, the mention is merged, preserving provenance.
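The context-aware alignment step can be sketched as a nearest-neighbor merge over joint embeddings. This is a toy illustration under stated assumptions: the hashed bag-of-words embedder stands in for a real sentence-embedding model, and the 0.85 similarity threshold is arbitrary, not from the paper.

```python
import math
import zlib
from dataclasses import dataclass, field


@dataclass
class Entity:
    etype: str
    name: str
    description: str
    embedding: list                                  # joint type/name/description embedding
    provenance: list = field(default_factory=list)   # doc IDs of merged mentions


def embed(etype, name, description, dim=64):
    """Toy stand-in for a learned embedder: hashed bag-of-words over the
    joint type/name/description string, L2-normalized."""
    vec = [0.0] * dim
    for tok in f"{etype} {name} {description}".lower().split():
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))


def align_mention(graph, etype, name, description, doc_id, threshold=0.85):
    """Merge the mention into the most similar existing node if cosine
    similarity clears the threshold; otherwise create a new entity."""
    emb = embed(etype, name, description)
    best, best_sim = None, -1.0
    for node in graph:
        sim = cosine(emb, node.embedding)
        if sim > best_sim:
            best, best_sim = node, sim
    if best is not None and best_sim >= threshold:
        best.provenance.append(doc_id)  # merge, preserving provenance
        return best
    node = Entity(etype, name, description, emb, [doc_id])
    graph.append(node)
    return node
```

A repeated mention of the same entity merges into one node (accumulating provenance), while a dissimilar mention spawns a new one.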

Relation normalization follows, where extracted relations are mapped to a domain-specific schema and linked to supporting textual evidence. The graph retains multiple candidate relations when supported by the corpus, annotated with confidence scores derived from frequency, recency, and textual support—thereby preserving the uncertainty and variation inherent in scientific literature.
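A confidence score combining frequency, recency, and textual support could be computed as below. The specific weights, the saturating frequency term, and the one-year recency decay are illustrative assumptions, not the paper's actual scoring function.

```python
import math
from datetime import date


def relation_confidence(mentions, today, w_freq=0.4, w_recency=0.4, w_support=0.2):
    """Confidence for a candidate relation given a list of
    (timestamp, support_strength) mentions from the corpus."""
    if not mentions:
        return 0.0
    # Frequency: saturating function of how often the relation is attested.
    freq = 1.0 - math.exp(-len(mentions) / 3.0)
    # Recency: exponential decay over days since the most recent mention.
    newest = max(ts for ts, _ in mentions)
    recency = math.exp(-(today - newest).days / 365.0)
    # Textual support: mean extractor/judge support strength in [0, 1].
    support = sum(s for _, s in mentions) / len(mentions)
    return w_freq * freq + w_recency * recency + w_support * support
```

Because all three terms lie in [0, 1] and the weights sum to 1, the score itself stays in [0, 1], so competing candidate relations remain directly comparable.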
In the third stage, Hybrid-Grounded QA Generation, the system synthesizes diverse question-answer pairs grounded in explicit reasoning paths that traverse both the knowledge graph and retrieved text chunks. These questions span single-hop, multi-hop, conditional, counterfactual, and open-ended reasoning types. The final stage, QA Pairs Quality Control, applies automated filters to ensure answerability, independence from document phrasing, and non-redundancy, yielding evaluation-ready QA pairs.
The resulting benchmark enables controlled, reproducible evaluation of RAG and KG-RAG systems by providing both structured and unstructured knowledge sources that reflect real-world scientific discourse. The model's prediction is formally defined as â = f(q, G_{t_q}^{(m)}, D_{t_q}^{(m)}), where the model f reasons over the knowledge-graph snapshot G_{t_q}^{(m)} and the retrieved documents D_{t_q}^{(m)} available at query time t_q.
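The evaluation interface implied by this formulation can be sketched as follows. The exact-match metric and the callable signature are assumptions for illustration; the paper may use additional metrics.

```python
def evaluate(model, qa_pairs, kg_snapshot, doc_snapshot):
    """Exact-match accuracy for a_hat = f(q, G_tq, D_tq): `model` is any
    callable that answers a question using only the knowledge-graph and
    document snapshots available at query time."""
    correct = sum(
        model(q, kg_snapshot, doc_snapshot).strip().lower() == gold.strip().lower()
        for q, gold in qa_pairs
    )
    return correct / len(qa_pairs)
```

Any system, whether LLM-only, text RAG, or hybrid KG-RAG, can be plugged in as `model`, which is what lets the benchmark compare strategies under identical knowledge snapshots.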
Experiment
- HybridRAG-Bench poses persistent challenges across LLM scales, confirming questions cannot be reliably answered by parametric knowledge alone and require external retrieval and reasoning.
- External retrieval is essential: text-based RAG significantly improves performance over LLM-only methods, while naive KG injection often degrades results due to noise.
- Structured knowledge adds complementary value: hybrid KG-RAG methods consistently outperform text-only RAG, especially on relational, multi-hop, and disambiguation tasks.
- The benchmark effectively discriminates between reasoning strategies: performance varies meaningfully by question type, with structured methods excelling on multi-hop and counterfactual queries, while text retrieval dominates open-ended questions.
- KG construction is effective and scalable: the pipeline recovers ~71% of verifiable facts and scales near-linearly in cost and latency, ensuring practical deployment without performance bottlenecks.
The authors evaluate their KG construction pipeline against prior methods and find that EvoKG captures significantly more verifiable facts from source documents, achieving a 71.36% recovery rate compared to 66.46% for KGen and lower rates for OpenIE and GraphRAG. This indicates that the knowledge graphs used in HybridRAG-Bench are robust and not a limiting factor in the benchmark’s difficulty. Results confirm that the challenge stems from retrieval and reasoning demands rather than incomplete or inaccurate knowledge extraction.

The authors use HybridRAG-Bench to evaluate how different retrieval and reasoning strategies perform across domain-specific tasks, finding that LLM-only approaches consistently underperform regardless of model scale. Results show that combining structured knowledge graphs with text-based retrieval yields the strongest performance, especially on multi-hop and counterfactual questions, indicating that effective reasoning requires more than just access to external information. The benchmark meaningfully distinguishes between methods by question type, revealing that hybrid approaches outperform both pure text retrieval and naive graph augmentation.

The authors use HybridRAG-Bench to evaluate how different LLMs and retrieval strategies perform on knowledge-intensive reasoning tasks across three domains. Results show that LLM-only approaches perform poorly regardless of model scale, while hybrid methods combining structured knowledge graphs with text retrieval consistently outperform text-only RAG, especially on multi-hop and counterfactual questions. The benchmark effectively discriminates between reasoning strategies, revealing that success depends more on how knowledge is integrated than on model size alone.
