Command Palette
Search for a command to run...
Le titre est vide. Veuillez fournir le titre à traduire.
Le titre est vide. Veuillez fournir le titre à traduire.
Modélisation du langage causal
Résumé
Please provide the title and abstract you would like me to translate into French.
One-sentence Summary
This critical review of large language model causal reasoning benchmarks reveals that many existing evaluations can be solved through domain knowledge retrieval rather than genuine inference, and it establishes a set of criteria prioritizing interventional and counterfactual tasks to guide a general assessment framework and the design of future benchmarks.
Key Contributions
- This review systematically analyzes large language model benchmarks for causal inference, demonstrating that existing evaluations frequently conflate correlation with causation and rely on pretraining data retrieval rather than algorithmic reasoning.
- The work establishes a four-criteria design framework requiring causal language, open-ended generation, multi-factor scalability, and non-retrievable fictional contexts to isolate interventional and counterfactual reasoning capabilities.
- The analysis contrasts limitations in existing single-step and multiple-choice tasks with scalable causal chain examples to provide a concrete methodology for constructing datasets that accurately measure causal understanding.
Introduction
As large language models grow more capable, accurately evaluating their causal reasoning skills has become critical for safe deployment in complex decision-making systems. The authors leverage established causal hierarchies to audit existing evaluation benchmarks and expose a fundamental flaw in prior work. Most current tasks only measure basic statistical associations and can be bypassed through simple pattern matching or retrieval of pretraining data instead of genuine reasoning. To resolve these issues, the authors propose four essential design criteria for future benchmarks. They argue that valid evaluations must use explicit causal language to test interventions or counterfactuals, require open-ended responses, scale across multiple interacting variables, and employ fictional contexts that eliminate memory retrieval. This structured approach aims to create a reliable framework for distinguishing true causal understanding from superficial knowledge recall in language models.
Dataset
-
Dataset Composition and Sources: The authors compiled a curated collection of 39 existing datasets and benchmarks sourced from peer-reviewed literature, academic repositories, and public GitHub archives. The collection focuses exclusively on tasks designed to evaluate causal reasoning capabilities in large language models.
-
Key Details for Each Subset: The benchmarks are organized into four functional categories. Causal relation identification tasks use human-annotated fact or fantasy contexts with multiple-choice formats. Commonsense knowledge tasks test real-world inference and domain-specific retrieval, often without providing in-context data. Story-based contextual reasoning tasks utilize long-form fictional narratives to prevent memorization and require multi-step synthesis. Graph-based and interventional tasks evaluate causal discovery, counterfactual reasoning, and directed acyclic graph reconstruction using conditional independence statements or structured prompts.
-
How the Paper Uses the Data: This collection is used strictly for benchmarking and critical analysis rather than model training or fine-tuning. The authors employ the datasets to compare LLM performance across different causal reasoning hierarchies, highlighting how existing benchmarks may inadvertently reward spurious language cues or rote knowledge retrieval over genuine abstraction and imagination.
-
Processing and Metadata Details: No custom cropping, metadata generation, or data mixture ratios are applied. The authors retain the original task structures and evaluation protocols from the source materials to ensure consistency with established research. They document specific limitations in the original processing pipelines, such as the entanglement of temporal and spatial ordering with causality, pair-wise edge evaluation in graph tasks, and the reliance on constrained multiple-choice formats that limit open-ended reasoning.