Large Reasoning Models Excel Until Complexity Overwhelms Them: New Dataset Reveals Sharp Performance Drop Beyond Training Regime
Large language models (LLMs) have made remarkable strides in reasoning tasks, demonstrating impressive capabilities in areas such as mathematics, science, and logic. Recent research, however, reveals a critical limitation: transformers and LLMs often fail dramatically once reasoning problems exceed a certain level of complexity. This paper re-examines those findings through the lens of Large Reasoning Models (LRMs), LLMs fine-tuned with incentives for step-by-step reasoning and self-verification.

On benchmarks such as NLGraph, LRMs have posted seemingly extraordinary scores, prompting claims that they achieve generalized reasoning and can even innovate in complex domains like mathematics, physics, medicine, and law. On closer inspection, however, the complexity of existing benchmarks turns out to be surprisingly low. To scale reasoning challenges systematically, the study introduces the Deep Reasoning Dataset (DeepRD), a generator that produces an unlimited number of examples with controllable, scalable complexity (an illustrative sketch appears below). Using DeepRD, the researchers evaluate LRMs on two tasks: graph connectivity and natural-language proof planning.

The results show a sharp, abrupt drop in performance as problem complexity grows, indicating that current LRMs do not generalize beyond the complexity of their training data. To connect these findings to practice, the study examines the distribution of complexity in large-scale knowledge graphs, interaction networks, and formal proof repositories. Most real-world examples fall within the range where LRMs perform well, but the long tail of highly complex cases, though rare, constitutes a significant failure mode: in high-stakes applications where such problems arise, LRMs are likely to fail without warning.

The upshot is that LRMs are highly useful in practical, near-term settings where problems stay within their success regime, but they are not a path to robust, generalizable reasoning. The paper concludes that current approaches are fundamentally limited by their dependence on training-data complexity, and it calls for new methods that can reason effectively beyond the boundaries of the training distribution. Without such advances, the promise of truly autonomous, innovative reasoning in AI remains out of reach.
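The summary does not spell out DeepRD's generation procedure, but the core idea of unlimited examples with a controllable complexity knob can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the function name `generate_connectivity_problem`, the use of shortest witness-path length as the complexity parameter, and the disjoint distractor component are not taken from the paper.

```python
import random

def generate_connectivity_problem(path_length, n_distractors, seed=None):
    """Return (edges, positive_query, negative_query) for a yes/no
    connectivity question whose witness path has exactly `path_length`
    edges. Hypothetical generator, not the paper's actual procedure."""
    assert path_length >= 1 and n_distractors >= 1
    rng = random.Random(seed)
    # Backbone: a guaranteed path 0 - 1 - ... - path_length, so the query
    # (0, path_length) is connected and its shortest witness needs exactly
    # path_length hops; this is the complexity knob.
    edges = [(i, i + 1) for i in range(path_length)]
    # Distractors: a random tree over fresh node ids, kept disjoint from
    # the backbone so any query into it is provably disconnected.
    base = path_length + 1
    for k in range(1, n_distractors):
        edges.append((base + rng.randrange(k), base + k))
    rng.shuffle(edges)  # hide the backbone's ordering from the model
    positive = (0, path_length)
    negative = (0, base + rng.randrange(n_distractors))
    return edges, positive, negative

# Render one instance as a natural-language prompt.
edges, pos, neg = generate_connectivity_problem(path_length=12, n_distractors=20, seed=0)
facts = " ".join(f"Node {u} is linked to node {v}." for u, v in edges)
print(f"{facts} Is node {pos[0]} connected to node {pos[1]}?")
```

Keeping the distractor component disjoint from the backbone makes the ground-truth label trivial to certify while letting graph size and witness-path length scale independently, which is what a controllable-complexity benchmark requires.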

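The long-tail analysis can likewise be sketched under stated assumptions: sample node pairs from a graph standing in for a knowledge graph or interaction network, and tally shortest-path lengths (BFS hop counts) as a stand-in complexity measure. The paper's actual complexity metric and corpora are not specified here; `complexity_histogram` is a hypothetical helper.

```python
import random
from collections import Counter, deque

def shortest_path_length(adj, src, dst):
    """Plain BFS over an adjacency dict; returns hop count, or None if
    dst is unreachable from src."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None

def complexity_histogram(adj, n_samples=10_000, seed=0):
    """Estimate the distribution of pairwise reasoning complexity by
    sampling random node pairs and recording their BFS distance."""
    rng = random.Random(seed)
    nodes = list(adj)
    hist = Counter()
    for _ in range(n_samples):
        d = shortest_path_length(adj, rng.choice(nodes), rng.choice(nodes))
        if d is not None:
            hist[d] += 1
    return hist  # expect most mass at small d, with a sparse long tail
```

On graphs with the skewed structure the paper describes, most sampled pairs land at small distances (the regime where LRMs succeed), while the rare large-distance pairs form the long tail where failures concentrate.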