Apple's Study on AI Reasoning Sparks Debate: Critics Argue Models Were Limited by Token Constraints, Not Cognitive Failures
Apple's machine learning (ML) research group sparked a heated debate with "The Illusion of Thinking," a 53-page paper published earlier this month. The study argues that large reasoning models (LRMs), also called reasoning large language models (LLMs), such as OpenAI's "o" series and Google's Gemini 2.5 Pro and Flash Thinking, do not genuinely engage in independent reasoning based on generalized first principles learned from their training data. Instead, the researchers claim, these models rely on pattern matching, and their apparent reasoning abilities degrade sharply as tasks become more complex.

The study used four classic planning problems (Tower of Hanoi, Blocks World, River Crossing, and Checker Jumping) to evaluate the models' reasoning capabilities. These puzzles were selected for their established roles in cognitive science and AI research, and because their complexity can be scaled while still demanding complete, step-by-step solutions. As the puzzles grew more intricate, the researchers observed a consistent decline in accuracy, collapsing to zero on the most complex tasks. Counterintuitively, the length of the models' internal reasoning traces, measured in tokens, also shrank past a certain difficulty threshold, which the researchers interpreted as the models abandoning problem-solving efforts when faced with high-complexity tasks.

The timing of the paper's release, just ahead of Apple's Worldwide Developers Conference (WWDC), amplified its impact. On X, many users read the findings as a high-profile acknowledgment that current LLMs are merely advanced autocomplete engines lacking true reasoning capabilities, and that narrative spurred widespread discussion and debate.

The study faced significant criticism from the ML community, however. ML researcher @scaling01 (Ruben Hassid) argued that Apple conflated token-budget limitations with reasoning failures: for puzzles like Tower of Hanoi, where the number of required moves grows exponentially with the number of disks, the models simply could not fit a full move-by-move solution within their output limits. Hassid demonstrated that models such as Claude 3.7 Sonnet and DeepSeek-R1 generated algorithmically correct strategies yet were marked incorrect because of output constraints (a back-of-envelope illustration of this point appears below).

Another critic emphasized that decomposing tasks into smaller steps often degraded model performance, not because the models failed to understand the puzzles but because they lost track of previous moves. This suggested that the real bottleneck was the fixed context window of LLMs rather than their reasoning abilities. The absence of human performance benchmarks was also highlighted as a major flaw: humans likewise struggle with long, multistep puzzles without aids, so comparing LRM performance against human baselines would give a more grounded assessment of the models' capabilities.

The debate also turned on the binary framing of the paper's title, which draws a sharp line between "pattern matching" and "reasoning." Alexander Doria (Pierre-Carl Langlais) of Pleias argued that the models might be learning partial heuristics rather than purely matching patterns, adding nuance to the discussion. Ethan Mollick, an AI-focused professor at the Wharton School of the University of Pennsylvania, called the "wall-hitting" conclusion premature, citing past instances where similar claims about model collapse were later refuted.
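The token-budget objection is easy to check with simple arithmetic: the minimum Tower of Hanoi solution requires 2^n - 1 moves, so a written-out transcript roughly doubles with every added disk. The short sketch below makes the point concrete; the tokens-per-move figure and the output budget are illustrative assumptions, not numbers taken from Apple's paper or its critics.

```python
# Back-of-envelope sketch of the token-budget objection. The per-move token
# estimate and the output budget below are illustrative assumptions, not
# figures from Apple's paper or the rebuttal.

TOKENS_PER_MOVE = 10      # rough cost of writing out "move disk 3 from peg A to peg C"
OUTPUT_BUDGET = 64_000    # illustrative output-token ceiling

def min_moves(n_disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with n disks: 2^n - 1."""
    return 2 ** n_disks - 1

for n in (8, 10, 12, 15):
    tokens = min_moves(n) * TOKENS_PER_MOVE
    verdict = "fits within" if tokens <= OUTPUT_BUDGET else "blows past"
    print(f"{n} disks: {min_moves(n):,} moves, ~{tokens:,} tokens ({verdict} the budget)")
```

Under these assumptions, a 15-disk instance alone runs to hundreds of thousands of output tokens, far beyond what a model can emit in a single response even if it knows the correct algorithm.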
Adding fuel to the fire, a rebuttal titled "The Illusion of The Illusion of Thinking" was soon posted to arXiv. Co-authored by independent researcher Alex Lawsen and Anthropic's Claude Opus 4, it challenges Apple's conclusions, arguing that the observed performance collapse was largely an artifact of the test setup, particularly the token limitations. At 15 disks, for instance, Tower of Hanoi requires 32,767 moves (2^15 - 1), so writing out every step overflows the output budget and gets scored as a wrong answer. The rebuttal contends that Apple's evaluation script penalized models for hitting output ceilings even when they had internally followed a correct solution strategy. The authors also pointed out methodological issues in the Apple benchmarks, such as River Crossing instances that are mathematically unsolvable, meaning models were graded as failing on puzzles with no valid solution.

In new experiments, Lawsen and Claude allowed models to provide compressed, programmatic answers instead of exhaustive move lists, and found that LRMs could succeed on far more complex problems under the different output format. Asking for a Lua function that generates the Tower of Hanoi solution, for example, eliminated the collapse entirely (a sketch of what such a compressed answer looks like appears at the end of this article).

The implications of this debate extend beyond academia to enterprise decision-makers who build applications on reasoning LLMs. The controversy underscores the importance of designing evaluations that accurately reflect real-world use cases and avoid artificial constraints. Developers must account for practical limits such as context windows, output budgets, and task formulation to ensure reliable system performance, and hybrid approaches that externalize memory, chunk reasoning steps, or use compressed outputs can help work around these limitations. For industries deploying tools like copilots, autonomous agents, or decision-support systems, understanding both the constraints and the potential of reasoning LLMs is crucial, and reliance on synthetic benchmarks should be tempered with caution, since they may not capture the nuanced capabilities of these models in practical scenarios.

For ML researchers, the primary takeaway is the need for rigorous, realistic testing environments before making definitive claims about AI milestones or limitations. The ongoing debate reflects the complexity and rapid evolution of the field, and it underscores that careful, transparent evaluation is essential for progress.

Industry insiders and experts agree that the debate initiated by Apple's paper, while contentious, has brought critical attention to the importance of evaluation design, with companies such as Anthropic, known for their advanced LLMs, contributing to a more nuanced understanding of AI capabilities. The controversy also highlights Apple's lag in LLM development; some critics suggested the company's stance might be a strategic move to lower industry expectations. Despite these reservations, the paper and the responses to it have fostered a richer dialogue, pushing the boundaries of AI research and application.
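To make the compressed-output idea concrete, here is a minimal sketch of the kind of programmatic answer the rebuttal describes. The rebuttal's example used Lua; the Python version below is an analogy for illustration, not the authors' actual code.

```python
# A "compressed" answer: instead of writing out all 2^n - 1 moves in prose,
# the model emits a short recursive generator and the grader executes it.
# (Python stand-in for the Lua function mentioned in the rebuttal.)

def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B"):
    """Yield (disk, from_peg, to_peg) moves that solve Tower of Hanoi for n disks."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)  # clear the n-1 smaller disks
    yield (n, source, target)                       # move the largest disk
    yield from hanoi(n - 1, spare, target, source)  # restack the smaller disks on top

# A 15-disk solution spans 32,767 moves, yet the program expressing it stays a few lines long.
moves = list(hanoi(15))
assert len(moves) == 2 ** 15 - 1
```

Grading the program rather than the transcript separates "does the model know the algorithm?" from "can it fit the full move list in its output window?", which is exactly the distinction critics say Apple's setup collapsed.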