OpenAI's o3 Model Underperforms and Hallucinates More
In April 2025, OpenAI unveiled its latest reasoning models, o3 and o4-mini, billed as a significant advance in the field of artificial intelligence. Third-party benchmark tests, however, cast a shadow over the company's claims about the models' performance, raising questions about transparency and testing methodology.

Benchmark Discrepancies and Explanations

Initial Claims and Live Demonstrations (December 2024): When o3 was first announced in December 2024, OpenAI's Chief Research Officer, Mark Chen, excited the tech community during a live presentation by asserting that o3 could solve over 25% of the problems in FrontierMath, a notoriously difficult math benchmark on which other models typically scored no better than 2%. The claim was backed by internal tests run at aggressive compute settings and highlighted o3's potential to reshape AI capabilities.

Third-Party Tests Reveal Lower Performance (2025): On April 18, 2025, the AI research institute Epoch AI published independent results showing the released version of o3 scoring only around 10% on FrontierMath, a stark contrast to OpenAI's 25% claim. Epoch AI acknowledged that its testing setup likely differed from OpenAI's internal one, but the gap was substantial either way, and it fueled skepticism about OpenAI's transparency and the robustness of its testing methods.

OpenAI's Response: Wenda Zhou, a member of OpenAI's technical staff, addressed the discrepancy during a live Q&A session. He explained that the production version of o3 is optimized for real-world applications and speed, so it uses less compute per query than the version demonstrated in December. While this optimization improves practical usability and cost-effectiveness, it also lowers scores on benchmarks that reward heavy test-time compute. OpenAI has acknowledged the difference and plans to release an enhanced variant, o3-pro, in the coming weeks, which is expected to deliver better benchmark results.

The "Hallucination" Problem in the New Models

Alongside the benchmark dispute, o3 and o4-mini face a critical issue known as hallucination: the generation of false or fabricated information, a persistent challenge across the AI field.

Internal and Third-Party Testing: OpenAI's internal tests show that o3 and o4-mini hallucinate more often than the company's previous reasoning models. On PersonQA, OpenAI's benchmark for factual questions about individuals, o3 hallucinated on 33% of questions, nearly double the rate of earlier models such as o1 and o3-mini; o4-mini fared worse still, at 48%. Third-party testing by the research lab Transluce corroborated these findings and documented instances of o3 fabricating actions it never performed, such as claiming to have run code on a 2021 MacBook Pro.
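How such a rate is computed is simple in outline, even though PersonQA itself is not public and OpenAI has not released its grading code. The sketch below is a hypothetical harness, not OpenAI's: `Question`, `grade`, and the `ask_model` callable are illustrative stand-ins. The key design point is that abstentions ("I don't know") are counted separately from hallucinations, since only confident wrong answers count against the model.

```python
"""Minimal sketch of a PersonQA-style hallucination metric.

Hypothetical harness: PersonQA is internal to OpenAI, so the names
and the grading rule here are illustrative assumptions.
"""
from dataclasses import dataclass
from typing import Callable

@dataclass
class Question:
    prompt: str      # e.g. "Where was Ada Lovelace born?"
    gold: set[str]   # accepted answers

def grade(answer: str, gold: set[str]) -> str:
    """Label an answer as correct, abstained, or hallucinated."""
    text = answer.strip().lower()
    if not text or "i don't know" in text:
        return "abstained"       # no claim made, so no hallucination
    if any(g.lower() in text for g in gold):
        return "correct"
    return "hallucinated"        # a confident but false claim

def hallucination_rate(questions: list[Question],
                       ask_model: Callable[[str], str]) -> float:
    """Fraction of attempted answers containing fabricated information."""
    labels = [grade(ask_model(q.prompt), q.gold) for q in questions]
    attempted = [lab for lab in labels if lab != "abstained"]
    return sum(lab == "hallucinated" for lab in attempted) / max(len(attempted), 1)

if __name__ == "__main__":
    qs = [Question("Where was Ada Lovelace born?", {"London"})]
    mock_model = lambda prompt: "She was born in Paris."  # confidently wrong
    print(hallucination_rate(qs, mock_model))             # 1.0
```

Under a scheme like this, a 33% rate would mean roughly one in three attempted answers contained fabricated content, which is why researchers consider the numbers reported for o3 and o4-mini alarming.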
Expert Insights and Recommendations: Neil Chowdhury, a researcher at Transluce, speculates that the reinforcement learning techniques used to train o3 and o4-mini may amplify issues that standard post-training pipelines usually mitigate. Sarah Schwettmann, Transluce's co-founder, believes the high hallucination rate will significantly diminish o3's practical value. Kian Katanforoosh, a lecturer at Stanford University, praised o3's strong performance on coding and math tasks but noted its tendency to generate invalid web links, which could hinder real-world adoption.

Potential Solutions: One proposed way to reduce hallucination is to equip models with web search capabilities. OpenAI's own experiments show that GPT-4o, when augmented with web search, achieves 90% accuracy on the SimpleQA benchmark. The approach raises privacy and data-security concerns, however, since users' queries would be shared with third-party search engines. (A minimal sketch of this approach appears at the end of this article.)

Shift in AI Development

Over the past year, the AI industry has pivoted toward reasoning models, as traditional models have been hitting diminishing returns on performance. Reasoning models handle harder tasks without requiring ever more training compute and data, but the shift has also coincided with rising hallucination rates, making the problem a crucial one for researchers to address.

Industry Reactions and Future Outlook

Despite the performance discrepancies and hallucination issues, experts generally agree that OpenAI's models still offer significant capability and utility, and the company's leading position in AI research is not seriously in question. The episode does, however, underscore the need for transparency and rigorous third-party testing. Katanforoosh emphasizes that while current reinforcement learning techniques continue to advance reasoning ability, overcoming hallucination is essential for real-world adoption.

Company Profiles and Background

OpenAI: OpenAI is a global leader in AI research, dedicated to building advanced and reliable AI systems. The organization is best known for groundbreaking models such as GPT-3 and GPT-4, which demonstrated capabilities across applications ranging from natural language processing to coding and mathematics.

Industry Insight: The AI landscape is increasingly competitive, and as companies like OpenAI push the limits of model performance, transparency and trust matter more, not less. Industry observers agree that while o3 and o4-mini have notable strengths, addressing hallucination is crucial to the models' credibility and broad appeal. OpenAI has committed to continued improvements in accuracy, stability, and reliability, recognizing the need to balance raw performance against trustworthiness.
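Below is the minimal sketch referenced under Potential Solutions. OpenAI has not published the pipeline behind the 90% SimpleQA figure, so this is a generic retrieval-grounding pattern rather than OpenAI's implementation; `web_search` and `generate` are hypothetical placeholders for a real search backend and model API.

```python
"""Illustrative retrieval-grounded answering, assuming a search backend.

Hypothetical sketch: `web_search` and `generate` are stand-ins, not
real APIs, and the prompt wording is an assumption.
"""

def web_search(query: str, k: int = 5) -> list[str]:
    """Placeholder for a real search API returning k text snippets."""
    raise NotImplementedError("wire up a real search backend here")

def generate(prompt: str) -> str:
    """Placeholder for a real language-model call."""
    raise NotImplementedError("wire up a real model API here")

def grounded_answer(question: str) -> str:
    # Retrieve evidence first, so the model can quote sources instead of
    # recalling facts from its weights alone -- the failure mode behind
    # most hallucinations on factual QA benchmarks.
    snippets = web_search(question)
    sources = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the question using ONLY the sources below. "
        "If the sources are insufficient, say you don't know.\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```

The privacy trade-off the article notes is visible in the first line of `grounded_answer`: every user question leaves the application and reaches whatever third-party search index backs `web_search`.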
