Open-source math models narrow the gap with OpenAI’s o3-preview on the AIMO2 benchmark, as NemoSkills and imagination-research lead the open-source charge in Olympiad-level reasoning
The performance gap between commercial and open-source large language models on Olympiad-level math problems is narrowing significantly, according to a new evaluation conducted by the Artificial Intelligence Mathematical Olympiad (AIMO) in collaboration with OpenAI. The study tested the unreleased o3-preview model, OpenAI’s generalist reasoning model, against the top-performing open-source models from AIMO Progress Prize 2 (AIMO2), including NemoSkills and imagination-research, as well as the collective performance of all 2,000+ participating teams. The 50 problems used in the evaluation were newly created and had not been seen by any of the models, making this one of the most robust contamination-free assessments of advanced mathematical reasoning to date.

The o3-preview model, tested in low-, medium-, and high-compute configurations, achieved strong results: 43/50 on low compute, 46/50 on medium compute, and 47/50 on high compute, counting only each problem's top-ranked answer. When second-ranked answers were also credited, the high-compute configuration solved all 50 problems, demonstrating strong potential (a simple sketch of this scoring distinction appears at the end of this article).

In comparison, the top two open-source models, NemoSkills and imagination-research, scored 34/50 and 31/50 respectively on Kaggle's public leaderboard. When re-evaluated on more powerful hardware (8x H100 GPUs) with no runtime restrictions, both models improved to 35/50, showing that their performance rises with more compute and fewer constraints.

The combined performance of the best submissions from all AIMO2 teams, referred to as AIMO2-combined, also reached 47/50, matching the high-compute o3-preview result. In other words, the collective reasoning power of the open-source community, when aggregated across teams, can nearly match a top-tier commercial model (this union-based aggregation is also sketched below).

Problem-level results highlight where the models diverge. o3-preview solved several problems, such as "TRIPAR" and "POLYDI", that none of the top open-source models solved. Conversely, some problems, like "EIGHTS", were solved only by o3-preview and lower-ranked Kaggle teams. The "RUNNER" problem was answered correctly only by NemoSkills and a subset of other teams; o3-preview's low- and medium-compute configurations missed it entirely, although the high-compute configuration produced the correct value as its second-ranked answer, suggesting a training or sampling limitation.

The evaluation also shows that, although o3-preview is a generalist model, its performance is close to that of the best open-source models once compute is taken into account. At roughly $1 per problem for low-compute runs, the cost is comparable to renting high-end hardware to run a single open-source model (a back-of-the-envelope comparison appears below). Adjusted for compute, the performance gap narrows considerably: on this benchmark, open-source models come within five points of o3-preview.

The AIMO team emphasizes that open-source models remain essential for scientific transparency and reproducibility. While commercial models still lead in raw performance, the rapid progress of open-source systems, especially through collaborative efforts and model ensembles, indicates that the divide is shrinking fast. AIMO Progress Prize 3, set to launch in Autumn 2025, will feature even more challenging problems at the International Mathematical Olympiad level, and will incorporate community feedback on the competition format and evaluation process. A detailed technical report will follow, offering deeper analysis of model behavior and problem-solving patterns across the AIMO2 dataset.
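The scoring distinction mentioned above, counting only a model's top-ranked answer versus also crediting its second-ranked answer, can be illustrated with a short script. This is a minimal sketch under assumed data structures, not AIMO's actual evaluation code; the problem names and answers below are hypothetical.

```python
# Minimal sketch of top-1 vs. top-2 scoring over ranked candidate answers.
# NOT AIMO's evaluation code; all problem names and answer values are hypothetical.

def score(submissions: dict[str, list[int]],
          reference: dict[str, int],
          max_rank: int = 1) -> int:
    """Count problems whose reference answer appears among the first
    `max_rank` candidates in the model's ranked answer list."""
    solved = 0
    for problem, answer in reference.items():
        candidates = submissions.get(problem, [])[:max_rank]
        if answer in candidates:
            solved += 1
    return solved

# Hypothetical example: two problems, one solved only at rank 2.
reference = {"problem_a": 12, "problem_b": 63}
submissions = {
    "problem_a": [12, 407],   # correct answer ranked first
    "problem_b": [90, 63],    # correct answer only ranked second
}

print(score(submissions, reference, max_rank=1))  # -> 1 (top-ranked answers only)
print(score(submissions, reference, max_rank=2))  # -> 2 (second-ranked answers credited)
```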
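The AIMO2-combined figure is the number of problems solved by at least one team's best submission, i.e. the size of the union of the per-team solved sets. A minimal sketch of that aggregation, with hypothetical team names and solved sets:

```python
# Minimal sketch of the "combined" score: a problem counts as solved if any
# team's best submission got it right. Team names and solved sets are hypothetical.

def combined_score(solved_by_team: dict[str, set[str]]) -> int:
    """Size of the union of all teams' solved-problem sets."""
    combined: set[str] = set()
    for solved in solved_by_team.values():
        combined |= solved
    return len(combined)

# Hypothetical data: three teams with overlapping solved sets.
solved_by_team = {
    "team_a": {"P01", "P02", "P03"},
    "team_b": {"P02", "P04"},
    "team_c": {"P03", "P05"},
}

print(combined_score(solved_by_team))  # -> 5 distinct problems solved across teams
```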
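The cost-comparability claim can be made concrete with a back-of-the-envelope calculation. The $1-per-problem figure for low-compute o3-preview runs comes from the evaluation; the GPU rental rate and runtime below are placeholder assumptions, not quoted prices, so the output is illustrative only.

```python
# Back-of-the-envelope cost comparison. The hourly rate and runtime are
# PLACEHOLDERS for illustration; only the ~$1-per-problem API figure is from the article.

NUM_PROBLEMS = 50

# Commercial model: roughly $1 per problem for low-compute runs.
api_cost_per_problem = 1.00
api_total = api_cost_per_problem * NUM_PROBLEMS

# Open-source model: hypothetical 8x H100 rental (rate and runtime assumed).
assumed_hourly_rate = 25.00   # hypothetical $/hour for an 8x H100 node
assumed_hours = 2.0           # hypothetical wall-clock time for all 50 problems
rental_total = assumed_hourly_rate * assumed_hours

print(f"API run:    ${api_total:.2f} total, ${api_total / NUM_PROBLEMS:.2f} per problem")
print(f"GPU rental: ${rental_total:.2f} total, ${rental_total / NUM_PROBLEMS:.2f} per problem")
```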
