Mathematical Reasoning on AIME24
Evaluation Metrics
Acc (accuracy, %)
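Acc is read here as exact-match accuracy in percent. The sketch below assumes AIME 2024's 30 problems, each with a single integer answer from 0 to 999. A problem counts as correct only when the model's final answer equals the reference. `aime_accuracy` is a hypothetical helper, not the official scorer behind these results.

```python
from typing import Optional, Sequence


def aime_accuracy(predictions: Sequence[Optional[int]], references: Sequence[int]) -> float:
    """Hypothetical helper: exact-match accuracy in percent.

    A prediction of None means no answer could be extracted from the
    model's output, and it is counted as incorrect.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("references must be non-empty")
    # Count problems whose extracted integer answer matches the reference exactly.
    correct = sum(1 for p, r in zip(predictions, references) if p is not None and p == r)
    return 100.0 * correct / len(references)


# Example: 7 of 30 answers correct gives 7 / 30 * 100 = 23.33..., reported as 23.3.
refs = list(range(30))
preds = refs[:7] + [None] * 23
print(round(aime_accuracy(preds, refs), 1))
```

With 30 problems, each correct answer adds about 3.33 points. The same arithmetic gives 56.7 for 17 correct answers and 70.0 for 21.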
Evaluation Results
Performance of each model on this benchmark
Model | Acc | Paper Title | Repository |
---|---|---|---|
Qwen2.5-72B-Instruct | 23.3 | Qwen2.5 Technical Report | |
Search-o1 | 56.7 | Search-o1: Agentic Search-Enhanced Large Reasoning Models | |
OpenAI-o1 | 74.4 | - | - |
OpenAI-o1-preview | 44.6 | - | - |
Claude3.5-Sonnet | 16 | - | - |
OpenAI-o1-mini | 70.0 | - | - |
s1-32B | 56.7 | s1: Simple test-time scaling | |
DeepSeek-R1 | 79.8 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | |