PaLM 2 (few-shot, k=4, CoT) | 34.3 | PaLM 2 Technical Report | - |
Qwen2.5-Math-7B-Instruct (CoT, Greedy) | 83.6 | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | - |
Qwen2.5-Math-7B-Instruct (TIR, Greedy) | 85.2 | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | - |
Minerva 62B (maj1@k, k=64) | 43.4 | Solving Quantitative Reasoning Problems with Language Models | - |
GPT-4-code model (CSV, w/ code, SC, k=16) | 84.3 | Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | - |
Gemini 2.0 Flash Experimental | 89.7 | - | - |
CR (GPT-4-turbo model, w/ code) | 72.2 | Cumulative Reasoning with Large Language Models | - |
Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) | 43.5 | Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | - |
DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 45.5 | DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | - |
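Several entries above report sampling-and-voting results rather than single greedy decodes (e.g. "maj1@k, k=64" for Minerva and "SC, k=16" for the GPT-4 code entry). The sketch below illustrates that maj1@k / self-consistency aggregation only; the `sample_solution` callable is a hypothetical placeholder for whatever model and answer-extraction pipeline a given row used, not any specific system's API.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among the sampled solutions."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer


def maj1_at_k(problem: str,
              sample_solution: Callable[[str], str],  # hypothetical: samples one CoT and returns its final answer
              k: int = 64) -> str:
    """Sample k independent solutions for one problem and majority-vote their answers (maj1@k / SC)."""
    answers = [sample_solution(problem) for _ in range(k)]
    return majority_vote(answers)
```

Entries marked "Greedy" skip this aggregation and score a single temperature-0 decode; PRM rerank rows replace the majority vote with a reward-model-scored selection over the k samples.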