HyperAI

Math Word Problem Solving on MATH

Metrics

Accuracy

Results

Performance results of various models on this benchmark

| Model Name | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| Mixtral 8x7B (maj@4) | 28.4 | Mixtral of Experts | |
| PaLM 2 (few-shot, k=4, CoT) | 34.3 | PaLM 2 Technical Report | |
| Qwen2.5-Math-7B-Instruct (CoT, Greedy) | 83.6 | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | - |
| Qwen2.5-Math-7B-Instruct (TIR, Greedy) | 85.2 | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | - |
| ToRA 70B (w/ code, SC, k=50) | 56.9 | ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | |
| Minerva 62B (maj1@k, k=64) | 43.4 | Solving Quantitative Reasoning Problems with Language Models | |
| DAMOMath-7B | 64.5 | - | - |
| GPT-4-code model (CSV, w/ code, SC, k=16) | 84.3 | Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | |
| Gemini 2.0 Flash Experimental | 89.7 | - | - |
| MMOS-DeepSeekMath-7B (0-shot) | 55.0 | An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | |
| CR (GPT-4-turbo model, w/ code) | 72.2 | Cumulative Reasoning with Large Language Models | |
| MuggleMATH 7B | 25.8 | MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | |
| OpenChat-3.5 7B | 28.6 | OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | - |
| LLaMA 13B | 3.9 | LLaMA: Open and Efficient Foundation Language Models | |
| PHP (GPT-4 model) | 53.9 | Progressive-Hint Prompting Improves Reasoning in Large Language Models | |
| GPT-3-13B (few-shot) | 3.0 | Measuring Mathematical Problem Solving With the MATH Dataset | |
| WizardMath-7B-V1.1 | 33.0 | WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | |
| Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) | 43.5 | Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | |
| MuggleMATH-70B | 35.6 | MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | |
| DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 45.5 | DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | |
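Several entries above are labeled with majority voting over k sampled answers (maj@k, or SC for self-consistency). The sketch below is one plausible way such an accuracy figure could be computed. It is not the evaluation code used by any of the listed papers. The function names `majority_vote` and `accuracy`, the toy data, and the use of exact string matching are all assumptions made for illustration; real evaluations typically normalize mathematical answers before comparing them.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled answers."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions, references):
    """Percentage of problems whose predicted answer equals the reference.

    Assumption: answers are compared by exact string equality.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical toy data: 3 problems, k=4 sampled answers per problem.
samples = [
    ["4", "4", "5", "4"],
    ["10", "12", "12", "7"],
    ["3", "2", "1", "2"],
]
references = ["4", "12", "3"]

voted = [majority_vote(s) for s in samples]
score = accuracy(voted, references)
print(voted, round(score, 1))
```

In this toy run the vote recovers the reference answer for the first two problems and misses the third, so two of three problems count as correct. A greedy-decoding entry corresponds to the special case of a single answer per problem.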