Math Word Problem Solving on MATH

Evaluation Metrics

Accuracy
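
As a rough illustration of the metric, accuracy here can be read as the percentage of problems whose final answer matches the reference answer. The sketch below assumes plain exact-match comparison on whitespace-stripped strings; the function name `accuracy` and the sample answers are hypothetical, and the actual grading used by each listed paper (for example, normalization of equivalent mathematical expressions) may differ.

```python
def accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference answer.

    Assumption: answers are compared as stripped strings. Real MATH graders
    usually normalize equivalent forms first; that step is not modeled here.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    # Hypothetical model outputs and reference answers.
    preds = ["42", "3/4", "x+1", "7"]
    refs = ["42", "3/4", "x + 1", "8"]
    print(accuracy(preds, refs))  # 2 of 4 match exactly
```

Note that "x+1" and "x + 1" count as a mismatch under this simple comparison, which is why published evaluations typically add an answer-normalization step before scoring.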

Evaluation Results

Performance results of each model on this benchmark

Model	Accuracy	Paper Title
Gemini 2.0 Flash Experimental	89.7	-
Qwen2.5-Math-72B-Instruct(TIR,Greedy)	88.1	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
GPT-4 Turbo (MACM, w/code, voting)	87.920	MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems
Qwen2.5-Math-72B-Instruct(COT,Greedy)	85.9	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
Qwen2.5-Math-7B-Instruct(TIR,Greedy)	85.2	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
GPT-4-code model (CSV, w/ code, SC, k=16)	84.3	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
Qwen2-Math-72B-Instruct(greedy)	84.0	Qwen2 Technical Report
Qwen2.5-Math-7B-Instruct(COT,Greedy)	83.6	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
Qwen2.5-Math-1.5B-Instruct(TIR,Greedy)	79.9	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
OpenMath2-Llama3.1-70B (majority@256)	79.6	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
OpenMath2-Llama3.1-8B (majority@256)	76.1	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
Qwen2.5-Math-1.5B-Instruct(COT,Greedy)	75.8	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
GPT-4-code model (CSV, w/ code)	73.5	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
CR (GPT-4-turbo model, w/ code)	72.2	Cumulative Reasoning with Large Language Models
OpenMath2-Llama3.1-70B	71.9	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
LogicNet (with code interpreter)	71.2	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code)	70.8	Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
GPT-4-code model (w/ code)	69.7	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
OpenMath2-Llama3.1-8B	67.8	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
AlphaMath-7B-SBS@3	66.3	AlphaMath Almost Zero: Process Supervision without Process
