Arithmetic Reasoning On Gsm8K

평가 지표

Accuracy

Parameters (Billion)

평가 결과

이 벤치마크에서 각 모델의 성능 결과

모델 이름	Accuracy	Parameters (Billion)	Paper Title	Repository
code-davinci-002 175B (LEVER, 8-shot)	84.5	175	LEVER: Learning to Verify Language-to-Code Generation with Execution
GPT-2-Medium 355M + question-solution classifier (BS=1)	16.8	0.355	Composing Ensembles of Pre-trained Models via Iterative Consensus	-
ToRA-Code 7B	72.6	7	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
OpenMath-CodeLlama-7B (w/ code)	75.9	7	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Gemini Pro (maj1@32)	86.5	-	Gemini: A Family of Highly Capable Multimodal Models
U-PaLM	58.5	540	Transcending Scaling Laws with 0.1% Extra Compute	-
MathCoder-CL-13B	74.1	7	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning
OpenMath2-Llama3.1-70B	94.9	-	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
ChatGPT (Ask, Refine, Trust)	82.6	-	The ART of LLM Refinement: Ask, Refine, and Trust	-
Camelidae-8×34B (5-shot)	78.3	-	Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks
Orca-Math 7B (fine-tuned)	86.8	7	Orca-Math: Unlocking the potential of SLMs in Grade School Math	-
PaLM 540B (Self Consistency)	74.4	540	Large Language Models Can Self-Improve	-
LLaMA 13B	17.8	13	LLaMA: Open and Efficient Foundation Language Models
MMOS-CODE-7B(0-shot)	73.9	7	An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning
OVM-Mistral-7B (verify100@1)	84.7	7	OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning
Minerva 62B (maj5@100)	89	62	Solving Quantitative Reasoning Problems with Language Models
Jiutian-大模型	95.2	75	-	-
DIVERSE 175B (8-shot)	83.2	175	Making Large Language Models Better Reasoners with Step-Aware Verifier	-
LLaMA 33B-maj1@k	53.1	33	LLaMA: Open and Efficient Foundation Language Models
DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code)	82.5	8	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

0 of 160 row(s) selected.