HyperAI超神经

Math Word Problem Solving on SVAMP

Evaluation Metrics

Execution Accuracy
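As a rough illustration of the metric, the sketch below treats execution accuracy as the percentage of problems for which the numeric result of a model's predicted expression or program matches the gold answer. The function name `execution_accuracy`, the tolerance `tol`, and the convention that a failed execution is recorded as `None` are assumptions made for this example; the page itself does not define how each listed paper computes the score.

```python
import math


def execution_accuracy(predicted, gold, tol=1e-4):
    """Return the percentage of problems whose predicted answer matches gold.

    predicted: list of floats, or None where the model's output failed to execute
               (hypothetical convention for this sketch).
    gold:      list of reference numeric answers, same length as predicted.
    tol:       absolute tolerance for the numeric comparison (assumed value).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        1
        for p, g in zip(predicted, gold)
        if p is not None and math.isclose(p, g, abs_tol=tol)
    )
    return 100.0 * correct / len(gold)


# Toy data, not taken from the benchmark.
score = execution_accuracy([3.0, None, 5.0, 7.5], [3.0, 4.0, 6.0, 7.5])
```

Counting a failed execution as an incorrect answer, rather than dropping it from the denominator, keeps scores comparable across models that fail at different rates.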

Evaluation Results

Performance of each model on this benchmark

| Model Name | Execution Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| Qwen2(CoT + Code Interpreter) | 92.3 | - | - |
| GTS with RoBERTa | 41.0 | Are NLP Models really able to Solve Simple Math Word Problems? | |
| MMOS-CODE-7B(0-shot) | 76.4 | An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | |
| OpenMath-CodeLlama-70B (w/ code) | 87.8 | OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | |
| GPT-4 (Model Selection) | 93.7 | Automatic Model Selection with Large Language Models for Reasoning | |
| ATHENA (roberta-large) | 54.8 | ATHENA: Mathematical Reasoning with Thought Expansion | |
| ATHENA (roberta-base) | 45.6 | ATHENA: Mathematical Reasoning with Thought Expansion | |
| DeBERTa | 63.5 | Math Word Problem Solving by Generating Linguistic Variants of Problem Statements | |
| MsAT-DeductReasoner | 48.9 | Learning Multi-Step Reasoning by Solving Arithmetic Tasks | |
| LSTM Seq2Seq with RoBERTa | 40.3 | Are NLP Models really able to Solve Simple Math Word Problems? | |
| GPT-4 (Teaching-Inspired) | 93.9 | Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | - |
| PaLM (zero-shot) | 58.8 | Large Language Models are Zero-Shot Reasoners | |
| PaLM (zero-shot, CoT) | 62.1 | Large Language Models are Zero-Shot Reasoners | |
| GPT-4 DUP | - | Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | |
| GPT-4 (PHP) | 91.9 | Progressive-Hint Prompting Improves Reasoning in Large Language Models | |
| Roberta-DeductReasoner | 47.3 | Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction | |
| SYRELM (GPT-J) | 40.1 | Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning | |
| LLaMA 2-Chat | 69.2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | |
| Graph2Tree with RoBERTa | 43.8 | Are NLP Models really able to Solve Simple Math Word Problems? | |
| MMOS-DeepSeekMath-7B(0-shot) | 79.3 | An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | |