| Model | Accuracy (%) | Params (B) | Paper | Code |
|---|---|---|---|---|
| code-davinci-002 175B (LEVER, 8-shot) | 84.5 | 175 | LEVER: Learning to Verify Language-to-Code Generation with Execution | - |
| GPT-2-Medium 355M + question-solution classifier (BS=1) | 16.8 | 0.355 | Composing Ensembles of Pre-trained Models via Iterative Consensus | - |
| OpenMath-CodeLlama-7B (w/ code) | 75.9 | 7 | OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | - |
| ChatGPT (Ask, Refine, Trust) | 82.6 | - | The ART of LLM Refinement: Ask, Refine, and Trust | - |
| PaLM 540B (Self Consistency) | 74.4 | 540 | Large Language Models Can Self-Improve | - |
| DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 82.5 | 8 | DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | - |