Multi-Task Language Understanding on MMLU
Evaluation Metric
Average (%)
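As a point of reference, the "Average (%)" reported here is typically the mean accuracy over MMLU's 57 subjects. The sketch below shows a minimal macro-average computation; the macro vs. micro choice, the function name, and the example numbers are all illustrative assumptions, not taken from this leaderboard.

```python
# Minimal sketch of computing an MMLU "Average (%)" score.
# Assumes per-subject (correct, total) counts are already available.
# Whether a paper reports a macro average (over subjects, as here) or a
# micro average (over all questions) varies; this is an assumption.

from typing import Dict, Tuple

def mmlu_average(per_subject: Dict[str, Tuple[int, int]]) -> float:
    """Macro-average accuracy (%) over MMLU subjects.

    per_subject maps subject name -> (num_correct, num_questions).
    """
    accuracies = [c / n for c, n in per_subject.values() if n > 0]
    return 100.0 * sum(accuracies) / len(accuracies)

# Hypothetical counts for three of MMLU's 57 subjects:
scores = {
    "abstract_algebra": (43, 100),
    "anatomy": (90, 135),
    "astronomy": (120, 152),
}
print(f"Average (%): {mmlu_average(scores):.1f}")
```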
Evaluation Results
Performance of each model on this benchmark.
Comparison Table
Model | Average (%) |
---|---|
albert-a-lite-bert-for-self-supervised | 27.1 |
roberta-a-robustly-optimized-bert-pretraining | 27.9 |
measuring-massive-multitask-language | 43.9 |
mixtral-of-experts | 70.6 |
unifiedqa-crossing-format-boundaries-with-a | 48.9 |
scaling-language-models-methods-analysis-1 | 29.5 |
few-shot-learning-with-retrieval-augmented | 47.9 |
language-models-are-few-shot-learners | 43.9 |
glm-130b-an-open-bilingual-pre-trained-model | 44.8 |
llama-2-open-foundation-and-fine-tuned-chat | 54.8 |
gpt-neox-20b-an-open-source-autoregressive-1 | 33.6 |
deepseek-r1-incentivizing-reasoning | 87.5 |
scaling-instruction-finetuned-language-models | 33.7 |
model-card-and-evaluations-for-claude-models | 73.4 |
scaling-instruction-finetuned-language-models | 73.5 |
the-falcon-series-of-open-language-models | 57.0 |
llama-open-and-efficient-foundation-language-1 | 68.9 |
mistral-7b | 60.1 |
Model 19 | 77.5 |
Model 20 | 56.7 |
textbooks-are-all-you-need-ii-phi-1-5 | 37.9 |
scaling-instruction-finetuned-language-models | 28.7 |
llama-2-open-foundation-and-fine-tuned-chat | 62.6 |
scaling-instruction-finetuned-language-models | 59.5 |
scaling-instruction-finetuned-language-models | 45.1 |
branch-train-mix-mixing-expert-llms-into-a | 53.2 |
Model 27 | 31 |
llama-2-open-foundation-and-fine-tuned-chat | 45.3 |
bloomberggpt-a-large-language-model-for | 39.2 |
training-compute-optimal-large-language | 67.5 |
the-claude-3-model-family-opus-sonnet-haiku | 75.2 |
mixtral-of-experts | 62.5 |
bloomberggpt-a-large-language-model-for | 39.1 |
parameter-efficient-sparsity-crafting-from | 75.6 |
infoentropy-loss-to-mitigate-bias-of-learning | 29.68 |
bloomberggpt-a-large-language-model-for | 36 |
llama-3-meets-moe-efficient-upcycling | 86.6 |
llama-3-meets-moe-efficient-upcycling | 86.0 |
the-falcon-series-of-open-language-models | 28.0 |
breaking-the-ceiling-of-the-llm-community-by | 83.54 |
the-llama-3-herd-of-models | 73.0 |
llama-open-and-efficient-foundation-language-1 | 63.4 |
scaling-instruction-finetuned-language-models | 45.5 |
measuring-massive-multitask-language | 32.4 |
Model 45 | 83.7 |
the-falcon-series-of-open-language-models | 70.6 |
scaling-instruction-finetuned-language-models | 35.9 |
galactica-a-large-language-model-for-science-1 | 52.6 |
scaling-instruction-finetuned-language-models | 72.2 |
claude-3-5-sonnet-model-card-addendum | 88.7 |
scaling-instruction-finetuned-language-models | 40.5 |
unifying-language-learning-paradigms | 39.2 |
scaling-instruction-finetuned-language-models | 39.7 |
gpt-4-technical-report-1 | 70.0 |
llama-open-and-efficient-foundation-language-1 | 57.8 |
the-llama-3-herd-of-models | 73.7 |
the-claude-3-model-family-opus-sonnet-haiku | 79 |
leeroo-orchestrator-elevating-llms | 75.9 |
Model 60 | 71.8 |
sieve-general-purpose-data-filtering-system | 87 |