Multi-Task Language Understanding on BBH NLP
Evaluation Metric
Average (%)
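The Average (%) metric is typically the macro-average of per-task accuracy over the BBH NLP subtasks, reported as a percentage. Below is a minimal Python sketch of that computation; the exact-match scoring rule and the `per_task_results` structure are illustrative assumptions, not the exact evaluation harness used by the papers in the table.

```python
def task_accuracy(predictions, references):
    """Fraction of exact matches between model predictions and gold answers.

    Exact-match after whitespace stripping is an assumed scoring rule;
    individual papers may normalize answers differently.
    """
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def bbh_average(per_task_results):
    """Macro-average accuracy across tasks, as a percentage.

    per_task_results: dict mapping task name -> (predictions, references).
    Every task contributes equally, regardless of how many examples it has.
    """
    scores = [task_accuracy(p, r) for p, r in per_task_results.values()]
    return 100.0 * sum(scores) / len(scores)


# Usage example with hypothetical task names and answers:
results = {
    "causal_judgement": (["Yes", "No"], ["Yes", "Yes"]),
    "snarks": (["(A)", "(B)"], ["(A)", "(B)"]),
}
print(f"Average (%): {bbh_average(results):.1f}")  # -> Average (%): 75.0
```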
Evaluation Results
Performance of each model on this benchmark:
| Model | Average (%) | Paper Title | Repository |
|---|---|---|---|
| Qwen2.5-72B | 86.3 | - | - |
| PaLM 540B (CoT) | 71.2 | Scaling Instruction-Finetuned Language Models | - |
| Orca 2-7B | 45.93 | Orca 2: Teaching Small Language Models How to Reason | - |
| PaLM 540B | 62.7 | Scaling Instruction-Finetuned Language Models | - |
| Flan-PaLM 540B (5-shot, fine-tuned) | 70.0 | Scaling Instruction-Finetuned Language Models | - |
| Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) | 78.4 | Scaling Instruction-Finetuned Language Models | - |
| PaLM 540B (CoT + self-consistency) | 78.2 | Scaling Instruction-Finetuned Language Models | - |
| Orca 2-13B | 50.18 | Orca 2: Teaching Small Language Models How to Reason | - |
| code-davinci-002 175B (CoT) | 73.5 | Evaluating Large Language Models Trained on Code | - |
| Qwen2-72B | 82.4 | - | - |
| Jiutian Large Model | 86.1 | - | - |
| Flan-PaLM 540B (3-shot, fine-tuned, CoT) | 72.4 | Scaling Instruction-Finetuned Language Models | - |
| Llama-3-405B | 85.9 | - | - |
| Jiutian-57B | 84.07 | - | - |
| Llama-3-70B | 81.0 | - | - |