Multi-Task Language Understanding on BBH-alg
Evaluation Metrics
Average (%)
Evaluation Results
Performance of each model on this benchmark
Model | Average (%) | Paper Title | Repository |
---|---|---|---|
Flan-PaLM 540B (3-shot, fine-tuned, CoT) | 61.3 | Scaling Instruction-Finetuned Language Models | |
code-davinci-002 175B (CoT) | 73.9 | Evaluating Large Language Models Trained on Code | |
PaLM 540B (CoT) | 57.6 | Scaling Instruction-Finetuned Language Models | |
Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) | 66.5 | Scaling Instruction-Finetuned Language Models | |
PaLM 540B | 38.3 | Scaling Instruction-Finetuned Language Models | |
Flan-PaLM 540B (3-shot, fine-tuned) | 48.2 | Scaling Instruction-Finetuned Language Models | |
PaLM 540B (CoT + self-consistency) | 62.2 | Scaling Instruction-Finetuned Language Models | |