Instruction Following on IFEval
Metrics
Inst-level loose-accuracy: fraction of individual instructions satisfied under the loose check
Inst-level strict-accuracy: fraction of individual instructions satisfied under the strict check
Prompt-level loose-accuracy: fraction of prompts whose instructions are all satisfied under the loose check
Prompt-level strict-accuracy: fraction of prompts whose instructions are all satisfied under the strict check
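The strict variant verifies each instruction's constraint against the raw response, while the loose variant re-verifies after relaxing transformations (e.g., stripping markdown formatting or the response's first/last line); prompt-level scores credit a prompt only when every instruction in it passes. The following is a minimal Python sketch of this aggregation step, assuming per-instruction verifier outputs are already available as booleans; the `records` layout and `aggregate` helper are hypothetical illustrations, not the official IFEval harness API.

```python
# Minimal sketch of how the four IFEval metrics aggregate per-instruction
# verification results. Hypothetical layout: one record per prompt, holding
# one boolean per verifiable instruction for the strict check and for the
# loose check (loose = re-verified after relaxing transformations).

def aggregate(records):
    """records: list of dicts with 'strict' and 'loose' lists of bools,
    one bool per verifiable instruction in the prompt."""
    inst_strict = inst_loose = total_insts = 0
    prompt_strict = prompt_loose = 0
    for rec in records:
        total_insts += len(rec["strict"])
        inst_strict += sum(rec["strict"])      # instructions passing strictly
        inst_loose += sum(rec["loose"])        # instructions passing loosely
        prompt_strict += all(rec["strict"])    # prompt counts only if all pass
        prompt_loose += all(rec["loose"])
    n_prompts = len(records)
    return {
        "inst_strict_acc": 100 * inst_strict / total_insts,
        "inst_loose_acc": 100 * inst_loose / total_insts,
        "prompt_strict_acc": 100 * prompt_strict / n_prompts,
        "prompt_loose_acc": 100 * prompt_loose / n_prompts,
    }

# Example: two prompts; the first embeds two verifiable instructions.
records = [
    {"strict": [True, False], "loose": [True, True]},
    {"strict": [True], "loose": [True]},
]
print(aggregate(records))
# -> inst-level strict 66.7, loose 100.0; prompt-level strict 50.0, loose 100.0
```

Because prompt-level scores require every instruction in a prompt to pass, they are always at most the corresponding inst-level scores, a pattern visible in the results table below.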
Results
Performance of various models on this benchmark (all values are accuracy percentages).
Model Name | Inst-level loose-accuracy | Inst-level strict-accuracy | Prompt-level loose-accuracy | Prompt-level strict-accuracy | Paper Title | Repository |
---|---|---|---|---|---|---|
PaLM 2 S | 59.11 | 55.76 | 46.95 | 43.07 | Instruction-Following Evaluation for Large Language Models | |
AutoIF (Llama3 70B) | 90.4 | 86.7 | 85.6 | 80.2 | Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models | |
AutoIF (Qwen2 72B) | 88.0 | 86.1 | 82.3 | 80.2 | Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models | |
GPT-4 | 85.37 | 83.57 | 79.3 | 76.89 | Instruction-Following Evaluation for Large Language Models | |