Instruction Following on IFEval

Evaluation Results

Performance of each model on this benchmark

Model	Inst-level loose-accuracy	Inst-level strict-accuracy	Prompt-level loose-accuracy	Prompt-level strict-accuracy	Paper Title
PaLM 2 S	59.11	55.76	46.95	43.07	Instruction-Following Evaluation for Large Language Models
AutoIF (Llama3 70B)	90.4	86.7	85.6	80.2	Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
AutoIF (Qwen2 72B)	88	86.1	82.3	80.2	Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
GPT-4	85.37	83.57	79.3	76.89	Instruction-Following Evaluation for Large Language Models
