Long-Context Understanding on Ada-LEval TSort
Evaluation Metrics
2k
4k
8k
16k
32k
64k
128k
Evaluation Results
Performance of each model on this benchmark
Comparison Table
Model | 2k | 4k | 8k | 16k | 32k | 64k | 128k |
---|---|---|---|---|---|---|---|
gpt-4-technical-report-1 | 15.5 | 16.5 | 8.5 | 5.5 | 2.0 | 4.0 | 2.0 |
Model 2 | 4.0 | 4.5 | 4.5 | 5.5 | - | - | - |
glm-130b-an-open-bilingual-pre-trained-model | 0.9 | 0.2 | 0.7 | 0.9 | - | - | - |
glm-130b-an-open-bilingual-pre-trained-model | 2.3 | 2.4 | 2.0 | 0.7 | - | - | - |
judging-llm-as-a-judge-with-mt-bench-and-1 | 5.3 | 5.0 | 3.1 | 2.5 | - | - | - |
judging-llm-as-a-judge-with-mt-bench-and-1 | 5.3 | 2.2 | 2.3 | 1.7 | - | - | - |
Model 7 | 5.0 | 5.0 | 4.5 | 3.0 | 0.0 | 0.0 | - |
judging-llm-as-a-judge-with-mt-bench-and-1 | 5.4 | 5.0 | 2.4 | 3.1 | - | - | - |
gpt-4-technical-report-1 | 18.5 | 15.5 | 7.5 | 3.5 | 6.0 | 6.0 | 6.0 |
internlm2-technical-report | 5.1 | 3.9 | 5.1 | 4.3 | - | - | - |