Long Context Understanding on Ada-LEval (TSort)
Evaluation Metrics
2k
4k
8k
16k
32k
64k
128k
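Each metric column corresponds to one evaluated context length. In Ada-LEval's TSort task the model is given the segments of a long text in shuffled order and must restore the original order; a sample typically counts as correct only if the entire predicted order matches. The sketch below shows such an exact-match scorer under that assumption; the function names and input format are illustrative, not the benchmark's official code.

```python
from typing import Sequence

def tsort_exact_match(predicted: Sequence[int], reference: Sequence[int]) -> bool:
    # Correct only if the entire predicted segment order matches the reference.
    return list(predicted) == list(reference)

def tsort_accuracy(predictions: Sequence[Sequence[int]],
                   references: Sequence[Sequence[int]]) -> float:
    # Percentage of samples whose full ordering is reproduced exactly.
    correct = sum(tsort_exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# One of the two predicted orderings is fully correct -> 50.0
print(tsort_accuracy([[2, 0, 1], [0, 1, 2]], [[2, 0, 1], [2, 1, 0]]))
```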
Evaluation Results
Performance of each model on this benchmark; a short plotting sketch follows the table.
Model | 2k | 4k | 8k | 16k | 32k | 64k | 128k | Paper Title | Repository |
---|---|---|---|---|---|---|---|---|---|
GPT-4-Turbo-0125 | 15.5 | 16.5 | 8.5 | 5.5 | 2.0 | 4.0 | 2.0 | GPT-4 Technical Report | - |
GPT-3.5-Turbo-1106 | 4.0 | 4.5 | 4.5 | 5.5 | - | - | - | - | - |
ChatGLM2-6b-32k | 0.9 | 0.2 | 0.7 | 0.9 | - | - | - | GLM-130B: An Open Bilingual Pre-trained Model | - |
ChatGLM3-6b-32k | 2.3 | 2.4 | 2.0 | 0.7 | - | - | - | GLM-130B: An Open Bilingual Pre-trained Model | - |
LongChat-7b-v1.5-32k | 5.3 | 5.0 | 3.1 | 2.5 | - | - | - | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | - |
Vicuna-7b-v1.5-16k | 5.3 | 2.2 | 2.3 | 1.7 | - | - | - | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | - |
Claude-2 | 5.0 | 5.0 | 4.5 | 3.0 | 0.0 | 0.0 | - | - | - |
Vicuna-13b-v1.5-16k | 5.4 | 5.0 | 2.4 | 3.1 | - | - | - | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | - |
GPT-4-Turbo-1106 | 18.5 | 15.5 | 7.5 | 3.5 | 6.0 | 6.0 | 6.0 | GPT-4 Technical Report | - |
InternLM2-7b | 5.1 | 3.9 | 5.1 | 4.3 | - | - | - | InternLM2 Technical Report | - |
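Scores generally degrade as the evaluated context length grows (e.g. GPT-4-Turbo-1106 drops from 18.5 at 2k to 6.0 at 128k). Below is a minimal plotting sketch for visualizing that trend; it assumes the table has been exported to a hypothetical CSV file named `ada_leval_tsort.csv` with the same column names, and is not part of the benchmark's official tooling.

```python
import pandas as pd
import matplotlib.pyplot as plt

CONTEXT_LENGTHS = ["2k", "4k", "8k", "16k", "32k", "64k", "128k"]

# "-" entries become NaN so settings a model was not evaluated on are skipped.
df = pd.read_csv("ada_leval_tsort.csv", na_values=["-"])

fig, ax = plt.subplots(figsize=(8, 5))
for _, row in df.iterrows():
    # One line per model: score at each evaluated context length.
    ax.plot(CONTEXT_LENGTHS, [row[c] for c in CONTEXT_LENGTHS],
            marker="o", label=row["Model"])

ax.set_xlabel("Context length")
ax.set_ylabel("Score")
ax.set_title("Ada-LEval TSort: score vs. context length")
ax.legend(fontsize="small")
fig.tight_layout()
fig.savefig("ada_leval_tsort_scores.png")
```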