Long Context Understanding on Ada-LEval TSort
Evaluation Metrics
2k
4k
8k
16k
32k
64k
128k
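Each metric above is a context-length bucket: Ada-LEval's TSort task shuffles the segments of a long text and asks the model to recover their original order, and a prediction typically counts as correct only when the entire predicted order matches the reference exactly. A minimal sketch of such per-bucket exact-match scoring (function and variable names are hypothetical, not from the benchmark's codebase):

```python
from collections import defaultdict

def exact_match_accuracy(predictions, references):
    """Fraction of samples whose predicted segment order matches the reference exactly."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(predictions)

def accuracy_by_length(samples):
    """Group (context_length, predicted_order, reference_order) triples by
    context-length bucket (2k, 4k, ...) and score each bucket separately."""
    buckets = defaultdict(list)
    for length, pred, ref in samples:
        buckets[length].append((pred, ref))
    return {
        length: exact_match_accuracy([p for p, _ in pairs], [r for _, r in pairs])
        for length, pairs in buckets.items()
    }

# Toy data: segment orders are lists of original indices.
samples = [
    ("2k", [0, 1, 2, 3], [0, 1, 2, 3]),  # fully correct order
    ("2k", [1, 0, 2, 3], [0, 1, 2, 3]),  # one swap -> counts as wrong
    ("8k", [2, 0, 1], [2, 0, 1]),        # fully correct order
]
print(accuracy_by_length(samples))  # {'2k': 0.5, '8k': 1.0}
```

Under all-or-nothing scoring like this, accuracy falls off quickly as context length grows, which matches the steep 2k-to-128k decline in the table below.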
Evaluation Results
Performance of each model on this benchmark:
Model Name | 2k | 4k | 8k | 16k | 32k | 64k | 128k | Paper Title | Repository |
---|---|---|---|---|---|---|---|---|---|
GPT-4-Turbo-0125 | 15.5 | 16.5 | 8.5 | 5.5 | 2.0 | 4.0 | 2.0 | GPT-4 Technical Report | |
GPT-3.5-Turbo-1106 | 4.0 | 4.5 | 4.5 | 5.5 | - | - | - | - | - |
ChatGLM2-6b-32k | 0.9 | 0.2 | 0.7 | 0.9 | - | - | - | GLM-130B: An Open Bilingual Pre-trained Model | |
ChatGLM3-6b-32k | 2.3 | 2.4 | 2.0 | 0.7 | - | - | - | GLM-130B: An Open Bilingual Pre-trained Model | |
LongChat-7b-v1.5-32k | 5.3 | 5.0 | 3.1 | 2.5 | - | - | - | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | |
Vicuna-7b-v1.5-16k | 5.3 | 2.2 | 2.3 | 1.7 | - | - | - | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | |
Claude-2 | 5.0 | 5.0 | 4.5 | 3.0 | 0.0 | 0.0 | - | - | - |
Vicuna-13b-v1.5-16k | 5.4 | 5.0 | 2.4 | 3.1 | - | - | - | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | |
GPT-4-Turbo-1106 | 18.5 | 15.5 | 7.5 | 3.5 | 6.0 | 6.0 | 6.0 | GPT-4 Technical Report | |
InternLM2-7b | 5.1 | 3.9 | 5.1 | 4.3 | - | - | - | InternLM2 Technical Report | |