Long-Context Understanding on Ada-LEval
Evaluation Metrics
Accuracy is reported at each evaluated context length: 1k, 2k, 4k, 6k, 8k, 12k, 16k
Evaluation Results
Performance of each model on this benchmark
Model | 1k | 2k | 4k | 6k | 8k | 12k | 16k | Paper Title |
---|---|---|---|---|---|---|---|---|
Vicuna-7b-v1.5-16k | 37.0 | 11.1 | 5.8 | 3.2 | 1.8 | 1.9 | 1.0 | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena |
Claude-2 | 65.0 | 43.5 | 23.5 | 15.0 | 17.0 | 12.0 | 11.0 | - |
LongChat-7b-v1.5-32k | 32.4 | 10.7 | 5.7 | 3.1 | 1.9 | 1.6 | 0.8 | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena |
Vicuna-13b-v1.5-16k | 53.4 | 29.2 | 13.1 | 4.3 | 2.2 | 1.4 | 0.9 | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena |
GPT-3.5-Turbo-1106 | 61.5 | 48.5 | 41.5 | 29.5 | 17.0 | 2.5 | 2.5 | - |
ChatGLM3-6b-32k | 39.8 | 18.8 | 9.0 | 5.0 | 3.4 | 0.9 | 0.5 | GLM-130B: An Open Bilingual Pre-trained Model |
ChatGLM2-6b-32k | 31.2 | 10.9 | 4.5 | 1.6 | 1.6 | 0.0 | 0.3 | GLM-130B: An Open Bilingual Pre-trained Model |
InternLM2-7b | 58.6 | 49.5 | 33.9 | 12.3 | 13.4 | 2.0 | 0.8 | InternLM2 Technical Report |
GPT-4-Turbo-0125 | 73.5 | 73.5 | 65.5 | 63.0 | 56.5 | 52.0 | 44.5 | GPT-4 Technical Report |
GPT-4-Turbo-1106 | 74.0 | 73.5 | 67.5 | 59.5 | 53.5 | 49.5 | 44.0 | GPT-4 Technical Report |
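A useful way to read this table is to compare how much of a model's short-context score survives at the longest setting. The sketch below illustrates this with two rows copied from the table above; the `retention` helper is a hypothetical name introduced here for illustration, not part of the benchmark.

```python
# Scores (accuracy, %) taken from the Ada-LEval table above,
# keyed by context length, for two representative models.
scores = {
    "GPT-4-Turbo-0125": {"1k": 73.5, "2k": 73.5, "4k": 65.5, "6k": 63.0,
                         "8k": 56.5, "12k": 52.0, "16k": 44.5},
    "GPT-3.5-Turbo-1106": {"1k": 61.5, "2k": 48.5, "4k": 41.5, "6k": 29.5,
                           "8k": 17.0, "12k": 2.5, "16k": 2.5},
}

def retention(model_scores, short="1k", long="16k"):
    """Fraction of the short-context score retained at the long setting."""
    return model_scores[long] / model_scores[short]

for name, s in scores.items():
    print(f"{name}: retains {retention(s):.1%} of its 1k score at 16k")
```

By this measure GPT-4-Turbo-0125 keeps roughly 60% of its 1k accuracy at 16k, while GPT-3.5-Turbo-1106 keeps about 4%, which makes the degradation pattern in the table easy to quantify.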