HyperAI

Long Context Understanding on Ada-LEval

Metrics

Evaluation context lengths: 1k, 2k, 4k, 6k, 8k, 12k, 16k

Results

Scores of various models on this benchmark, broken down by evaluation context length.

| Model Name           | 1k   | 2k   | 4k   | 6k   | 8k   | 12k  | 16k  | Paper Title | Repository |
|----------------------|------|------|------|------|------|------|------|-------------|------------|
| Vicuna-7b-v1.5-16k   | 37.0 | 11.1 | 5.8  | 3.2  | 1.8  | 1.9  | 1.0  | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | -- |
| Claude-2             | 65.0 | 43.5 | 23.5 | 15.0 | 17.0 | 12.0 | 11.0 | -- | -- |
| LongChat-7b-v1.5-32k | 32.4 | 10.7 | 5.7  | 3.1  | 1.9  | 1.6  | 0.8  | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | -- |
| Vicuna-13b-v1.5-16k  | 53.4 | 29.2 | 13.1 | 4.3  | 2.2  | 1.4  | 0.9  | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | -- |
| GPT-3.5-Turbo-1106   | 61.5 | 48.5 | 41.5 | 29.5 | 17.0 | 2.5  | 2.5  | -- | -- |
| ChatGLM3-6b-32k      | 39.8 | 18.8 | 9.0  | 5.0  | 3.4  | 0.9  | 0.5  | GLM-130B: An Open Bilingual Pre-trained Model | -- |
| ChatGLM2-6b-32k      | 31.2 | 10.9 | 4.5  | 1.6  | 1.6  | 0.0  | 0.3  | GLM-130B: An Open Bilingual Pre-trained Model | -- |
| InternLM2-7b         | 58.6 | 49.5 | 33.9 | 12.3 | 13.4 | 2.0  | 0.8  | InternLM2 Technical Report | -- |
| GPT-4-Turbo-0125     | 73.5 | 73.5 | 65.5 | 63.0 | 56.5 | 52.0 | 44.5 | GPT-4 Technical Report | -- |
| GPT-4-Turbo-1106     | 74.0 | 73.5 | 67.5 | 59.5 | 53.5 | 49.5 | 44.0 | GPT-4 Technical Report | -- |