Long-Context Understanding on MMNeedle

1 Image, 2*2 Stitching, Exact Accuracy

1 Image, 4*4 Stitching, Exact Accuracy

1 Image, 8*8 Stitching, Exact Accuracy

10 Images, 1*1 Stitching, Exact Accuracy

10 Images, 2*2 Stitching, Exact Accuracy

10 Images, 4*4 Stitching, Exact Accuracy

10 Images, 8*8 Stitching, Exact Accuracy

Evaluation Results

Performance of each model on this benchmark

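Model	1 Image, 2*2 Stitching, Exact Accuracy	1 Image, 4*4 Stitching, Exact Accuracy	1 Image, 8*8 Stitching, Exact Accuracy	10 Images, 1*1 Stitching, Exact Accuracy	10 Images, 2*2 Stitching, Exact Accuracy	10 Images, 4*4 Stitching, Exact Accuracy	10 Images, 8*8 Stitching, Exact Accuracy	Paper Title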
GPT-4o	94.6	83	19	97	81.8	26.9	1	GPT-4 Technical Report
Gemini Pro 1.5	90.34	39.85	29.81	89.94	45.21	6.09	0.62	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
GPT-4V	86.09	54.72	7.3	72.36	34.24	7.58	0	GPT-4 Technical Report
Claude 3 Opus	52.25	12.3	1.6	66.93	4.6	0.4	0	The Claude 3 Model Family: Opus, Sonnet, Haiku
LLaVA-Llama-3	43.8	17.5	3.3	0	0	0	0	LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Gemini Pro 1.0	29.53	24.78	2.11	16.25	4.82	0.4	0	Gemini: A Family of Highly Capable Multimodal Models
IDEFICS2-8B	18.9	7.8	0.9	0	0	0	0	What matters when building vision-language models?
CogVLM2-Llama-3	7.3	0.9	0.1	0	0	0	0	CogVLM: Visual Expert for Pretrained Language Models
InstructBLIP-Flan-T5-XXL	3.8	6.2	2.2	0	0	0	0	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
mPLUG-Owl-v2	1.9	0.3	0.7	0.4	0.1	0	0	mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
InstructBLIP-Vicuna-13B	0	0	0	0	0	0	0	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
CogVLM-17B	0	0.1	0.3	0	0	0	0	CogVLM: Visual Expert for Pretrained Language Models
