HyperAI
Visual Question Answering on MM-Vet v2
Metric: GPT-4 score

Performance results of various models on this benchmark:
| Model Name | GPT-4 score | Paper Title | Repository |
| --- | --- | --- | --- |
| InternVL-Chat-V1-2 | 45.5±0.1 | - | - |
| InternVL2-Llama3-76B | 68.4±0.3 | - | - |
| Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) | 71.8±0.2 | Claude 3.5 Sonnet Model Card Addendum | - |
| CogVLM-Chat | 45.1±0.2 | CogVLM: Visual Expert for Pretrained Language Models | - |
| LLaVA-NeXT-34B | 50.9±0.1 | - | - |
| Qwen-VL-Max | 55.8±0.2 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | - |
| Emu2-Chat | 38.0±0.1 | Generative Multimodal Models are In-Context Learners | - |
| Otter-9B | 23.2±0.1 | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | - |
| gemini-2.0-flash-exp | 77.1±0.1 | - | - |
| InternVL2-40B | 63.8±0.2 | - | - |
| Gemini Pro Vision | 57.2±0.2 | Gemini: A Family of Highly Capable Multimodal Models | - |
| OpenFlamingo-9B | 17.6±0.2 | OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | - |
| GPT-4o (gpt-4o-2024-11-20) | 72.1±0.2 | GPT-4 Technical Report | - |
| LLaVA-v1.5-13B | 33.2±0.1 | Improved Baselines with Visual Instruction Tuning | - |
| Claude 3 Opus (claude-3-opus-20240229) | 55.8±0.2 | - | - |
| InternVL-Chat-V1-5 | 51.5±0.2 | How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | - |
| IXC2-VL-7B | 42.5±0.3 | InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | - |
| GPT-4o (gpt-4o-2024-05-13) | 71.0±0.2 | GPT-4 Technical Report | - |
| LLaVA-v1.5-7B | 28.3±0.2 | Improved Baselines with Visual Instruction Tuning | - |
| Qwen2-VL-72B (qwen-vl-max-0809) | 66.9±0.3 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | - |
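The scores above are reported as mean±standard deviation of the GPT-4 score over repeated evaluations. A minimal sketch of how such entries might be parsed and ranked, using a small subset of the models from the table (the `parse_score` helper and dictionary layout are illustrative assumptions, not part of the benchmark's tooling):

```python
# Illustrative subset of the MM-Vet v2 leaderboard, taken from the table above.
# Scores are "mean±std" strings as displayed on the page.
leaderboard = {
    "gemini-2.0-flash-exp": "77.1±0.1",
    "GPT-4o (gpt-4o-2024-11-20)": "72.1±0.2",
    "Claude 3.5 Sonnet (claude-3-5-sonnet-20240620)": "71.8±0.2",
    "GPT-4o (gpt-4o-2024-05-13)": "71.0±0.2",
    "Qwen2-VL-72B (qwen-vl-max-0809)": "66.9±0.3",
    "Otter-9B": "23.2±0.1",
}

def parse_score(s: str) -> tuple[float, float]:
    """Split a 'mean±std' string into (mean, std) floats."""
    mean, std = s.split("±")
    return float(mean), float(std)

# Rank models by mean GPT-4 score, highest first.
ranked = sorted(leaderboard.items(),
                key=lambda kv: parse_score(kv[1])[0],
                reverse=True)

for rank, (model, score) in enumerate(ranked, start=1):
    print(f"{rank:2d}. {model}: {score}")
```

The ± interval here reflects run-to-run variance of the GPT-4 grader, so rankings between models whose intervals overlap should be read with caution.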