Visual Question Answering on MM-Vet v2

GPT-4 score

Evaluation Results

Performance results of each model on this benchmark

Model	GPT-4 score	Paper Title
gemini-2.0-flash-exp	77.1±0.1	-
GPT-4o (gpt-4o-2024-11-20)	72.1±0.2	GPT-4 Technical Report
Claude 3.5 Sonnet (claude-3-5-sonnet-20240620)	71.8±0.2	Claude 3.5 Sonnet Model Card Addendum
GPT-4o (gpt-4o-2024-05-13)	71.0±0.2	GPT-4 Technical Report
InternVL2-Llama3-76B	68.4±0.3	-
Qwen2-VL-72B (qwen-vl-max-0809)	66.9±0.3	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Gemini 1.5 Pro	66.9±0.2	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
gpt-4o-mini-2024-07-18	66.8±0.3	GPT-4 Technical Report
GPT-4 Turbo (gpt-4-0125-preview)	66.3±0.2	GPT-4 Technical Report
InternVL2-40B	63.8±0.2	-
Gemini Pro Vision	57.2±0.2	Gemini: A Family of Highly Capable Multimodal Models
Qwen-VL-Max	55.8±0.2	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Claude 3 Opus (claude-3-opus-20240229)	55.8±0.2	-
InternVL-Chat-V1-5	51.5±0.2	How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
LLaVA-NeXT-34B	50.9±0.1	-
InternVL-Chat-V1-2	45.5±0.1	-
CogVLM-Chat	45.1±0.2	CogVLM: Visual Expert for Pretrained Language Models
IXC2-VL-7B	42.5±0.3	InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Emu2-Chat	38.0±0.1	Generative Multimodal Models are In-Context Learners
CogAgent-Chat	34.7±0.2	CogAgent: A Visual Language Model for GUI Agents
