Visual Question Answering (VQA) on CORE-MM

Metrics

Abductive

Analogical

Deductive

Overall score

Params

Results

Performance results of various models on this benchmark

Model	Abductive	Analogical	Deductive	Overall score	Params	Paper Title
GPT-4V	77.88	69.86	74.86	74.44	-	GPT-4 Technical Report
SPHINX v2	49.85	20.69	42.17	39.48	16B	SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
LLaVA-1.5	47.91	24.31	30.94	32.62	13B	Improved Baselines with Visual Instruction Tuning
CogVLM-Chat	47.88	28.75	36.75	37.16	17B	CogVLM: Visual Expert for Pretrained Language Models
LLaMA-Adapter V2	46.12	22.08	28.7	30.46	7B	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Qwen-VL-Chat	44.39	30.42	37.55	37.39	16B	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
InstructBLIP	37.76	20.56	27.56	28.02	8B	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Emu	36.57	18.19	28.9	28.24	14B	Emu: Generative Pretraining in Multimodality
InternLM-XComposer-VL	35.97	18.61	26.77	26.84	9B	InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Otter	33.64	13.33	22.49	22.69	7B	Otter: A Multi-Modal Model with In-Context Instruction Tuning
mPLUG-Owl2	20.6	7.64	23.43	20.05	7B	mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
BLIP-2-OPT2.7B	18.96	7.5	2.76	19.31	3B	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
MiniGPT-v2	13.28	5.69	11.02	10.43	8B	MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
OpenFlamingo-v2	5.3	1.11	8.88	6.82	9B	OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

