Visual Question Answering (VQA) On

Metrics

ANLS
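
ANLS (Average Normalized Levenshtein Similarity) gives each prediction partial credit based on its edit distance to the closest ground-truth answer, and zeroes out any prediction whose normalized distance reaches a threshold (commonly 0.5). The sketch below is one minimal way to compute it, not the benchmark's official scorer. The function names `levenshtein` and `anls` are hypothetical, and the lowercasing and whitespace stripping are assumptions about answer normalization. It returns a value in [0, 1]; the scores in the table below appear to be on a 0 to 100 scale.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average, over questions, of the best thresholded similarity
    between the prediction and any accepted answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            # Similarity counts only when the normalized distance is below the threshold.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Each entry in `references` is the list of accepted answers for one question, so a call such as `anls(["paris"], [["Paris", "Paris, France"]])` scores the prediction against whichever accepted answer it matches best.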

Results

Performance results of various models on this benchmark

Model	ANLS	Paper Title
Gemini Ultra (pixel only)	80.3	Gemini: A Family of Highly Capable Multimodal Models
SMoLA-PaLI-X Specialist	66.2	Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
ScreenAI 5B (4.62 B params, w/ OCR)	65.90	ScreenAI: A Vision-Language Model for UI and Infographics Understanding
SMoLA-PaLI-X Generalist	65.6	Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
UDOP (aux)	63.0	Unifying Vision, Text, and Layout for Universal Document Processing
PaLI-3 (w/ OCR)	62.4	PaLI-3 Vision Language Models: Smaller, Faster, Stronger
TILT-Large	61.20	Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
PaLI-3	57.8	PaLI-3 Vision Language Models: Smaller, Faster, Stronger
ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat)	54.9	LAPDoc: Layout-Aware Prompting for Documents
PaLI-X (Single-task FT w/ OCR)	54.8	PaLI-X: On Scaling up a Multilingual Vision and Language Model
Claude + LATIN-Prompt	54.51	Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering
PaLI-X (Multi-task FT)	50.7	PaLI-X: On Scaling up a Multilingual Vision and Language Model
PaLI-X (Single-task FT)	49.2	PaLI-X: On Scaling up a Multilingual Vision and Language Model
GPT-3.5 + LATIN-Prompt	48.98	Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering
DocFormerv2-large	48.8	DocFormerv2: Local Features for Document Understanding
UDOP	47.4	Unifying Vision, Text, and Layout for Universal Document Processing
DUBLIN (variable resolution)	42.6	DUBLIN -- Document Understanding By Language-Image Network
Pix2Struct-large	40	Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Pix2Struct-base	38.2	Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
MatCha	37.2	MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering
