HyperAI超神経

Visual Question Answering (VQA) On

Evaluation Metric

ANLS
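
As a rough illustration of how this metric is computed, the sketch below assumes that ANLS here means Average Normalized Levenshtein Similarity as commonly used in document VQA benchmarks: each prediction is compared with every accepted answer, the best similarity is kept, similarities whose normalized edit distance reaches a threshold (assumed to be 0.5) are set to zero, and the per-question scores are averaged. The function names `levenshtein` and `anls`, the lowercasing and whitespace stripping, and the threshold default are assumptions for illustration, not the official evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best thresholded similarity.

    predictions: list of predicted answer strings, one per question.
    references:  list of lists of accepted answer strings, one list per question.
    threshold:   assumed cutoff; a normalized distance at or above it scores 0.
    Returns a value in [0, 1].
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ans in answers:
            g = ans.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    preds = ["Paris", "42"]
    refs = [["paris"], ["24", "forty-two"]]
    print(anls(preds, refs))
```

The function returns a value between 0 and 1; the scores listed on this page appear to be on a 0 to 100 scale, which would correspond to multiplying the result by 100.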

Results

Performance results of each model on this benchmark

| Model | ANLS | Paper Title | Repository |
| --- | --- | --- | --- |
| GPT-3.5 + LATIN-Prompt | 48.98 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | |
| Gemini Ultra (pixel only) | 80.3 | Gemini: A Family of Highly Capable Multimodal Models | |
| DUBLIN | 36.82 | DUBLIN -- Document Understanding By Language-Image Network | - |
| PaLI-X (Single-task FT) | 49.2 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | |
| Pix2Struct-base | 38.2 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | |
| PaLI-3 | 57.8 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | |
| DUBLIN (variable resolution) | 42.6 | DUBLIN -- Document Understanding By Language-Image Network | - |
| PaLI-X (Multi-task FT) | 50.7 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | |
| PaLI-X (Single-task FT w/ OCR) | 54.8 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | |
| SMoLA-PaLI-X Specialist | 66.2 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - |
| Pix2Struct-large | 40 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | |
| DocFormerv2-large | 48.8 | DocFormerv2: Local Features for Document Understanding | |
| MatCha | 37.2 | MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | |
| ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat) | 54.9 | LAPDoc: Layout-Aware Prompting for Documents | - |
| Claude + LATIN-Prompt | 54.51 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | |
| TILT-Large | 61.20 | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | |
| UDOP | 47.4 | Unifying Vision, Text, and Layout for Universal Document Processing | |
| ScreenAI 5B (4.62 B params, w/ OCR) | 65.90 | ScreenAI: A Vision-Language Model for UI and Infographics Understanding | |
| SMoLA-PaLI-X Generalist | 65.6 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - |
| PaLI-3 (w/ OCR) | 62.4 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | |