HyperAI
HyperAI超神経
Visual Question Answering Vqa On
Evaluation metric
ANLS
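This page names the metric without defining it. The sketch below follows the commonly used definition of ANLS (Average Normalized Levenshtein Similarity). It rests on several assumptions: a threshold of 0.5, lowercasing and whitespace stripping before comparison, and scores on this leaderboard being the resulting 0 to 1 value multiplied by 100. The function names `levenshtein` and `anls` are illustrative and are not taken from any official evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity in the range [0, 1].

    predictions: one predicted answer string per question.
    references:  one list of accepted answer strings per question.
    threshold:   assumed cutoff; normalized distances at or above it score 0.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)  # keep the best-matching reference
        total += best
    return total / len(predictions)
```

Under these assumptions, multiplying the return value by 100 gives a figure on the same scale as the ANLS column in the results table.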
Evaluation results
Performance results of each model on this benchmark
| Model | ANLS | Paper Title |
| --- | --- | --- |
| Gemini Ultra (pixel only) | 80.3 | Gemini: A Family of Highly Capable Multimodal Models |
| SMoLA-PaLI-X Specialist | 66.2 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts |
| ScreenAI 5B (4.62 B params, w/ OCR) | 65.90 | ScreenAI: A Vision-Language Model for UI and Infographics Understanding |
| SMoLA-PaLI-X Generalist | 65.6 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts |
| UDOP (aux) | 63.0 | Unifying Vision, Text, and Layout for Universal Document Processing |
| PaLI-3 (w/ OCR) | 62.4 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger |
| TILT-Large | 61.20 | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer |
| PaLI-3 | 57.8 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger |
| ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat) | 54.9 | LAPDoc: Layout-Aware Prompting for Documents |
| PaLI-X (Single-task FT w/ OCR) | 54.8 | PaLI-X: On Scaling up a Multilingual Vision and Language Model |
| Claude + LATIN-Prompt | 54.51 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering |
| PaLI-X (Multi-task FT) | 50.7 | PaLI-X: On Scaling up a Multilingual Vision and Language Model |
| PaLI-X (Single-task FT) | 49.2 | PaLI-X: On Scaling up a Multilingual Vision and Language Model |
| GPT-3.5 + LATIN-Prompt | 48.98 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering |
| DocFormerv2-large | 48.8 | DocFormerv2: Local Features for Document Understanding |
| UDOP | 47.4 | Unifying Vision, Text, and Layout for Universal Document Processing |
| DUBLIN (variable resolution) | 42.6 | DUBLIN -- Document Understanding By Language-Image Network |
| Pix2Struct-large | 40 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding |
| Pix2Struct-base | 38.2 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding |
| MatCha | 37.2 | MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering |