Visual Question Answering (VQA)
Visual Question Answering Vqa On
Evaluation Metric
ANLS
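The page names the metric without defining it. As a minimal sketch, assuming ANLS here means Average Normalized Levenshtein Similarity as it is commonly defined for document and scene-text VQA benchmarks (per-question best match over the accepted answers, with similarities cut to zero once the normalized edit distance reaches a threshold of 0.5), it could be computed as below. The function names, the lowercasing and whitespace stripping, and the default threshold are assumptions for illustration, not details taken from this page or from any benchmark's official scorer.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Hypothetical ANLS scorer: mean over questions of the best similarity.

    references[i] holds every accepted answer for question i.
    The 0.5 threshold and the normalization are assumed conventions.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            gold = answer.strip().lower()
            longest = max(len(pred), len(gold))
            distance = levenshtein(pred, gold) / longest if longest else 0.0
            # Similarities at or beyond the threshold count as a miss.
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)
```

Under these assumptions the function returns a value between 0 and 1. The scores listed on this page appear to be the same quantity expressed on a 0 to 100 scale.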
Evaluation Results
Performance results for each model on this benchmark
| Model Name | ANLS | Paper Title |
| --- | --- | --- |
| Gemini Ultra (pixel only) | 80.3 | Gemini: A Family of Highly Capable Multimodal Models |
| SMoLA-PaLI-X Specialist | 66.2 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts |
| ScreenAI 5B (4.62 B params, w/ OCR) | 65.90 | ScreenAI: A Vision-Language Model for UI and Infographics Understanding |
| SMoLA-PaLI-X Generalist | 65.6 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts |
| UDOP (aux) | 63.0 | Unifying Vision, Text, and Layout for Universal Document Processing |
| PaLI-3 (w/ OCR) | 62.4 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger |
| TILT-Large | 61.20 | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer |
| PaLI-3 | 57.8 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger |
| ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat) | 54.9 | LAPDoc: Layout-Aware Prompting for Documents |
| PaLI-X (Single-task FT w/ OCR) | 54.8 | PaLI-X: On Scaling up a Multilingual Vision and Language Model |
| Claude + LATIN-Prompt | 54.51 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering |
| PaLI-X (Multi-task FT) | 50.7 | PaLI-X: On Scaling up a Multilingual Vision and Language Model |
| PaLI-X (Single-task FT) | 49.2 | PaLI-X: On Scaling up a Multilingual Vision and Language Model |
| GPT-3.5 + LATIN-Prompt | 48.98 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering |
| DocFormerv2-large | 48.8 | DocFormerv2: Local Features for Document Understanding |
| UDOP | 47.4 | Unifying Vision, Text, and Layout for Universal Document Processing |
| DUBLIN (variable resolution) | 42.6 | DUBLIN -- Document Understanding By Language-Image Network |
| Pix2Struct-large | 40 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding |
| Pix2Struct-base | 38.2 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding |
| MatCha | 37.2 | MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering |
Visual Question Answering Vqa On | SOTA | HyperAI초신경