Gemini Ultra (pixel only) | 80.3 | Gemini: A Family of Highly Capable Multimodal Models | |
DUBLIN (variable resolution) | 42.6 | DUBLIN -- Document Understanding By Language-Image Network | - |
PaLI-X (Single-task FT w/ OCR) | 54.8 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | |
ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat) | 54.9 | LAPDoc: Layout-Aware Prompting for Documents | - |
ScreenAI 5B (4.62 B params, w/ OCR) | 65.9 | ScreenAI: A Vision-Language Model for UI and Infographics Understanding | |