OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Abstract
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, with 10,000 human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with 1,500 manually annotated images. The consistent evaluation trends observed across both public and private test sets validate OCRBench v2's reliability. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (out of 100) and suffer from five types of limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The project website is at: https://99franklin.github.io/ocrbench_v2/