HyperAIHyperAI

Command Palette

Search for a command to run...

À quel point l'OCR-Reasoning est-il robuste ? Évaluation de la robustesse de l'OCR-Reasoning des modèles de vision-langage sous perturbations visuelles

Yuxing Cheng Yuan Wu Yi Chang

Résumé

Les modèles vision-langage (VLM) ont obtenu de solides performances sur les benchmarks basés sur la reconnaissance optique de caractères (OCR) et portent un intérêt croissant à la compréhension de contenus textuels denses, mais leur robustesse face à une dégradation visuelle contrôlée reste insuffisamment comprise. Cette lacune est critique pour le raisonnement OCR, où une corruption visuelle peut induire des erreurs OCR et des distorsions structurelles, introduisant ainsi une incertitude dans la tâche de raisonnement. Afin d'étudier systématiquement ce problème, nous présentons OCR-Robust, un benchmark conçu pour évaluer la robustesse du raisonnement OCR sous perturbations visuelles. Il comprend 812 échantillons répartis en deux sous-ensembles complémentaires : OCR1.0, qui couvre les documents, le texte de scène, les reçus, l'écriture manuscrite et le contenu mathématique, et OCR2.0, qui se concentre sur les graphiques, les diagrammes géométriques et les tableaux. Pour permettre une évaluation à la fois efficace et informative, nous menons une étude pilote sur 18 perturbations candidates et sélectionnons 5 types représentatifs à 3 niveaux de sévérité chacun, en fonction de leur impact et de leur discriminabilité inter-modèles. Nous évaluons la robustesse à l'aide de la précision sur données propres, de la Rétention relative de corruption (RCR), de la Rétention dans le pire des cas (WCR) et d'un Indice de robustesse à la corruption (CRI) composite, et nous soumettons 18 modèles à un benchmark, couvrant des systèmes propriétaires, des VLM open-source et des pipelines OCR+LLM. Nos résultats montrent qu'une précision sur données propres plus élevée n'implique pas nécessairement une robustesse plus forte, et que les modèles peuvent subir une dégradation prononcée dans le pire des cas sur les tâches OCR sensibles à la structure, les graphiques et les tableaux étant nettement plus fragiles que les entrées de type document sous perturbation.

One-sentence Summary

The authors introduce OCR-Robust, a benchmark comprising 812 samples across OCR1.0 and OCR2.0 subsets that systematically evaluates 18 vision-language models under five visual perturbation types at three severity levels using clean accuracy, Relative Corruption Retention, Worst-Case Retention, and the Corruption Robustness Index to quantify robustness in text-rich reasoning.

Key Contributions

  • The paper introduces OCR-Robust, a benchmark containing 812 samples divided into OCR1.0 for natural text and OCR2.0 for structured visual content such as charts and tables. This dataset provides a controlled diagnostic setting to systematically evaluate how visual perturbations impact text-rich perception and downstream reasoning.
  • A principled perturbation selection pipeline evaluates 18 candidate visual corruptions to identify five discriminative types, each applied at three severity levels. This methodology establishes a reproducible framework for probing model sensitivity under controlled degradation.
  • The study develops a robustness metric suite comprising clean accuracy, Relative Corruption Retention (RCR), Worst-Case Retention (WCR), and a composite Corruption Robustness Index (CRI) that aggregates baseline capability, average retention, and worst-case risk. Evaluations across 18 proprietary, open-source, and OCR-plus-LLM models demonstrate how these metrics quantify performance degradation under visual corruption.

Introduction

Vision-Language Models have achieved impressive results on clean benchmarks for document understanding and text recognition, but their reliability often falters in practical applications where images suffer from blur, uneven lighting, or physical distortion. Prior work largely overlooks this gap by focusing evaluations on pristine data or general object recognition, neglecting the specific risk that visual corruption introduces discrete symbolic errors which can severely disrupt downstream reasoning processes. The authors present OCR-Robust, a benchmark designed to systematically evaluate OCR reasoning robustness under visual perturbations, featuring a diverse dataset of natural and structured content, a principled selection of five effective perturbation types, and a suite of metrics including Relative Corruption Retention and Worst-Case Retention to diagnose model vulnerabilities.

Dataset

  • Dataset Composition and Sources The authors introduce OCR-Robust, a benchmark containing 812 samples designed to evaluate OCR reasoning robustness under visual perturbations. The dataset comprises two complementary subsets: OCR1.0 with 482 samples covering documents, scene text, receipts, handwriting, and mathematical content, and OCR2.0 with 330 samples focusing on charts, geometry diagrams, and tables. OCR2.0 draws from ChartVQA, GNS-260K, and TableVQA-Bench. OCR1.0 integrates data from DocVQA, TextVQA, MTWI, SROIE, Screen2Words, M6Doc, OCR-Reasoning, and InfographicVQA, while the math subset is derived from GSM8K and TheoremQA.

  • Subset Details and Construction For OCR2.0, the authors apply visual diversity sampling by extracting ResNet50 features, performing agglomerative clustering with 150 clusters, and selecting samples closest to cluster centroids. OCR1.0 uses direct selection for handwriting samples and VLM-assisted annotation for other subsets. The authors generate reasoning questions using GPT-5.2, Gemini-3-Pro, and Claude-Sonnet-4.5, randomly selecting one model per image, followed by human verification. The math subset is synthetically constructed by converting source questions to free-form answers, rendering images via GPT-4o-image, and filtering with PaddleOCR to ensure character recall exceeds 0.9.

  • Usage and Evaluation Protocol This dataset serves as an evaluation benchmark rather than a training set. The authors assess model performance using a zero-shot evaluation protocol. They apply five perturbations selected from a pilot study of 18 candidates, each applied at three severity levels. The final perturbations include glass blur, color shift, elastic transform, motion blur, and snow. The authors implement perturbations using OpenCV and PIL with fixed random seeds for reproducibility and save perturbed images as PNGs to avoid compression artifacts. Severity calibration during the pilot phase utilized LPIPS distances to ensure perceptual uniformity across perturbation types.

  • Processing and Quality Control The authors enforce strict quality control, including manual review for answer verifiability, OCR necessity, and image-text consistency. They remove near-duplicates using perceptual hashing with a distance threshold of 5. Images must meet a minimum resolution of 224 by 224 pixels and an RMS contrast threshold greater than 15. For diversity sampling, images are pre-processed by resizing to 256 pixels and center-cropping to 224 pixels before feature extraction. The authors also filter synthetic math samples based on PaddleOCR character recall to maintain textual fidelity.

Method

The evaluation protocol begins with a rigorous answer normalization step to ensure consistent comparison between ground-truth and predicted responses. Both answers undergo a standardization process that involves converting text to lowercase, removing punctuation (with exceptions for decimal points and currency symbols), stripping leading and trailing whitespace, and collapsing multiple spaces into a single space. Following normalization, the system attempts a direct exact match.

For cases where exact matching fails—excluding multiple-choice questions—the protocol employs a fallback mechanism utilizing GPT-4o as a semantic judge. This approach is particularly useful for free-form answers where phrasing may vary despite preserving the underlying meaning. To verify the reliability of this automated judgment, the authors conducted a manual review of 367 instances where GPT-4o was invoked. The results demonstrated high stability, with a 99.18% agreement rate between GPT-4o and human annotators, yielding Cohen’s κ\kappaκ scores of 0.967 for OCR1.0 and 0.903 for OCR2.0. Only three judgments were found to be inconsistent during manual inspection.

To guide model outputs, the evaluation framework incorporates format hints derived from the specific answer type. These hints are appended uniformly to the prompts. For instance, monetary value questions might include a hint specifying the expected composition, such as "$ + Integer". The general prompt structure integrates the input image, the question text, and the specific format hint, instructing the model to output the answer directly without explanation.

Refer to the prompt template structure below:

As illustrated in the figure above, the prompt template ensures that the model receives clear instructions regarding the expected output format. While format hints may theoretically favor models trained on structured outputs, the authors observed no systematic bias, noting that performance gaps between models of similar architectures remained significant.

Experiment

The OCR-Robust benchmark evaluates closed-source vision-language models, open-source variants, and decoupled OCR-LLM pipelines under controlled visual perturbations to assess their reasoning resilience across natural scenes and structured documents. Results indicate that proprietary systems consistently demonstrate superior worst-case retention, while scaling model capacity or employing chain-of-thought reasoning enhances baseline accuracy without reliably mitigating corruption-induced failures. Additionally, structured visual content proves significantly more vulnerable to degradation than natural text, and modular pipelines remain constrained by optical character extraction bottlenecks rather than downstream reasoning capabilities. These findings establish that robust visual understanding requires dedicated architectural strategies beyond mere scale or prompting, positioning OCR robustness as a distinct evaluation challenge.

The authors investigate compound perturbations by evaluating triple combinations of visual degradations. The results show that adding multiple perturbation types does not consistently lead to proportionally larger performance drops. Combinations involving color shift and snow exhibit similar performance and sub-additive behavior, while a combination including pixelate achieves higher accuracy but a super-additive degradation ratio. Combinations involving color shift, glass blur, and snow show similar performance and sub-additive degradation ratios. The combination of elastic, glass blur, and pixelate achieves higher accuracy and a super-additive degradation ratio. The findings suggest that overlapping failure modes can limit the cumulative impact of multiple perturbations.

The authors evaluate the robustness of various multimodal language models and OCR pipelines against visual perturbations. The results demonstrate that closed-source systems generally maintain better performance stability under corruption than open-source counterparts. Additionally, the evaluation reveals that models are significantly more sensitive to perturbations affecting structured visual content compared to natural scene text. Closed-source models consistently show lighter colors in the heatmaps, indicating lower relative accuracy drops compared to open-source models. The heatmap for structured content displays darker shades overall, suggesting models suffer greater performance degradation on this type of data. Perturbations like color shift and snow are associated with darker regions across many models, indicating they cause more significant accuracy loss.

The authors conduct a pilot analysis on compound perturbations using GPT-5.1 to examine how multiple visual degradations interact. They evaluate pairwise combinations of five perturbation types and calculate a degradation ratio to determine if the combined effect is additive, sub-additive, or super-additive compared to individual perturbations. The combination of pixelate and snow results in the largest accuracy drop and a super-additive interaction, suggesting that texture damage and occlusion strongly compound each other. Pairs involving weaker perturbations, such as elastic or glass blur combined with others, frequently exhibit sub-additive behavior where the combined impact is less than the sum of individual effects. Combinations like color shift and snow demonstrate near-additive interactions, whereas other pairings show varying degrees of sub-additivity depending on the specific perturbation types involved.

The authors evaluate the effect of Chain of Thought (CoT) prompting versus standard prompting on three open-source vision-language models across two OCR reasoning datasets. The data shows that CoT consistently drives significant improvements in clean accuracy and generally raises the Corruption Robustness Index. However, the influence on specific retention metrics like worst-case and relative corruption retention is inconsistent, particularly on the second dataset, suggesting that extended reasoning primarily strengthens baseline performance rather than uniformly enhancing corruption resilience. Chain of Thought prompting consistently yields higher clean accuracy across all models and datasets compared to standard prompting. The Corruption Robustness Index typically improves with CoT, although worst-case retention benefits are not guaranteed for every model-dataset pair. Performance gains from CoT are more stable on the first dataset, while the second dataset reveals greater variability in retention metrics despite accuracy improvements.

The authors conducted a bootstrap stability analysis to assess the reliability of their perturbation selection by resampling pilot questions one thousand times. The the the table displays the frequency and selection rate for each candidate perturbation across these resamples. The data confirms that the final five perturbations were consistently chosen more often than other candidates, validating the experimental setup. Elastic deformation was selected most frequently, appearing in the vast majority of resampling iterations. The five perturbations marked as final selections consistently outperformed alternatives like defocus blur and various noise types. Color shift had a notably lower selection frequency compared to the top-ranked perturbations, suggesting it is less dominant in the selection criteria.

The experiments evaluate the robustness of multimodal and OCR models against single, pairwise, and triple visual perturbations, assess the impact of Chain of Thought prompting, and validate perturbation selection through bootstrap resampling. Results demonstrate that closed-source systems generally maintain greater stability than open-source models, particularly when processing structured visual content, while compound degradations rarely cause proportionally larger performance drops due to overlapping failure modes. Furthermore, Chain of Thought prompting consistently improves baseline accuracy and general robustness but does not uniformly enhance worst-case corruption retention, and the final perturbation suite shows high selection consistency across resampling trials, confirming the reliability of the experimental design.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp