
A Summary of Six Major OCR Models, Open-Sourced by Google, IBM, Tencent, Xiaohongshu, and Tsinghua University, With Lightweight Architectures That Boost Recognition Accuracy and Efficiency


Among the many applications of artificial intelligence, OCR (Optical Character Recognition) is undoubtedly one of the most mature and practical technologies. Its core goal is to automatically convert characters in images, scanned documents, street scenes, bills, and even handwriting into editable, searchable digital text. Early OCR relied heavily on rules and templates, had limited functionality, and could often recognize only printed characters. With the introduction of deep learning, however, especially convolutional neural networks (CNNs) and sequence modeling methods, OCR's recognition accuracy and scope of application have taken a qualitative leap.

Today, OCR is widely used in scenarios such as automated processing of financial bills, identity document review, license plate recognition, e-book digitization, intelligent translation, and medical record entry. Research and industry have also produced a series of representative models and frameworks. For example, CRNN (Convolutional Recurrent Neural Network) established the classic paradigm of end-to-end text recognition, and architectures such as TPS-ResNet-BiLSTM-Attention have advanced text recognition in complex scenes. From the groundbreaking InkSight model released by Google to the recently launched lightweight models POINTS-Reader and Granite-docling, OCR technology has shown great potential in lightweight, cross-language, and multimodal recognition tasks.
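The end-to-end paradigm that CRNN established typically pairs a convolutional feature extractor with a sequence model trained under CTC loss, and the final text is recovered by collapsing the per-frame label predictions. A minimal sketch of that CTC greedy-decoding step, with a hypothetical three-letter alphabet (the label map and frame outputs below are illustrative, not taken from any of the models in this article):

```python
# Minimal CTC greedy decoding, as used by CRNN-style text recognizers.
# Rule: collapse consecutive repeated labels, then drop CTC blanks.

BLANK = 0  # CTC blank index (a common convention; the position is model-specific)
ALPHABET = {1: "c", 2: "a", 3: "t"}  # hypothetical label-to-character map

def ctc_greedy_decode(frame_labels):
    """Decode per-frame argmax labels into text: collapse repeats, remove blanks."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            decoded.append(ALPHABET[label])
        prev = label
    return "".join(decoded)

# Per-frame argmax output of a recognizer over 8 time steps:
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 3, 3]))  # -> "cat"
```

Production systems usually replace this greedy step with beam search, often combined with a language model, but the collapse-then-drop-blanks rule is the same.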

Currently, the "Tutorials" section of HyperAI's official website offers tutorials for multiple open-source OCR models. If you want to experience the powerful capabilities of OCR technology for efficient image and text information extraction, scene recognition, and multi-language, multi-format matching, visit the hyper.ai tutorial section and try the one-click launch tutorials!

1. POINTS-Reader: A distillation-free, end-to-end lightweight model

* Run online: https://go.hyper.ai/amhh4

Jointly launched by Tencent, Shanghai Jiao Tong University, and Tsinghua University, this model is a lightweight vision-language model (VLM) designed specifically for document image-to-text conversion. Using a two-stage self-evolutionary framework, it achieves high-precision end-to-end recognition of complex Chinese and English documents (including tables, formulas, and multi-column layouts) while maintaining a minimalist structure.

2. Granite-docling-258M: A lightweight multimodal document processing model

* Run online: https://go.hyper.ai/BBXlC

* Step-by-step tutorial: Redefining the next generation of OCR: IBM's newly open-sourced Granite-docling-258M enables end-to-end unified understanding of "structure + content"

Launched by IBM in September 2025, this lightweight visual language model is designed for efficient document conversion. Containing only 258M parameters, the model offers exceptional performance and cost-effectiveness, supporting multiple languages (including Arabic, Chinese, and Japanese). It converts documents into a machine-readable format while preserving layouts, tables, formulas, and other elements. The DocTags format used accurately describes document structure, preventing information loss.

3. dots.ocr: A multilingual document parsing model

* Run online: https://go.hyper.ai/o0Bm0

* Step-by-step tutorial: Online Tutorial | Breaking through the reliance on structured documents, dots.ocr achieves state-of-the-art OCR performance in hundreds of languages with 1.7B parameters

Released by Xiaohongshu's hi lab in August 2025, this is a multilingual document layout parsing model. Built on a 1.7-billion-parameter VLM, it unifies layout detection and content recognition while maintaining correct reading order. Despite its small size, it achieves state-of-the-art performance, with excellent results on benchmarks such as OmniDocBench. Its formula recognition rivals Doubao-1.5 and Gemini 2.5 Pro, and it shows clear advantages in parsing low-resource languages. The architecture is simple and efficient: switching tasks requires only changing the prompt, and inference is fast, making it suitable for a variety of document parsing scenarios.

4. MonkeyOCR: Document parsing with a structure-recognition-relation paradigm

* Run online: https://go.hyper.ai/2SDMC

* Step-by-step tutorial: With 2.6k stars, MonkeyOCR-3B surpasses a 72B model in English document parsing and reaches SOTA performance

This document parsing model, jointly open-sourced by Huazhong University of Science and Technology and Kingsoft Office, efficiently converts unstructured content into structured information. Relying on precise layout analysis, content recognition, and logical ordering, it significantly improves parsing accuracy and efficiency. Performance improves by an average of 5.1% on complex documents, 15.0% on formula parsing, and 8.6% on table parsing. Its multi-page processing speed reaches 0.84 pages per second, far exceeding similar tools. Supporting a wide range of document types and languages, it is suitable for scenarios such as theses, textbooks, and newspapers, providing strong support for document digitization and automation.

5. GOT-OCR-2.0: The world's first universal end-to-end OCR model

* Run online: https://go.hyper.ai/NGNZi

Jointly developed by StepFun, Megvii Technology, the University of the Chinese Academy of Sciences, and Tsinghua University, this unified end-to-end model, based on universal OCR theory, employs an integrated architecture to significantly improve OCR accuracy and efficiency. The model is both flexible and adaptable, supporting scene text recognition and efficiently processing multi-page documents, making it suitable for a variety of complex application scenarios.

6. InkSight Demo: Digitizing Handwritten Text

* Run online: https://go.hyper.ai/LofxZ

* Step-by-step tutorial: Beyond traditional OCR! One-click deployment of Google's latest achievement InkSight: accurate recognition of handwritten text, handling both Chinese and English with ease

This revolutionary AI technology, launched by Google Research in 2024, mimics the human reading and learning process by continuously rewriting and learning handwritten text, thereby accumulating an understanding of the text's appearance and meaning. Humans can read InkSight-generated text tracings with an accuracy of up to 87%. InkSight maintains strong recognition accuracy even for handwritten text against complex backgrounds, in blurry images, or in low-light conditions.
