HyperAI

MonkeyOCR: Document Parsing Based on the Structure-Recognition-Relation Triplet Paradigm

1. Tutorial Introduction

MonkeyOCR is a document parsing model open-sourced by Huazhong University of Science and Technology and Kingsoft Office on June 5, 2025. It converts unstructured document content into structured information by combining precise layout analysis, content recognition, and logical ordering, which significantly improves both the accuracy and the efficiency of document parsing. Compared with traditional methods, MonkeyOCR is particularly strong on complex documents such as those containing formulas and tables: it improves average performance by 5.1%, with gains of 15.0% on formula parsing and 8.6% on table parsing. It is also fast on multi-page documents, reaching 0.84 pages per second, well ahead of comparable tools. MonkeyOCR supports a variety of document types, including academic papers, textbooks, and newspapers, and works across multiple languages, providing strong support for document digitization and automated processing. The accompanying paper is "MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm".

Key features:

  • Document parsing and structuring: Converts unstructured content (text, tables, formulas, images, and so on) in documents of various formats, such as PDFs and images, into structured, machine-readable information.
  • Multi-language support: Supports multiple languages, including Chinese and English.
  • Efficient handling of complex documents: Performs well on complex documents, such as those containing formulas, tables, or multi-column layouts.
  • Fast multi-page document processing: Processes multi-page documents at 0.84 pages per second, significantly faster than other tools (for example, MinerU at 0.65 pages per second and Qwen2.5-VL-7B at 0.12 pages per second).
  • Flexible deployment and expansion: Can be deployed efficiently on a single NVIDIA RTX 3090 GPU and scaled to meet needs of different sizes.
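
To put the throughput figures above in perspective, the following sketch estimates how long each tool would need for a document of a given length. The rates are the ones quoted in the list above; the 100-page document and the function name `estimated_seconds` are illustrative assumptions, and real timings depend on hardware and document content.

```python
# Throughput figures (pages per second) quoted in the feature list above.
PAGES_PER_SECOND = {
    "MonkeyOCR": 0.84,
    "MinerU": 0.65,
    "Qwen2.5-VL-7B": 0.12,
}


def estimated_seconds(num_pages: int, pages_per_second: float) -> float:
    """Estimate the time in seconds needed to parse num_pages pages."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return num_pages / pages_per_second


if __name__ == "__main__":
    num_pages = 100  # hypothetical document length, chosen for illustration
    for tool, rate in PAGES_PER_SECOND.items():
        seconds = estimated_seconds(num_pages, rate)
        print(f"{tool}: about {seconds:.0f} s for {num_pages} pages")
```

At these rates, a 100-page document works out to roughly two minutes for MonkeyOCR, compared with about fourteen minutes for Qwen2.5-VL-7B.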

Technical principle:

  • Structure-Recognition-Relation (SRR) triplet paradigm: First, a YOLO-based document layout detector identifies the location and category of key elements in a document (text blocks, tables, formulas, images, and so on). Next, each detected region is recognized end to end by a large multimodal model (LMM) to ensure high accuracy. Finally, a block-level reading order prediction mechanism determines the logical relationships between the detected elements and reconstructs the semantic structure of the document.
  • MonkeyDoc dataset: MonkeyDoc is the most comprehensive document parsing dataset to date, containing 3.9 million instances that cover more than ten document types in Chinese and English. The dataset is built with a multi-stage pipeline that integrates careful manual annotation, programmatic synthesis, and model-driven automatic annotation. It is used to train and evaluate MonkeyOCR models, ensuring strong generalization across diverse and complex document scenarios.
  • Model optimization and deployment: Training uses the AdamW optimizer with cosine learning rate scheduling on large-scale datasets to balance model accuracy and efficiency. With the LMDeploy tool, MonkeyOCR runs efficiently on a single NVIDIA RTX 3090 GPU, supporting fast inference and large-scale deployment.
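
The three SRR stages described above can be pictured as a small pipeline. The sketch below is not MonkeyOCR's code or API: `Block`, `parse_page`, and the `toy_*` functions are hypothetical names introduced for illustration, and simple stand-ins replace the YOLO detector, the LMM recognizer, and the learned reading order predictor. In particular, the top-to-bottom, left-to-right sort is only a heuristic placeholder for the relation stage.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


@dataclass
class Block:
    bbox: BBox
    category: str      # e.g. "text", "table", "formula", "image"
    content: str = ""  # filled in by the recognition stage


def parse_page(
    page: Any,
    detect: Callable[[Any], List[Block]],
    recognize: Callable[[Any, Block], str],
    predict_order: Callable[[List[Block]], List[int]],
) -> List[Block]:
    """Run structure, recognition, and relation in sequence on one page."""
    blocks = detect(page)                       # 1. structure: where and what
    for block in blocks:
        block.content = recognize(page, block)  # 2. recognition: region content
    order = predict_order(blocks)               # 3. relation: reading order
    return [blocks[i] for i in order]


# Toy stand-ins (assumptions): a "page" is a dict from bbox to (category, text).
ToyPage = Dict[BBox, Tuple[str, str]]


def toy_detect(page: ToyPage) -> List[Block]:
    return [Block(bbox=bbox, category=cat) for bbox, (cat, _text) in page.items()]


def toy_recognize(page: ToyPage, block: Block) -> str:
    return page[block.bbox][1]


def toy_predict_order(blocks: List[Block]) -> List[int]:
    # Heuristic placeholder: sort by top edge, then by left edge.
    return sorted(range(len(blocks)), key=lambda i: (blocks[i].bbox[1], blocks[i].bbox[0]))


TOY_PAGE: ToyPage = {
    (50.0, 300.0, 550.0, 400.0): ("table", "| a | b |"),
    (50.0, 40.0, 550.0, 80.0): ("text", "Title"),
    (50.0, 100.0, 550.0, 280.0): ("formula", "E = mc^2"),
}

if __name__ == "__main__":
    for block in parse_page(TOY_PAGE, toy_detect, toy_recognize, toy_predict_order):
        print(f"{block.category}: {block.content}")
```

Passing the three stages in as callables keeps them independent, which mirrors the idea that each stage can be improved or swapped without touching the others.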

This tutorial runs on a single RTX 4090 GPU.

2. Example Results

Formula Document Example

7jVLgB.jpg

Table Document Example

7jcOaa.png

Newspaper Example

7jcP5V.png

Financial Report Example

7jc10I.png
7jcRCL.png

3. Operation steps

1. Start the container

2. Usage steps

If "Bad Gateway" is displayed, the model is still initializing. Because the model is large, wait about 2-3 minutes and then refresh the page.
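
Instead of refreshing by hand, the wait can be scripted. The sketch below polls a URL until it stops answering with HTTP 502 (Bad Gateway). It uses only the Python standard library; the function names, the 15-second interval, and the 5-minute limit are assumptions for illustration, and the URL must be replaced with the address of the page opened from your own container.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 502 while the model is still loading


def wait_until_ready(
    url: str,
    get_status: Callable[[str], int] = http_status,
    interval: float = 15.0,   # seconds between checks (assumed value)
    max_wait: float = 300.0,  # give up after this many seconds (assumed value)
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until it no longer returns 502; return False on timeout."""
    waited = 0.0
    while True:
        if get_status(url) != 502:
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Usage (replace the placeholder with your own page address):
#   ready = wait_until_ready("<address of the page opened from the container>")
```

Only a 502 response is treated as "still initializing"; connection errors and other failures are left to propagate so that a genuinely broken deployment is not mistaken for a slow start.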

4. Discussion

🖌️ If you come across a high-quality project, leave us a message to recommend it! We have also set up a tutorial discussion group. Scan the QR code and add the note [SD Tutorial] to join the group, discuss technical issues, and share your results.

Citation Information

The citation information for this project is as follows:

@misc{li2025monkeyocrdocumentparsingstructurerecognitionrelation,
      title={MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm}, 
      author={Zhang Li and Yuliang Liu and Qiang Liu and Zhiyin Ma and Ziyang Zhang and Shuo Zhang and Zidun Guo and Jiarui Zhang and Xinyu Wang and Xiang Bai},
      year={2025},
      eprint={2506.05218},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.05218}, 
}