olmOCR-mix-1025 Document Recognition Dataset

Date

13 hours ago

Organization

Allen Institute for Artificial Intelligence

Paper URL

2502.18443

License

Other

olmOCR-mix-1025 is a large-scale, high-quality PDF document OCR dataset released by the Allen Institute for AI in 2025. The related paper is titled "olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models". The dataset is intended to support the training, fine-tuning, and evaluation of optical character recognition (OCR) models, document understanding models, and multimodal large models.

This dataset contains approximately 270,250 pages of PDF documents, with 267,962 pages in the training set and 2,288 pages in the evaluation set. It covers a variety of document types, including academic papers, archival documents, scanned book texts, and historical manuscripts. English dominates every subset, accounting for roughly 91% to 99% of pages depending on the subset; the remainder includes small amounts of Spanish, French, German, Italian, Latin, and Indonesian, among others.

Dataset distribution

  • 00_documents (General Documents): 232,790 pages in total (231,668 training / 1,122 evaluation), with the following language distribution: English 94.46%, Spanish 0.58%, French 0.46%, Indonesian 0.45%, and German 0.42%.
  • 01_books (Books): 17,474 pages in total (16,575 training / 899 evaluation), with the following language distribution: English 91.28%, French 0.54%, Latin 0.31%, German 0.27%, and Hindi 0.12%.
  • 02_loc_transcripts (Library of Congress Transcripts): 9,989 pages in total (9,891 training / 98 evaluation), with the following language distribution: English 98.21%, Spanish 0.59%, French 0.46%, German 0.45%, and Italian 0.11%.
  • 03_national_archives (National Archives): 9,997 pages in total (9,828 training / 169 evaluation), with the following language distribution: English 99.82%, Spanish 0.12%, French 0.02%, Swedish 0.01%, and German 0.01%.
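
As a quick sanity check, the per-subset page counts listed above can be summed and compared with the overall training and evaluation totals stated earlier. The sketch below is only illustrative: the dictionary and the helper names `split_totals` and `eval_share` are hypothetical and not part of the dataset or any library; the numbers are copied from this page.

```python
# Page counts per subset as (training, evaluation), copied from the list above.
# The names SUBSETS, split_totals and eval_share are illustrative only.
SUBSETS = {
    "00_documents": (231_668, 1_122),
    "01_books": (16_575, 899),
    "02_loc_transcripts": (9_891, 98),
    "03_national_archives": (9_828, 169),
}


def split_totals(subsets):
    """Return (training pages, evaluation pages, all pages) across subsets."""
    train = sum(t for t, _ in subsets.values())
    evaluation = sum(e for _, e in subsets.values())
    return train, evaluation, train + evaluation


def eval_share(subsets):
    """Return the fraction of each subset's pages held out for evaluation."""
    return {name: e / (t + e) for name, (t, e) in subsets.items()}


if __name__ == "__main__":
    train, evaluation, total = split_totals(SUBSETS)
    print(train, evaluation, total)  # 267962 2288 270250
    for name, share in eval_share(SUBSETS).items():
        print(f"{name}: {share:.2%} of pages held out for evaluation")
```

The sums match the 267,962 training pages, 2,288 evaluation pages, and 270,250 total pages quoted in the overview, so the subset breakdown is internally consistent.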

Compared to the previous version, olmOCR-mix-0225, olmOCR-mix-1025 further improves annotation quality and document coverage. This version uses GPT-4.1 and an improved prompting strategy to generate the OCR annotations, making the text reading order more consistent with the original layout and preserving the structure of born-digital content. In addition, mathematical formulas in the dataset have been standardized, tables are represented in HTML, and basic image alt text has been added. Samples of books, archives, and handwritten documents have also been added, making the dataset better suited to training robust models for document-processing scenarios.