olmOCR-mix-0225 Large-scale PDF Document Dataset
License: CC BY 4.0
olmOCR-mix-0225 is a large-scale, high-quality PDF document dataset designed for training and optimizing optical character recognition (OCR) models. It was released by the Allen Institute for AI in 2025; the associated paper is "olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models".
Dataset characteristics
The dataset contains about 250k pages of PDF content, covering academic papers, legal documents, manuals, and other document types. Beyond the text itself, it records the coordinates of salient elements on each page, such as text blocks and images. These coordinates are dynamically injected into the model prompt, which significantly reduces the model's hallucinations. The dataset can be used to train, fine-tune, or evaluate your own OCR document processing pipeline.
In addition, the dataset was annotated using GPT-4o to keep the annotations consistent and of high quality. The source data is drawn from a wide range of sources, including PDF documents crawled from public websites and books from the Internet Archive.
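To make the idea of injecting coordinate information into the prompt more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the dataset's actual schema or the olmOCR prompt format: the `PageElement` class, the `build_anchored_prompt` function, the `RAW_LAYOUT_START`/`RAW_LAYOUT_END` markers, and the instruction wording are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageElement:
    """A salient element on a page (hypothetical schema for illustration)."""
    kind: str          # "text" or "image"
    x0: float          # left edge of the bounding box
    y0: float          # bottom edge of the bounding box
    x1: float          # right edge of the bounding box
    y1: float          # top edge of the bounding box
    text: str = ""     # raw text, empty for images


def build_anchored_prompt(elements: List[PageElement],
                          page_width: float,
                          page_height: float,
                          max_chars: int = 4000) -> str:
    """Serialize element coordinates into a layout hint and embed it in a prompt.

    Elements are added one line at a time until the character budget
    (max_chars) would be exceeded, so a hint is never cut mid-line.
    """
    lines = [f"Page dimensions: {page_width:.1f}x{page_height:.1f}"]
    used = len(lines[0])
    for el in elements:
        if el.kind == "image":
            line = f"[Image {el.x0:.0f}x{el.y0:.0f} to {el.x1:.0f}x{el.y1:.0f}]"
        else:
            line = f"[{el.x0:.0f}x{el.y0:.0f}]{el.text}"
        if used + len(line) + 1 > max_chars:  # +1 accounts for the newline
            break
        lines.append(line)
        used += len(line) + 1

    layout_hint = "\n".join(lines)
    # Hypothetical instruction text; a real pipeline would use its own wording.
    return (
        "Transcribe the attached page image into plain text. "
        "Use the raw layout hints below only as a guide.\n"
        "RAW_LAYOUT_START\n"
        f"{layout_hint}\n"
        "RAW_LAYOUT_END"
    )


elements = [
    PageElement("text", 72, 720, 300, 740, "Introduction"),
    PageElement("image", 50, 400, 300, 600),
]
prompt = build_anchored_prompt(elements, page_width=612, page_height=792)
print(prompt)
```

The resulting string would be sent to the vision language model together with the rendered page image. The intent is that the model can check its transcription against text and positions that were extracted directly from the PDF, rather than relying on the image alone.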