HyperAI

olmOCR-mix-0225 Large-scale PDF Document Dataset

Date

4 months ago

Size

52.16 GB

Organization

Allen Institute for Artificial Intelligence

Publish URL

github.com

License

CC BY 4.0

olmOCR-mix-0225 is a large-scale, high-quality PDF document dataset designed for training and optimizing optical character recognition (OCR) models. The dataset was released by the Allen Institute for AI in 2025. The associated paper is "olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models".

Dataset characteristics

The dataset contains about 250,000 pages of PDF content, covering academic papers, legal documents, manuals, and other document types. Beyond the text itself, it also includes the coordinates of salient elements (such as text blocks and images) on each page. This information is dynamically injected into the model prompt, significantly reducing the model's hallucinations. The dataset can be used to train, fine-tune, or evaluate your own OCR document processing pipeline.
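As an illustration of how element coordinates might be injected into a prompt, here is a minimal sketch. The `PageElement` class, the `build_anchor_prompt` function, the bracketed coordinate format, and the instruction wording are all hypothetical and chosen only for this example; they are not taken from the dataset or from the olmOCR code.

```python
from dataclasses import dataclass


@dataclass
class PageElement:
    """A salient element on a page (hypothetical structure for this sketch)."""
    kind: str          # "text" or "image"
    x: float           # horizontal position on the page
    y: float           # vertical position on the page
    content: str = ""  # raw text for text blocks, empty for images


def build_anchor_prompt(elements, page_width, page_height, max_chars=4000):
    """Build a prompt that lists each element together with its coordinates.

    The layout below is an assumed format, not the one used by olmOCR.
    """
    lines = [f"Page dimensions: {page_width:.1f}x{page_height:.1f}"]
    for el in elements:
        if el.kind == "image":
            lines.append(f"[Image {el.x:.0f}x{el.y:.0f}]")
        else:
            lines.append(f"[{el.x:.0f}x{el.y:.0f}]{el.content}")
    # Cap the injected text so a dense page cannot crowd out the instruction.
    anchor = "\n".join(lines)[:max_chars]
    instruction = (
        "Transcribe the attached page image into plain text. "
        "Use the raw layout hints below only as a guide."
    )
    return instruction + "\n" + anchor


elements = [
    PageElement("text", 72, 700, "Introduction"),
    PageElement("image", 72, 400),
]
print(build_anchor_prompt(elements, 612, 792))
```

Giving the model the positions and raw text of page elements provides something concrete to check its transcription against, which is the intuition behind the reduced hallucinations described above.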

In addition, the dataset was annotated using GPT-4o to ensure high-quality, consistent annotations. The data comes from a wide range of sources, including PDF documents crawled from public websites and books from the Internet Archive.

olmOCR-mix-0225.torrent
Seeding 2, Downloading 0, Completed 113, Total Downloads 123
  • olmOCR-mix-0225/
    • README.md
      1.87 KB
    • README.txt
      3.73 KB
    • data/
      • olmOCR-mix-0225.zip
        52.16 GB
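Because the archive is large, it can be useful to inspect its contents before extracting it. The sketch below uses only Python's standard `zipfile` module. The helper name `summarize_archive` is hypothetical, the path assumes the torrent was downloaded into the current directory with the layout listed above, and nothing is assumed about the files inside the archive.

```python
import zipfile
from pathlib import Path


def summarize_archive(zip_path, limit=10):
    """Return (entry count, total uncompressed bytes, first `limit` names)."""
    with zipfile.ZipFile(zip_path) as zf:
        infos = zf.infolist()
        total_bytes = sum(info.file_size for info in infos)
        names = [info.filename for info in infos[:limit]]
    return len(infos), total_bytes, names


# Assumed location, based on the file listing above.
archive = Path("olmOCR-mix-0225/data/olmOCR-mix-0225.zip")
if archive.exists():
    count, total_bytes, names = summarize_archive(archive)
    print(f"{count} entries, {total_bytes / 1e9:.2f} GB uncompressed")
    for name in names:
        print(" ", name)
else:
    print(f"Archive not found at {archive}")
```

Reading the central directory this way does not decompress any data, so it is quick even for an archive of this size.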