Date

a year ago

Size

52.16 GB

Organization

Publish URL

github.com

Paper URL

arxiv.org

License

CC BY 4.0

Dataset characteristics

The dataset contains about 250k pages of PDF content covering academic papers, legal documents, manuals, and other document types. The data comes from a wide range of sources, including PDF documents crawled from public websites and books from the Internet Archive. Beyond the text itself, the dataset records the coordinates of salient elements (such as text blocks and images) on each page. This information is dynamically injected into the model prompt, which significantly reduces the model's hallucinations. The annotations were produced with GPT-4o to ensure high quality and consistency. The dataset can be used to train, fine-tune, or evaluate your own OCR document processing pipeline.
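To illustrate what injecting element coordinates into a model prompt can look like, here is a minimal sketch. It is not the dataset's actual schema or prompt format: the `PageElement` class, the `build_prompt` function, the field names, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageElement:
    """One salient element on a page (hypothetical schema, not the dataset's)."""
    kind: str          # "text" or "image"
    x: float           # horizontal position in page units
    y: float           # vertical position in page units
    content: str = ""  # extracted text; empty for images


def build_prompt(page_width, page_height, elements):
    """Build a transcription prompt that lists where each element sits on the page."""
    lines = [
        f"Transcribe this page. Layout hints (page size {page_width:.0f}x{page_height:.0f}):"
    ]
    for el in elements:
        if el.kind == "image":
            # Images carry no text, so only their position is reported.
            lines.append(f"[image @ ({el.x:.0f}, {el.y:.0f})]")
        else:
            lines.append(f"[text @ ({el.x:.0f}, {el.y:.0f})] {el.content}")
    return "\n".join(lines)


elements = [
    PageElement("text", 72, 700, "Introduction"),
    PageElement("image", 100, 400),
]
prompt = build_prompt(612, 792, elements)
print(prompt)
```

The idea behind such hints is that the model receives the page layout alongside the page image, so it has less room to invent text that is not on the page.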

olmOCR-mix-0225.torrent

Seeding: 1 | Downloading: 0 | Completed: 279 | Total Downloads: 415

olmOCR-mix-0225/
- README.md
  1.87 KB
- README.txt
  3.73 KB

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.

Related Datasets

Creative Professionals Creative Task Instruction Dataset

2 months ago

LightOnOCR-mix-0126 Text Transcription Dataset

5 months ago

TransPhy3D Transparent Reflection Synthesis Video Dataset

5 months ago

MCIF Multimodal Cross-Language Instruction Following Dataset

6 months ago

MCD-rPPG Multi-Camera Remote Photoplethysmography Dataset

6 months ago

olmOCR-mix-0225 Large-scale PDF Document Dataset
