Extract-0 Document Information Extraction Dataset

Date: 19 days ago
Size: 55.5 MB
Organization: Inteli
Paper URL: 2509.22906
License: Apache 2.0

Extract-0 is a high-quality training and evaluation dataset for document information extraction, released by Inteli in 2025. It accompanies the paper "Extract-0: A Specialized Language Model for Document Information Extraction" and is intended to support research on optimizing the performance of small-parameter models on complex extraction tasks.

The dataset contains 280,128 document extraction examples derived from 34,761 document chunks. Examples average approximately 532–1900 tokens in length and cover a variety of data structures (objects, arrays, strings, dates, and numbers). The source text was collected from arXiv academic papers, PubMed Central, Wikipedia entries, and the FDA (U.S. Food and Drug Administration) database. Each example consists of an original document fragment, its corresponding schema-based extraction task, and the structured output, providing a unified extraction training standard across multiple domains and formats.
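
To make the three-part example structure concrete, the following is a minimal sketch of what one record might look like and how an output could be checked against its schema. The field names (`document`, `schema`, `output`), the type labels, and the sample text are all hypothetical and invented for illustration; they are not taken from the dataset files, whose actual layout may differ.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ExtractionExample:
    """Hypothetical record layout; the field names are illustrative only."""
    document: str            # original document fragment
    schema: dict[str, Any]   # extraction task: field name -> expected type label
    output: dict[str, Any]   # structured extraction result


# Assumed mapping from illustrative type labels to Python types.
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "array": list,
    "object": dict,
    "date": str,  # assumption: dates are stored as ISO-8601 strings
}


def conforms(example: ExtractionExample) -> bool:
    """Return True if the output has exactly the schema's fields with matching types."""
    if set(example.output) != set(example.schema):
        return False
    return all(
        isinstance(example.output[name], TYPE_MAP[type_name])
        for name, type_name in example.schema.items()
    )


# Invented sample record, not drawn from the dataset.
example = ExtractionExample(
    document="Aspirin 81 mg tablets were approved on 2015-03-02.",
    schema={"drug": "string", "dose_mg": "number", "approval_date": "date"},
    output={"drug": "Aspirin", "dose_mg": 81, "approval_date": "2015-03-02"},
)

# Ratio implied by the counts stated above: examples per document chunk.
examples_per_chunk = 280_128 / 34_761

print(conforms(example))
print(round(examples_per_chunk, 2))
```

The ratio computed at the end follows directly from the stated counts: each document chunk yields about eight extraction examples on average.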

Extract-0.torrent
  • Extract-0/
    • README.md (1.67 KB)
    • README.txt (3.34 KB)
    • data/
      • Extract-0.zip (55.5 MB)
