Extract-0 Document Information Extraction Dataset
License: Apache 2.0
Extract-0 is a high-quality training and evaluation dataset for document information extraction tasks, released by Inteli in 2025. The accompanying paper is "Extract-0: A Specialized Language Model for Document Information Extraction". The dataset is intended to support research on optimizing the performance of small-parameter models on complex extraction tasks.
This dataset contains 280,128 document extraction examples derived from 34,761 document chunks. Examples average roughly 532 to 1,900 tokens in length and cover a variety of data structures (such as objects, arrays, strings, dates, and numbers). The source text was collected from arXiv academic papers, PubMed Central, Wikipedia entries, and the FDA (U.S. Food and Drug Administration) database. Each example consists of an original document fragment, its corresponding schema-based extraction task, and the structured output, providing a unified extraction training standard across multiple domains and formats.
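To make the three-part example structure concrete, here is a minimal Python sketch of what one record might look like and how its structured output could be checked against its schema. The field names (`document`, `schema`, `output`), the schema type labels, and the sample text are all assumptions made for illustration; they are not taken from the dataset's actual format.

```python
from datetime import date

# Hypothetical record: field names and contents are illustrative assumptions,
# not the dataset's real column names or data.
example = {
    "document": "The trial enrolled 120 patients and started on 2021-03-01.",
    "schema": {"patient_count": "number", "start_date": "date"},
    "output": {"patient_count": 120, "start_date": "2021-03-01"},
}


def is_iso_date(value):
    """Return True if value is a string in ISO format (YYYY-MM-DD)."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# Map each assumed schema type label to a predicate on Python values.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date": is_iso_date,
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}


def matches_schema(output, schema):
    """Check that output has exactly the schema's keys with matching types."""
    if set(output) != set(schema):
        return False
    return all(TYPE_CHECKS[schema[key]](output[key]) for key in schema)


print(matches_schema(example["output"], example["schema"]))
```

A check of this kind could be used to filter malformed model outputs during evaluation, though the dataset's real schemas may be richer (nested objects, typed arrays) than this flat sketch handles.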