
EleutherAI Unveils 8TB Open-Domain Dataset to Boost Transparent AI Model Training

a day ago

EleutherAI, an AI research organization, has released what it describes as one of the largest collections of licensed and open-domain text for training AI models. Named The Common Pile v0.1, the dataset took roughly two years to compile and was built in collaboration with AI startups Poolside and Hugging Face, along with several academic institutions. Weighing in at 8 terabytes, it was used to train two new EleutherAI models, Comma v0.1-1T and Comma v0.1-2T, both of which the organization says match the performance of models trained on unlicensed, copyrighted data.

In recent years, AI companies, including giants like OpenAI, have faced lawsuits over their data-sourcing practices. These companies often scrape the web, including copyrighted material such as books and research journals, to build training datasets. Many AI firms argue that the U.S. legal doctrine of fair use shields them from liability, but that stance has drawn increasing scrutiny and legal challenges. According to EleutherAI, the lawsuits have sharply reduced transparency among AI companies, making it harder for researchers to understand how these models work and to identify their limitations.

Stella Biderman, EleutherAI's executive director, highlighted the issue in a recent blog post on Hugging Face's platform. "Lawsuits have not meaningfully changed data sourcing practices, but they have drastically reduced the transparency companies engage in," she wrote. "Researchers we've spoken to at some companies specifically mentioned lawsuits as a barrier to releasing their findings in data-intensive areas."

The Common Pile v0.1 is available for download from Hugging Face's AI development platform and from GitHub. Legal experts were consulted during its creation to ensure compliance with copyright law. Sources include 300,000 public-domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open-source speech-to-text model, to transcribe audio content into text.

EleutherAI presents Comma v0.1-1T and Comma v0.1-2T, both 7-billion-parameter models, as evidence that the Common Pile v0.1 was carefully curated. Despite being trained on only a fraction of the dataset, the models perform comparably to Meta's first Llama model on benchmarks covering coding, image understanding, and math. Parameters, also called weights, are the internal variables of an AI model that determine its behavior and output.

Biderman emphasized in her post that the belief that unlicensed text is essential for high performance is unfounded. She argued, "As the volume of accessible openly licensed and public domain data increases, we can anticipate significant improvements in the quality of models trained on openly licensed content."

The Common Pile v0.1 also represents a step toward rectifying EleutherAI's past practices. The organization previously released The Pile, an open collection of training text that included copyrighted material; many AI companies have since used it to develop their models and have faced legal repercussions for doing so. Going forward, EleutherAI says it will release open datasets more frequently, working closely with its research and infrastructure partners, with the aim of fostering greater transparency and innovation in the AI community while navigating the complex landscape of copyright law.
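For readers who want to explore the release, datasets published on Hugging Face can typically be pulled through the `datasets` library. The sketch below streams a few documents rather than downloading the full 8 TB corpus; the repository identifier and the "text" field are illustrative assumptions, so check EleutherAI's Hugging Face organization for the actual dataset names and schema.

```python
# Minimal sketch: stream a few documents from the Common Pile on Hugging Face.
# NOTE: "EleutherAI/common-pile-v0.1" and the "text" field are illustrative
# assumptions -- confirm the real repository name and schema on Hugging Face.
from datasets import load_dataset

ds = load_dataset(
    "EleutherAI/common-pile-v0.1",  # hypothetical dataset ID
    split="train",
    streaming=True,  # avoid downloading the full 8 TB corpus
)

for i, example in enumerate(ds):
    print(example.get("text", "")[:200])  # preview the first 200 characters
    if i >= 4:
        break
```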
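The article also notes that EleutherAI used OpenAI's open-source Whisper model to turn audio sources into text. A minimal sketch of that kind of transcription step with the open-source `openai-whisper` package could look like the following; the model size and file path are placeholders, not details of EleutherAI's actual pipeline.

```python
# Minimal sketch: transcribe an audio file to text with open-source Whisper.
# Requires: pip install openai-whisper (and ffmpeg available on the system).
# "base" and "lecture.mp3" are placeholder choices for illustration only.
import whisper

model = whisper.load_model("base")        # small multilingual checkpoint
result = model.transcribe("lecture.mp3")  # returns a dict with segments and text
print(result["text"])                     # the full transcript as one string
```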
