Institutional Books 1.0 Book Dataset
Paper URL: https://arxiv.org/abs/2506.08300
Institutional Books 1.0 is a growing corpus of public domain books to be released by Harvard University in 2025. The accompanying paper is "Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability". The dataset consists of 983,004 public domain books in 254 languages, published mainly in the 19th and 20th centuries. It contains 242 billion tokens across 386 million pages of text and is available in both original and post-processed OCR export formats.
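To give a sense of scale, the headline figures above can be turned into rough per-book and per-page averages. The sketch below is only back-of-the-envelope arithmetic on the rounded totals quoted in the description; the function name `corpus_averages` and its dictionary keys are illustrative and not part of the dataset or any tooling around it.

```python
# Rounded totals as quoted in the dataset description (not exact counts).
TOTAL_BOOKS = 983_004
TOTAL_TOKENS = 242_000_000_000  # "242 billion tokens"
TOTAL_PAGES = 386_000_000       # "386 million pages"


def corpus_averages(books: int, tokens: int, pages: int) -> dict[str, float]:
    """Return simple mean-based ratios derived from corpus-level totals."""
    return {
        "tokens_per_book": tokens / books,
        "pages_per_book": pages / books,
        "tokens_per_page": tokens / pages,
    }


averages = corpus_averages(TOTAL_BOOKS, TOTAL_TOKENS, TOTAL_PAGES)
for name, value in averages.items():
    print(f"{name}: {value:,.1f}")
```

Under these rounded inputs, the averages come out to roughly 246 thousand tokens and about 393 pages per book, or around 627 tokens per page. These are means over the whole corpus and say nothing about how book lengths are distributed.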
Citation
@misc{cargnelutti2025institutionalbooks10242b,
  title={Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability},
  author={Matteo Cargnelutti and Catherine Brobston and John Hess and Jack Cushman and Kristi Mukk and Aristana Scourtas and Kyle Courtney and Greg Leppert and Amanda Watson and Martha Whitehead and Jonathan Zittrain},
  year={2025},
  eprint={2506.08300},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.08300},
}