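Institutional Books 1.0 is a growing corpus of public domain books released by Harvard University in 2025. The accompanying paper is "Institutional Books 1.0: A 242B token dataset from Harvard Library’s collections, refined for accuracy and usability". The dataset consists of 983,004 public domain books in 254 languages, published mainly in the 19th and 20th centuries. It contains 242 billion tokens and 386 million pages of text, and is available in two formats: the original OCR export and a post-processed OCR export.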
Citation
@misc{cargnelutti2025institutionalbooks10242b,
  title={Institutional Books 1.0: A 242B token dataset from Harvard Library’s collections, refined for accuracy and usability},
  author={Matteo Cargnelutti and Catherine Brobston and John Hess and Jack Cushman and Kristi Mukk and Aristana Scourtas and Kyle Courtney and Greg Leppert and Amanda Watson and Martha Whitehead and Jonathan Zittrain},
  year={2025},
  eprint={2506.08300},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.08300},
}