
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf
Abstract

Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.
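The abstract only outlines the rebalancing idea, so the sketch below is an illustrative interpretation rather than the paper's actual formula: it assumes a hypothetical scheme in which a deduplicated document is upsampled in proportion to how many copies of it were removed, capped at a maximum factor, and only when it also clears a quality threshold. All names here (Document, upsampling_factor, rebalance, max_factor, quality_threshold) are invented for illustration.

# Minimal sketch of duplication- and quality-aware rebalancing (assumed scheme,
# not the paper's published method).
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    duplicate_count: int   # copies removed during deduplication (hypothetical field)
    quality_score: float   # score from a quality filter, assumed to lie in [0, 1]

def upsampling_factor(doc: Document,
                      max_factor: int = 5,
                      quality_threshold: float = 0.5) -> int:
    """Hypothetical weight: low-quality documents are never upsampled;
    widely duplicated, high-quality documents are repeated up to max_factor times."""
    if doc.quality_score < quality_threshold:
        return 1
    return min(doc.duplicate_count, max_factor)

def rebalance(corpus: list[Document]) -> list[Document]:
    """Materialize the rebalanced corpus by repeating each kept document."""
    rebalanced: list[Document] = []
    for doc in corpus:
        rebalanced.extend([doc] * upsampling_factor(doc))
    return rebalanced

if __name__ == "__main__":
    corpus = [
        Document("widely mirrored, high-quality page", duplicate_count=40, quality_score=0.9),
        Document("widely mirrored, low-quality page", duplicate_count=40, quality_score=0.2),
        Document("unique page", duplicate_count=1, quality_score=0.8),
    ]
    print([upsampling_factor(d) for d in corpus])  # -> [5, 1, 1]

Under these assumptions, duplication count acts as a weak popularity signal while the quality score gates it, which is one plausible reading of "takes into consideration both duplication count and quality."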
