NVIDIA Enhances NeMo Curator with Nemotron-CC Pipeline for High-Quality, Scalable LLM Pretraining Data
NVIDIA's NeMo Curator team has announced the integration of the Nemotron-CC data curation pipeline into the NeMo Curator GitHub repository, marking a significant advancement in the creation of high-quality, large-scale datasets for training large language models (LLMs). NVIDIA previously released Nemotron-CC, a 6.3-trillion-token English-language dataset sourced from Common Crawl (CC). Now the pipeline that created this dataset is publicly available, offering developers a robust toolset for balancing the trade-off between data accuracy and quantity.

The Nemotron-CC pipeline addresses the limitations of traditional data curation methods, which often discard valuable content because heuristic filtering cannot assess semantic quality. This results in suboptimal datasets and accuracy plateaus, particularly on complex reasoning tasks such as the Massive Multitask Language Understanding (MMLU) benchmark. By incorporating advanced techniques such as classifier ensembling and synthetic data rephrasing, Nemotron-CC aims to produce more accurate and extensive datasets.

Key Stages of the Nemotron-CC Pipeline

HTML-to-Text Extraction and Filtering
- Extraction: Uses the jusText library to parse HTML and extract text.
- Language identification: Uses FastText to identify English-language data and normalize Unicode characters.

Deduplication
- Employs both exact and fuzzy deduplication to eliminate duplicate and near-duplicate documents. The exact deduplication module uses hashing, while the fuzzy deduplication module leverages MinHash signatures and locality-sensitive hashing (LSH) to detect documents with high similarity scores.

Heuristic Filtering
- Applies 28 distinct filters covering non-alphanumeric content, numerical and URL ratios, whitespace inconsistencies, and n-gram statistics, ensuring that only high-quality, relevant data is retained.
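The deduplication stage described above can be sketched in plain Python. Everything here (5-character shingles, 64 hash permutations, 16 LSH bands) is a hypothetical simplification of what the pipeline's GPU-accelerated modules do at scale:

```python
import hashlib
import random

def exact_key(doc: str) -> str:
    # Exact dedup: identical normalized texts hash to the same key.
    return hashlib.md5(doc.strip().lower().encode()).hexdigest()

def _shingle_hash(s: str) -> int:
    # Stable 32-bit hash of a character shingle (process-independent).
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def minhash_signature(doc: str, num_hashes: int = 64, seed: int = 0) -> list:
    # MinHash over character 5-gram shingles; assumes doc has >= 5 chars.
    shingles = {doc[i:i + 5] for i in range(len(doc) - 4)}
    rng = random.Random(seed)
    masks = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(_shingle_hash(s) ^ m for s in shingles) for m in masks]

def lsh_bands(sig: list, bands: int = 16) -> list:
    # Split the signature into bands; documents sharing any one band
    # become candidate near-duplicate pairs for closer verification.
    rows = len(sig) // bands
    return [tuple(sig[i * rows:(i + 1) * rows]) for i in range(bands)]
```

Identical documents share every band, while near-duplicates agree on a fraction of bands proportional to their Jaccard similarity, which is what lets LSH surface high-similarity pairs without comparing all documents pairwise.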
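A couple of the heuristic filters can be sketched as simple ratio checks. The filter logic and thresholds below are illustrative only, not the pipeline's actual 28 filters:

```python
def alpha_ratio(text: str) -> float:
    # Fraction of non-whitespace characters that are alphanumeric.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)

def url_ratio(text: str) -> float:
    # Fraction of whitespace-separated tokens that look like URLs.
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(t.startswith(("http://", "https://", "www.")) for t in tokens) / len(tokens)

def passes_heuristics(text: str, min_alpha: float = 0.6, max_url: float = 0.2) -> bool:
    # Hypothetical thresholds; the real pipeline chains 28 such filters.
    return alpha_ratio(text) >= min_alpha and url_ratio(text) <= max_url
```

A document only survives if every filter in the chain passes, which is exactly why purely heuristic pipelines discard borderline documents that a semantic classifier might rescue.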
NVIDIA's RAPIDS libraries (cuDF, cuML, and cuGraph) together with Dask accelerate processing, achieving up to 16 times faster text processing compared to alternative methods.

Model-based Quality Labeling
- Quality classifiers: An ensemble of three quality classifier models (the FastText Quality Classifier, the NeMo Curator FineWeb Mixtral Edu Classifier, and the FineWeb Nemotron-4 Edu Classifier) generates scores for each document.
- Scoring and bucketing: The models produce floating-point scores that are converted to integer categories ranging from 0 (worst quality) to 19 (best quality). These scores are then combined into a single representative score that assigns each document to one of five quality levels. This process benefits from GPU acceleration, enhancing efficiency and performance.

Synthetic Data Generation (SDG)
- Repurposing low-quality documents: Low-quality documents are transformed into useful data by prompting an LLM to rewrite the text in a Wikipedia-like style.
- Expanding high-quality data: High-quality documents are rephrased or condensed using four different LLMs to generate diverse question-answer pairs, distill the text, extract key knowledge, and create organized lists. This expands the dataset with more pretraining tokens while maintaining high accuracy.

Results and Impact

When the Llama 3.1 8B-parameter model was trained on a 1T-token subset of the Nemotron-CC dataset, it showed a 5.6-point improvement in MMLU score over the same model trained on the DCLM dataset. Training over a longer horizon of 15T tokens, 7.2T of which came from Nemotron-CC, yielded a 5-point boost on the MMLU benchmark: a score of 70.3 versus the original Llama's 65.3. These improvements highlight the effectiveness of the Nemotron-CC pipeline in raising the quality of pretraining datasets, which is crucial for advancing LLM capabilities in domain-specific applications such as energy, manufacturing, and chemistry.
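The scoring-and-bucketing step of the model-based quality labeling stage described earlier might look like the following sketch. The score range, the max-combiner, and the level boundaries are assumptions for illustration, not the pipeline's documented choices:

```python
def bucket(score: float, lo: float = 0.0, hi: float = 1.0, n_buckets: int = 20) -> int:
    # Map a floating-point classifier score to an integer category
    # from 0 (worst quality) to 19 (best quality).
    frac = (score - lo) / (hi - lo)
    return max(0, min(n_buckets - 1, int(frac * n_buckets)))

def ensemble_score(scores: list) -> int:
    # Combine per-classifier buckets into one representative score.
    # Taking the maximum is an assumed combiner, not a documented one.
    return max(bucket(s) for s in scores)

def quality_level(combined: int) -> int:
    # Collapse the 0-19 range into five quality levels, 0 (low) .. 4 (high).
    return combined // 4
```

In the real pipeline this arithmetic runs over billions of documents, which is why the stage is GPU-accelerated rather than looped in Python.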
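The synthetic data generation stage described earlier amounts to prompt construction around an LLM call. The prompt wording and task names below are hypothetical stand-ins, not the actual Nemotron-CC prompts:

```python
def build_rephrase_prompt(document: str) -> str:
    # Low-quality path: ask an LLM to rewrite the text Wikipedia-style.
    # Prompt wording is illustrative only.
    return (
        "Rewrite the following text as a clear, factual, Wikipedia-style "
        "article. Preserve every fact; drop boilerplate and repetition.\n\n"
        f"Text:\n{document}\n\nRewritten article:"
    )

# High-quality path: the four rephrasing tasks applied to good documents.
# Task keys and instructions are assumed names for illustration.
HIGH_QUALITY_TASKS = {
    "diverse_qa": "Generate diverse question-answer pairs covering the text.",
    "distill": "Condense the text while keeping every key fact.",
    "extract_knowledge": "Rewrite the key knowledge as standalone statements.",
    "knowledge_list": "Organize the text's information into a structured list.",
}

def build_task_prompt(task: str, document: str) -> str:
    # Pair one of the four task instructions with the source document.
    return f"{HIGH_QUALITY_TASKS[task]}\n\nText:\n{document}\n\nOutput:"
```

The resulting prompts would be sent to the respective LLMs, and the generated text, rather than the raw web page, becomes the pretraining token.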
Developers can now leverage this pipeline to create custom datasets for both pretraining and fine-tuning, benefiting from the flexibility and scalability of NeMo Curator.

Industry Evaluation

The integration of the Nemotron-CC pipeline into NeMo Curator is seen as a pivotal development in the AI community. It addresses the growing need for high-quality, large-scale datasets to train increasingly sophisticated AI models. Industry experts praise the pipeline for its innovative approach to data curation, particularly its ability to repurpose and enhance discarded content. This advancement is expected to democratize access to state-of-the-art data preparation tools, fostering innovation and accelerating progress in the field of AI.

NVIDIA's NeMo Curator, already a trusted platform for developing AI models, now offers even more powerful capabilities. Developers and researchers are encouraged to explore the Nemotron-CC pipeline and contribute to the NeMo Curator repository to further refine and expand its functionality. To get started, visit the NeMo Curator GitHub repository, which provides detailed documentation, tutorials, and opportunities to contribute to the ongoing development of this groundbreaking tool.
