Datology AI Unveils BeyondWeb: Synthetic Data Framework Boosts LLM Training Efficiency and Accuracy

Datology AI has unveiled BeyondWeb, a new framework that uses synthetic data to overcome the growing limitations in AI training data. As training budgets for large language models now span trillions of tokens, high-quality web data has become increasingly scarce. Datology AI identifies this scarcity as a critical bottleneck, which it calls the "wall of data," and positions BeyondWeb as a solution designed to generate more efficient, information-rich training material.

BeyondWeb works by restructuring existing web documents to be more concise, information-dense, and educationally focused. It reorganizes content to improve coherence and aligns the tone with instructional formats, making it better suited for training language models (a rough, illustrative sketch of this kind of rephrasing step appears at the end of this article). The framework not only enhances data quality but also accelerates training.

In benchmark tests, BeyondWeb delivered notable performance gains. On 8B-parameter models, it improved accuracy by 5.1 percentage points over Hugging Face's Cosmopedia and by 2.6 points over Nvidia's Nemotron-CC. These results were measured across 14 standard benchmarks in both zero-shot and five-shot settings.

Training efficiency also saw a major boost. BeyondWeb trained models 7.7 times faster than standard open web data and 2.7 times faster than Nemotron-Synth. Remarkably, a 3B-parameter model trained on BeyondWeb outperformed an 8B model trained on Cosmopedia when using the same number of tokens. After just 66 billion tokens, BeyondWeb reached around 64% accuracy, a level it hits 7.7 times faster than RedPajama and 2.7 times faster than Nemotron-Synth.

The research explored seven key aspects of synthetic data generation. One major insight: diversity is essential. While standard synthetic methods offer short-term gains, their limited stylistic variety leads to diminishing returns over time. Another finding: conversational content, which is crucial for real-world LLM applications, makes up less than 2.7% of typical web data. Adding more conversational examples helps, but the benefits plateau quickly (a second sketch at the end of this article illustrates the rebalancing idea).

Surprisingly, smaller models can be highly effective at generating high-quality synthetic data. The study found that moving from a 1B- to a 3B-parameter generator improved data quality by 1.5 percentage points, while further scaling to 8B offered little additional gain. This suggests that even organizations with limited computational resources can produce strong synthetic datasets. The team also tested multiple model families as reformulators and found that overall benchmark performance did not predict synthetic data quality, indicating that a model's ability to generate useful synthetic data depends on factors beyond raw size or general capability.

BeyondWeb has already been used to train ArceeAI's 4.5B-parameter AFM model, and Datology AI built a scalable pipeline capable of processing trillions of tokens for the task. Despite this success, the framework is not currently available for free research use.

Other major players have also advanced synthetic data techniques. Microsoft's Phi-4, released in December 2024, was trained on 400 billion synthetic tokens with textbook-style content and specialized "pivotal tokens" to enhance learning. Nvidia launched Nemotron-4 340B, an open-source pipeline in which 98% of the training data was synthetic. Researchers have also challenged the idea of "model collapse," showing that synthetic data can drive meaningful progress when properly designed. OpenAI confirmed during the GPT-5 announcement that the model was trained on synthetic data, likely generated by its internal o3 model.
While many companies use synthetic data mainly to reduce costs, OpenAI emphasized a focus on data quality and meaningful learning. Sébastien Bubeck, who previously led Microsoft's Phi project, highlighted this approach as key to long-term AI advancement.
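
To make the core idea more concrete, here is a minimal, illustrative sketch of the kind of source-rephrasing step the article describes: an existing web document is rewritten by a language model into a more concise, instruction-style passage. This is not Datology AI's actual pipeline; the prompt wording, the `generate` callback, and the length filter are assumptions made purely for illustration.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical rephrasing prompt; the real BeyondWeb prompts are not public.
REPHRASE_PROMPT = (
    "Rewrite the following web document as a concise, information-dense, "
    "instructional passage. Keep all facts, drop boilerplate and filler.\n\n"
    "Document:\n{document}\n\nRewritten passage:"
)

def rephrase_corpus(
    documents: Iterable[str],
    generate: Callable[[str], str],   # any LLM text-completion callable
    min_chars: int = 200,             # illustrative quality threshold
) -> Iterator[str]:
    """Turn raw web documents into instruction-style synthetic training text."""
    for doc in documents:
        rewritten = generate(REPHRASE_PROMPT.format(document=doc)).strip()
        # Simple sanity filter: skip outputs too short to be useful training text.
        if len(rewritten) >= min_chars:
            yield rewritten

# Example usage with a placeholder model call:
# def generate(prompt: str) -> str:
#     ...  # call a small rephraser model here
# synthetic_docs = list(rephrase_corpus(raw_web_docs, generate))
```

The study's finding that most of the quality gain is already captured when moving from a 1B to a 3B rephraser suggests the `generate` callback in a setup like this could be backed by a relatively small model.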
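
The observation that conversational text makes up less than 2.7% of web data, and that adding more of it helps only up to a point, can be illustrated with a simple rebalancing sketch. The target fraction and the `is_conversational` classifier below are hypothetical; the article does not describe how Datology AI actually mixes its data.

```python
import random
from typing import Callable, List

def rebalance_conversational(
    docs: List[str],
    is_conversational: Callable[[str], bool],  # hypothetical classifier
    target_fraction: float = 0.10,             # illustrative target, not from the paper
    seed: int = 0,
) -> List[str]:
    """Upsample conversational documents toward a target share of the training mix."""
    rng = random.Random(seed)
    conv = [d for d in docs if is_conversational(d)]
    other = [d for d in docs if not is_conversational(d)]
    if not conv:
        return docs
    # Number of conversational docs needed so they form `target_fraction` of the mix.
    needed = int(target_fraction * len(other) / (1.0 - target_fraction))
    extra = max(0, needed - len(conv))
    upsampled = conv + [rng.choice(conv) for _ in range(extra)]
    mixed = other + upsampled
    rng.shuffle(mixed)
    return mixed
```

Because the benefits plateau quickly, pushing `target_fraction` much higher would mostly add duplicated conversational text rather than new signal, which is consistent with the article's broader point that stylistic diversity, not volume alone, drives gains.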
