HyperAI


MIT Researchers Develop Universal Guide to Accurately Predict LLM Performance Using Scaling Laws

A new study from MIT and the MIT-IBM Watson AI Lab introduces a comprehensive guide to improving the accuracy of scaling laws for large language models (LLMs), offering a systematic approach to predicting the performance of large models using smaller, more affordable ones. The research, presented at the International Conference on Machine Learning (ICML 2025), addresses a critical challenge in AI development: how to make informed decisions about model size, training data, and computational budget without incurring the high cost of full-scale training.

Scaling laws have long been used to estimate how a large model will perform based on the behavior of smaller models in the same family. These laws typically relate model performance, measured by loss, to key variables such as the number of parameters and the number of training tokens. However, the field has lacked consistency: researchers often build isolated scaling laws for individual models or model families, leading to unreliable or contradictory predictions.

To address this, the MIT-IBM team compiled a massive dataset of 485 pre-trained LLMs spanning 40 model families, including Pythia, OPT, LLaMA, Bloom, GPT, and T5-Pile. The dataset covers training checkpoints, computational costs (FLOPs), training epochs, random seeds, and more than 1.9 million performance metrics. Using it, the team fitted more than 1,000 scaling laws and evaluated their predictive accuracy across different architectures, training regimes, and model sizes.

A key metric was absolute relative error (ARE), which measures the gap between a scaling law's prediction and the actual loss of a fully trained large model, expressed as a fraction of that loss. The team found that while perfect accuracy (0% ARE) is unattainable because of inherent randomness from training seeds, an ARE of about 4% is the practical lower limit, and predictions within 20% ARE still remain useful for strategic decisions.

The study revealed several practical insights.
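To make the two ideas above concrete, the sketch below pairs a Chinchilla-style power law, a common functional form for loss as a function of parameters and tokens, with the ARE metric. The functional form and all constants are illustrative assumptions, not the values fitted in the MIT-IBM study:

```python
def scaling_law_loss(n_params, n_tokens, E=1.7, A=400.0, B=2000.0,
                     alpha=0.34, beta=0.28):
    """Chinchilla-style scaling law: irreducible loss E plus terms that
    shrink with parameter count and training tokens. Constants here are
    illustrative placeholders, not fitted values from the study."""
    return E + A / n_params**alpha + B / n_tokens**beta

def absolute_relative_error(predicted, observed):
    """ARE: |prediction - actual| / actual, the paper's accuracy metric."""
    return abs(predicted - observed) / observed

# An ARE near 4% sits at the practical floor set by seed randomness;
# anything up to ~20% is still useful for strategic decisions.
are = absolute_relative_error(predicted=2.08, observed=2.00)
print(f"ARE = {are:.1%}")  # 4.0%
```

Because the law is monotone in both arguments, adding parameters or tokens always lowers the predicted loss toward the irreducible floor E, which is what makes extrapolation from small, cheap models possible.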
Including intermediate training checkpoints, rather than only final model losses, significantly improves prediction reliability, although measurements taken before roughly 10 billion training tokens are noisy and should be excluded. Training five models across a range of sizes provides a solid foundation for robust scaling-law estimation. Perhaps most surprisingly, partially training the target model on about 30% of its dataset can already yield strong predictions, saving substantial compute.

The researchers also found that scaling laws fitted to one model family can be adapted to others with similar architectures, though this transfer is less reliable for encoder-decoder models. Moreover, just three of five key hyperparameters explain nearly all of the variation in model behavior across families, suggesting a high degree of consistency in how models scale.

An unexpected finding was that scaling laws work in both directions: they can predict the performance of smaller models from large ones as well as large models from small ones, challenging the idea that small and large models behave fundamentally differently.

Looking ahead, the team plans to extend the work to inference-time scaling laws, which would predict how much computational effort a model needs at runtime to produce a high-quality response. As co-author Jacob Andreas notes, this is increasingly important because models are no longer trained once and used forever; each user query requires a dynamic decision about how much "thinking" the model should do, so predictive models for inference time could become just as crucial as those for training.

This research provides a much-needed framework for making scaling-law estimation more reliable, efficient, and accessible, empowering researchers with limited resources to make better decisions in the rapidly evolving field of AI.
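The estimation recipe above (several small models, intermediate checkpoints, early-token filtering, then extrapolation) can be sketched end to end. The checkpoint data, the simple compute-only power law, and the fitted exponent are all synthetic assumptions for illustration; the study fits richer multi-variable laws:

```python
import math

TOKEN_CUTOFF = 10e9  # discard noisy checkpoints taken before ~10B tokens

def fit_power_law(points):
    """Least-squares fit of loss ~ a * compute**(-b) in log-log space.
    points: iterable of (compute_flops, loss) pairs."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(l) for _, l in points]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return math.exp(ybar - slope * xbar), -slope  # (a, b)

# Synthetic checkpoints from a few small models: (tokens_seen, flops, loss).
checkpoints = [
    (5e9,  1e18, 4.10),   # early checkpoint: filtered out as noisy
    (2e10, 4e18, 3.52),
    (2e10, 8e18, 3.28),
    (4e10, 2e19, 2.99),
    (4e10, 5e19, 2.73),
    (8e10, 1e20, 2.55),
]
usable = [(c, l) for t, c, l in checkpoints if t >= TOKEN_CUTOFF]
a, b = fit_power_law(usable)

# Extrapolate two orders of magnitude beyond the largest fitted run.
print(f"predicted loss at 1e22 FLOPs: {a * 1e22 ** (-b):.2f}")
```

The same fitted curve can also be read in the reverse direction, from large runs down to small ones, which mirrors the study's finding that the relationship holds both ways.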
