21 hours ago

Jing Huang Daniel Wurgaft Rachit Bansal Laura Ruis Naomi Saphra David Alvarez-Melis Andrew Lampinen Christopher Potts Ekdeep Singh Lubana

Table of Contents

Abstract

Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high frequency or low complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck, finding that it traces to a reduced interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, which means that they do not overwrite rare-task features as they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity.

One-sentence Summary

The authors develop a phenomenological argument validated through a synthetic task mixture and pretrained OLMo models ranging from 4M to 4B parameters, demonstrating that data-induced competition over neurons forces smaller models to prioritize high-frequency or low-complexity tasks whereas larger models circumvent this bottleneck through reduced interference that preserves rare-task features during gradient updates.

Key Contributions

This work develops a phenomenological argument demonstrating that power-law scaling allows larger models to learn portions of the data distribution inaccessible to smaller models, even given infinite training data. This theoretical framework posits that scaling inherently provides access to lower-order modes of the data distribution.
A synthetic setup consisting of a mixture of tasks reveals that smaller models allocate neurons to high-frequency tasks, leading to poor performance on rare and complex tasks due to data-induced resource competition. Results indicate larger models circumvent this bottleneck via a reduced interference mechanism where gradient updates for common tasks do not overwrite rare-task features.
The study validates these claims by pretraining OLMo models with parameters ranging from 4M to 4B on novel tasks of varying frequency and complexity. These experiments empirically support the claims regarding how scaling enables the learning of rare tasks through reduced interference.

Introduction

Modern machine learning relies on massive generalist models despite the high training and inference costs, yet the specific advantages of scaling parameters remain debated. Prior work often attributes performance gaps to sample efficiency or expressivity, implying smaller models could match larger ones with enough data. The authors argue that smaller models face a fundamental limitation where they fail to learn rare and complex tasks from a data mixture even with infinite training. They leverage a synthetic regression setup and pretrain OLMo models to validate that larger architectures reduce gradient interference between tasks. This mechanism allows larger models to retain features from infrequent data that smaller models overwrite due to resource competition. Their data-centric account explains the marginal benefits of scaling and informs practical decisions regarding model sizing and training data mixtures.

Dataset

Dataset Composition and Sources

The authors utilize Dolma v1.7 as the pre-training corpus, specifically selecting the first 50K batches totaling 210B tokens.
This data follows the exact token order used for OLMo-7B-0424 and OLMo-7B-0724 training runs.
Two special tasks are injected into the corpus to control task frequency: Comparison ( $T_{CMP}$ ) and Modular Addition ( $T_{ADD}$ ).

Key Details for Each Subset

Each task consists of 10K instances encoded as a three-token sequence (TOK1, TOK2, LABEL).
TOK1 and TOK2 are drawn from a set of 100 tokens randomly sampled from the vocabulary.
A bijective mapping assigns integer values from 0 to 99 to each token.
Comparison labels indicate if the first token value is less than the second.
Modular Addition labels represent the sum of both token values modulo 100.
Instances are split 50/50 for training and testing.

Model Usage and Training Mixture

OLMo models ranging from 4M to 4B parameters are trained on data mixtures with varying injection frequencies.
Task frequency is controlled between $7.8 \times 10^{-3}$ and $2.4 \times 10^{-8}$ , simulating ranges from 1K instances per batch to 1 instance every 10 batches.
Reference tasks ( $R_{cmp}$ and $R_{add}$ ) are sampled from pre-training data to ensure injected frequency matches natural task frequencies.
Performance is measured via training loss and test accuracy to distinguish between learning task distributions and memorization.

Processing and Injection Strategy

The injection process replaces the first four tokens of a training sequence with the task sequence plus an end of document token.
This replacement ensures the injected task frequency remains comparable to tasks learned during standard pre-training.
Feature geometry and task-relevant features are analyzed to verify scaling laws regarding model width and task frequency.

Method

The authors establish a multi-task learning framework to investigate how model capacity dictates the ability to learn tasks of varying frequency and complexity. They consider a mixture of $K$ linear regression tasks where the $k^{\text{th}}$ task appears with frequency $\pi_k$ and has a specific covariance structure $C_k$ . The student model employs a shared width- $N$ encoder $U \in \mathbb{R}^{d \times N}$ with orthonormal columns, paired with task-specific linear decoders $D_k$ . The prediction for task $k$ is given by $\hat{y}_k = D_k U^\top x$ , and the total loss is the weighted sum of the mean squared errors across all tasks.

Refer to the scaling regime diagram

The relationship between model size and loss is characterized by distinct scaling regimes. As illustrated in the scaling regime diagram, smaller models operating in the "Compute Optimal" regime may achieve low loss through data scaling, whereas larger models transition into a regime where "Learning requires model scaling." This transition highlights that increasing the parameter count ( $N$ ) is necessary to capture the lower-utility features associated with rarer tasks that smaller models fail to learn.

Theoretically, the authors derive that features are learned in order of their utility, defined as the product of task frequency and feature eigenvalue:

\nu_{k,j} = \pi_k \lambda_{k,j}

The optimal encoder for a width- $N$ model spans the top- $N$ eigenspace of the mixture covariance matrix $M = \sum_{k=1}^K \pi_k C_k$ . Consequently, a larger model retains features with lower utility, effectively allowing it to learn rarer or more complex tasks that are ignored by smaller models.

Refer to the alignment mechanism visualization

This selection process can be understood through the lens of gradient interference and feature alignment. In the geometric representation, the encoder attempts to align with task directions $T_f$ (frequent) and $T_r$ (rare). For a narrow model ( $N=1$ ), the encoder is pulled strongly toward the frequent task direction, causing the alignment with the rare task to degrade. As the width increases to $N=2$ , the model gains the capacity to span both directions simultaneously. The training dynamics plot confirms this behavior, showing that while frequent task observations pull the rare task alignment down, rare task observations push it up. Larger models stabilize this alignment, preventing the rare task from being overwritten by the dominant frequent tasks.

Experiment

Experiments on synthetic regression and realistic OLMo pretraining pipelines demonstrate that scaling model width reduces interference between frequent and rare tasks. Larger models retain rare task signals across observation gaps, while smaller models exhibit an update-and-forget dynamic where frequent updates overwrite rare features. Representational and gradient analysis confirms that increased capacity enables stable learning of low-frequency tasks without compromising common task performance.

Source PDF

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

21 hours ago

Jing Huang Daniel Wurgaft Rachit Bansal Laura Ruis Naomi Saphra David Alvarez-Melis Andrew Lampinen Christopher Potts Ekdeep Singh Lubana

Table of Contents

Abstract

One-sentence Summary

Key Contributions

This work develops a phenomenological argument demonstrating that power-law scaling allows larger models to learn portions of the data distribution inaccessible to smaller models, even given infinite training data. This theoretical framework posits that scaling inherently provides access to lower-order modes of the data distribution.
A synthetic setup consisting of a mixture of tasks reveals that smaller models allocate neurons to high-frequency tasks, leading to poor performance on rare and complex tasks due to data-induced resource competition. Results indicate larger models circumvent this bottleneck via a reduced interference mechanism where gradient updates for common tasks do not overwrite rare-task features.
The study validates these claims by pretraining OLMo models with parameters ranging from 4M to 4B on novel tasks of varying frequency and complexity. These experiments empirically support the claims regarding how scaling enables the learning of rare tasks through reduced interference.

Introduction

Dataset

Dataset Composition and Sources

The authors utilize Dolma v1.7 as the pre-training corpus, specifically selecting the first 50K batches totaling 210B tokens.
This data follows the exact token order used for OLMo-7B-0424 and OLMo-7B-0724 training runs.
Two special tasks are injected into the corpus to control task frequency: Comparison ( $T_{CMP}$ ) and Modular Addition ( $T_{ADD}$ ).

Key Details for Each Subset

Each task consists of 10K instances encoded as a three-token sequence (TOK1, TOK2, LABEL).
TOK1 and TOK2 are drawn from a set of 100 tokens randomly sampled from the vocabulary.
A bijective mapping assigns integer values from 0 to 99 to each token.
Comparison labels indicate if the first token value is less than the second.
Modular Addition labels represent the sum of both token values modulo 100.
Instances are split 50/50 for training and testing.

Model Usage and Training Mixture

OLMo models ranging from 4M to 4B parameters are trained on data mixtures with varying injection frequencies.
Task frequency is controlled between $7.8 \times 10^{-3}$ and $2.4 \times 10^{-8}$ , simulating ranges from 1K instances per batch to 1 instance every 10 batches.
Reference tasks ( $R_{cmp}$ and $R_{add}$ ) are sampled from pre-training data to ensure injected frequency matches natural task frequencies.
Performance is measured via training loss and test accuracy to distinguish between learning task distributions and memorization.

Processing and Injection Strategy

The injection process replaces the first four tokens of a training sequence with the task sequence plus an end of document token.
This replacement ensures the injected task frequency remains comparable to tasks learned during standard pre-training.
Feature geometry and task-relevant features are analyzed to verify scaling laws regarding model width and task frequency.

Method

Refer to the scaling regime diagram

Theoretically, the authors derive that features are learned in order of their utility, defined as the product of task frequency and feature eigenvalue:

\nu_{k,j} = \pi_k \lambda_{k,j}

Refer to the alignment mechanism visualization

Experiment

Source PDF

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Command Palette

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Jing Huang Daniel Wurgaft Rachit Bansal Laura Ruis Naomi Saphra David Alvarez-Melis Andrew Lampinen Christopher Potts Ekdeep Singh Lubana

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Build AI with AI

HyperAI Newsletters

Command Palette

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Jing Huang Daniel Wurgaft Rachit Bansal Laura Ruis Naomi Saphra David Alvarez-Melis Andrew Lampinen Christopher Potts Ekdeep Singh Lubana

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Build AI with AI

HyperAI Newsletters

Command Palette

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Jing Huang Daniel Wurgaft Rachit Bansal Laura Ruis Naomi Saphra David Alvarez-Melis Andrew Lampinen Christopher Potts Ekdeep Singh Lubana

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Build AI with AI

HyperAI Newsletters