21 hours ago

Jing Huang Daniel Wurgaft Rachit Bansal Laura Ruis Naomi Saphra David Alvarez-Melis Andrew Lampinen Christopher Potts Ekdeep Singh Lubana

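Abstract

Larger models learn tasks that smaller models cannot. What drives this phenomenon? We develop a simple phenomenological argument suggesting that the very existence of power-law scaling implies that, even assuming infinite training data, larger models can learn parts of the data distribution that smaller models fail to learn. To test this claim and identify its cause, we study the effect of model scale in a synthetic setup consisting of a mixture of tasks that exhibits monotonic scaling curves. The results show that data-induced competition over resources (neurons) is key. Specifically, smaller models allocate neurons to tasks with higher frequency or lower complexity, and as a result learn poorly performing solutions for rare and complex tasks. This happens even when a solution capable of representing the desired tasks exists. We then evaluate how larger models circumvent this data-centric bottleneck and find that the cause is a reduced-interference mechanism: larger models can allocate sufficient resources to common tasks, so gradient updates for those tasks become weaker. As a result, rare-task features are not overwritten while they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks with different frequencies and complexities.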

One-sentence Summary

The authors develop a phenomenological argument validated through a synthetic task mixture and pretrained OLMo models ranging from 4M to 4B parameters, demonstrating that data-induced competition over neurons forces smaller models to prioritize high-frequency or low-complexity tasks whereas larger models circumvent this bottleneck through reduced interference that preserves rare-task features during gradient updates.

Key Contributions

This work develops a phenomenological argument demonstrating that power-law scaling allows larger models to learn portions of the data distribution inaccessible to smaller models, even given infinite training data. This theoretical framework posits that scaling inherently provides access to lower-order modes of the data distribution.
A synthetic setup consisting of a mixture of tasks reveals that smaller models allocate neurons to high-frequency tasks, leading to poor performance on rare and complex tasks due to data-induced resource competition. Results indicate larger models circumvent this bottleneck via a reduced interference mechanism where gradient updates for common tasks do not overwrite rare-task features.
The study validates these claims by pretraining OLMo models with parameters ranging from 4M to 4B on novel tasks of varying frequency and complexity. These experiments empirically support the claims regarding how scaling enables the learning of rare tasks through reduced interference.

Introduction

Modern machine learning relies on massive generalist models despite their high training and inference costs, yet the specific advantages of scaling parameters remain debated. Prior work often attributes performance gaps to sample efficiency or expressivity, implying that smaller models could match larger ones given enough data. The authors argue that smaller models face a fundamental limitation: they fail to learn rare and complex tasks from a data mixture even with infinite training data. They use a synthetic regression setup and pretrain OLMo models to validate that larger architectures reduce gradient interference between tasks. This mechanism allows larger models to retain features from infrequent data that smaller models overwrite due to resource competition. Their data-centric account explains the marginal benefits of scaling and informs practical decisions regarding model sizing and training data mixtures.

Dataset

Dataset Composition and Sources

The authors utilize Dolma v1.7 as the pre-training corpus, specifically selecting the first 50K batches totaling 210B tokens.
This data follows the exact token order used for OLMo-7B-0424 and OLMo-7B-0724 training runs.
Two special tasks are injected into the corpus to control task frequency: Comparison ($T_{CMP}$) and Modular Addition ($T_{ADD}$).

Key Details for Each Subset

Each task consists of 10K instances encoded as a three-token sequence (TOK1, TOK2, LABEL).
TOK1 and TOK2 are drawn from a set of 100 tokens randomly sampled from the vocabulary.
A bijective mapping assigns integer values from 0 to 99 to each token.
Comparison labels indicate if the first token value is less than the second.
Modular Addition labels represent the sum of both token values modulo 100.
Instances are split 50/50 for training and testing.
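The task construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the vocabulary size, the seeds, and the choice to enumerate all 100 x 100 ordered token pairs (which gives 10K instances) are assumptions, and labels are kept as plain integers because the summary does not say how they are mapped to label tokens.

```python
import random

def build_tasks(vocab_size=50_000, n_symbols=100, seed=0):
    """Build Comparison and Modular Addition instances as (TOK1, TOK2, LABEL)."""
    rng = random.Random(seed)
    # 100 distinct token ids sampled from the vocabulary (vocab_size is hypothetical).
    tokens = rng.sample(range(vocab_size), n_symbols)
    # Bijective mapping from token id to an integer value in 0..99.
    value = {tok: v for v, tok in enumerate(tokens)}
    cmp_data, add_data = [], []
    for a in tokens:
        for b in tokens:
            # Comparison: 1 if the first value is less than the second, else 0.
            cmp_data.append((a, b, int(value[a] < value[b])))
            # Modular Addition: sum of both values modulo 100.
            add_data.append((a, b, (value[a] + value[b]) % n_symbols))
    return value, cmp_data, add_data

def split_half(instances, seed=0):
    """Shuffle a copy of the instances and split it 50/50 into train and test."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

value, cmp_data, add_data = build_tasks()
cmp_train, cmp_test = split_half(cmp_data)
add_train, add_test = split_half(add_data)
print(len(cmp_data), len(cmp_train), len(cmp_test))
```

Because half of the token pairs are held out, test accuracy measures whether the model has learned the underlying rule rather than memorized the training pairs.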

Model Usage and Training Mixture

OLMo models ranging from 4M to 4B parameters are trained on data mixtures with varying injection frequencies.
Task frequency is varied between $7.8 \times 10^{-3}$ and $2.4 \times 10^{-8}$, simulating a range from 1K instances per batch down to 1 instance every 10 batches.
Reference tasks ($R_{cmp}$ and $R_{add}$) are sampled from the pre-training data to ensure the injected frequencies match natural task frequencies.
Performance is measured via training loss and test accuracy to distinguish between learning task distributions and memorization.
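As a quick sanity check on the lower end of the frequency range, the arithmetic can be worked through from the corpus figures given earlier (50K batches, 210B tokens). The sketch assumes that frequency is measured as injected instances per pre-training token; the summary does not state the normalization, so this reading is an assumption.

```python
# Corpus figures quoted in the summary.
total_tokens = 210e9
n_batches = 50_000

# Average number of tokens per batch.
tokens_per_batch = total_tokens / n_batches

# Assumed convention: frequency = injected instances per token.
# One instance every 10 batches then gives the lowest frequency.
lowest_freq = 1 / (10 * tokens_per_batch)

print(f"tokens per batch: {tokens_per_batch:.2e}")
print(f"lowest frequency: {lowest_freq:.2e}")  # about 2.4e-8
```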

Processing and Injection Strategy

The injection process replaces the first four tokens of a training sequence with the three-token task sequence plus an end-of-document token.
This replacement ensures the injected task frequency remains comparable to tasks learned during standard pre-training.
Feature geometry and task-relevant features are analyzed to verify scaling laws regarding model width and task frequency.
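The injection step described above can be sketched as a simple prefix replacement. The function name `inject` and the end-of-document token id are hypothetical; the real id depends on the tokenizer, which the summary does not specify.

```python
EOD_TOKEN_ID = 0  # hypothetical end-of-document token id

def inject(sequence, instance, eod_id=EOD_TOKEN_ID):
    """Overwrite the first four tokens with (TOK1, TOK2, LABEL, EOD).

    The sequence length is unchanged, so batch shapes stay the same.
    """
    tok1, tok2, label = instance
    prefix = [tok1, tok2, label, eod_id]
    if len(sequence) < len(prefix):
        raise ValueError("sequence is shorter than the injected prefix")
    return prefix + list(sequence[len(prefix):])

original = list(range(10, 20))
injected = inject(original, (1, 2, 3))
print(injected)
```

The end-of-document token separates the injected instance from the remaining pre-training text in the same sequence.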

Method

The authors establish a multi-task learning framework to investigate how model capacity dictates the ability to learn tasks of varying frequency and complexity. They consider a mixture of $K$ linear regression tasks where the $k^{\text{th}}$ task appears with frequency $\pi_k$ and has a specific covariance structure $C_k$. The student model employs a shared width-$N$ encoder $U \in \mathbb{R}^{d \times N}$ with orthonormal columns, paired with task-specific linear decoders $D_k$. The prediction for task $k$ is given by $\hat{y}_k = D_k U^\top x$, and the total loss is the weighted sum of the mean squared errors across all tasks.
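A minimal NumPy sketch of this student model and its loss is given below. It assumes the per-task weights in the loss are the task frequencies $\pi_k$ and uses scalar targets; the dimensions, the random data, and the helper names are illustrative and not taken from the paper.

```python
import numpy as np

def orthonormal_encoder(d, N, rng):
    """Random d x N matrix with orthonormal columns, via reduced QR."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, N)))
    return Q

def predict(U, D_k, X):
    """Row-wise version of y_hat = D_k U^T x for a batch X of shape (n, d)."""
    return X @ U @ D_k.T

def mixture_loss(U, decoders, data, freqs):
    """Sum over tasks of pi_k times the mean squared error on task k."""
    total = 0.0
    for D_k, (X, Y), pi_k in zip(decoders, data, freqs):
        err = predict(U, D_k, X) - Y
        total += pi_k * np.mean(np.sum(err ** 2, axis=1))
    return total

rng = np.random.default_rng(0)
d, N, K = 8, 3, 2                      # illustrative sizes
U = orthonormal_encoder(d, N, rng)
decoders = [rng.standard_normal((1, N)) for _ in range(K)]
freqs = [0.9, 0.1]                     # assumed task frequencies pi_k

data = []
for D_k in decoders:
    X = rng.standard_normal((64, d))
    data.append((X, predict(U, D_k, X)))  # targets the student can fit exactly

loss = mixture_loss(U, decoders, data, freqs)
print(loss)
```

Here the targets are generated by the student itself, so the loss is zero by construction; with targets that depend on directions outside the span of $U$, the loss would be bounded below by the energy the encoder cannot capture.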

Figure: Scaling regime diagram

The relationship between model size and loss is characterized by distinct scaling regimes. As illustrated in the scaling regime diagram, smaller models operating in the "Compute Optimal" regime may achieve low loss through data scaling, whereas larger models transition into a regime where "Learning requires model scaling." This transition highlights that increasing the parameter count ($N$) is necessary to capture the lower-utility features associated with rarer tasks that smaller models fail to learn.

Theoretically, the authors derive that features are learned in order of their utility, defined as the product of task frequency and feature eigenvalue:

$$\nu_{k,j} = \pi_k \lambda_{k,j}$$

The optimal encoder for a width-$N$ model spans the top-$N$ eigenspace of the mixture covariance matrix $M = \sum_{k=1}^K \pi_k C_k$. Consequently, a larger model retains features with lower utility, effectively allowing it to learn rarer or more complex tasks that are ignored by smaller models.
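The utility ordering can be illustrated with a small numerical example. The sketch below assumes that the tasks occupy mutually orthogonal feature subspaces, so that the eigenvalues of $M$ are exactly the utilities $\pi_k \lambda_{k,j}$; the frequencies and eigenvalues are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical mixture: three tasks, two features each.
freqs = np.array([0.90, 0.09, 0.01])          # pi_k
eigvals = np.array([[1.0, 0.5],
                    [1.0, 0.5],
                    [1.0, 0.5]])              # lambda_{k,j}

# Utility nu_{k,j} = pi_k * lambda_{k,j}.
utility = freqs[:, None] * eigvals

def retained_features(utility, N):
    """Return the (task, feature) pairs with the N largest utilities."""
    K, J = utility.shape
    ranked = sorted(((utility[k, j], k, j) for k in range(K) for j in range(J)),
                    reverse=True)
    return [(k, j) for _, k, j in ranked[:N]]

# With orthogonal task subspaces, M is diagonal with the utilities on the diagonal.
M = np.diag(utility.ravel())
M_eigs = np.sort(np.linalg.eigvalsh(M))[::-1]

for N in (2, 4, 6):
    tasks = sorted({k for k, _ in retained_features(utility, N)})
    print(f"N={N}: tasks with at least one retained feature: {tasks}")
```

In this example a width-2 encoder spends both dimensions on the most frequent task, and the rarest task only receives a feature once the width exceeds four.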

Figure: Alignment mechanism visualization

This selection process can be understood through the lens of gradient interference and feature alignment. In the geometric representation, the encoder attempts to align with task directions $T_f$ (frequent) and $T_r$ (rare). For a narrow model ($N=1$), the encoder is pulled strongly toward the frequent task direction, causing the alignment with the rare task to degrade. As the width increases to $N=2$, the model gains the capacity to span both directions simultaneously. The training dynamics plot confirms this behavior, showing that while frequent task observations pull the rare task alignment down, rare task observations push it up. Larger models stabilize this alignment, preventing the rare task from being overwritten by the dominant frequent tasks.
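The static part of this geometric picture can be reproduced in two dimensions. The sketch below is an illustration under stated assumptions: each task contributes a single rank-one direction, the angle between $T_f$ and $T_r$ is set to 60 degrees, and the frequencies are 0.95 and 0.05. None of these values come from the paper.

```python
import numpy as np

theta = np.deg2rad(60.0)                       # hypothetical angle between tasks
t_f = np.array([1.0, 0.0])                     # frequent task direction T_f
t_r = np.array([np.cos(theta), np.sin(theta)]) # rare task direction T_r
pi_f, pi_r = 0.95, 0.05                        # hypothetical task frequencies

# Mixture covariance with one rank-one term per task.
M = pi_f * np.outer(t_f, t_f) + pi_r * np.outer(t_r, t_r)

def alignment(N):
    """Squared projection of each task direction onto the top-N eigenspace of M."""
    _, V = np.linalg.eigh(M)        # eigenvalues in ascending order
    U = V[:, ::-1][:, :N]           # top-N eigenvectors as columns
    return {
        "frequent": float(np.sum((U.T @ t_f) ** 2)),
        "rare": float(np.sum((U.T @ t_r) ** 2)),
    }

for N in (1, 2):
    print(N, alignment(N))
```

With $N=1$ the single encoder direction sits almost on top of $T_f$, so the rare direction is only partially captured; with $N=2$ the encoder spans the whole plane and both alignments equal one.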

Experiment

Experiments on synthetic regression and realistic OLMo pretraining pipelines demonstrate that scaling model width reduces interference between frequent and rare tasks. Larger models retain rare task signals across observation gaps, while smaller models exhibit an update-and-forget dynamic where frequent updates overwrite rare features. Representational and gradient analysis confirms that increased capacity enables stable learning of low-frequency tasks without compromising common task performance.
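The update-and-forget dynamic of a narrow model can be mimicked with a toy simulation. This is not the authors' experiment: it tracks a single unit-norm encoder direction in two dimensions, applies a normalized gradient-ascent step on the squared alignment with whichever task is observed, and shows a rare task once every 20 steps. The angle, learning rate, and schedule are all hypothetical.

```python
import numpy as np

theta = np.deg2rad(60.0)
t_f = np.array([1.0, 0.0])                      # frequent task direction
t_r = np.array([np.cos(theta), np.sin(theta)])  # rare task direction

def step(u, t, lr=0.05):
    """One gradient-ascent step on (u . t)^2, followed by renormalization."""
    u = u + lr * 2.0 * (u @ t) * t
    return u / np.linalg.norm(u)

def run(n_steps=200, rare_every=20):
    """Return (kind, change in rare alignment, rare alignment) for each step."""
    start = np.deg2rad(30.0)
    u = np.array([np.cos(start), np.sin(start)])  # start between the two tasks
    history = []
    for i in range(1, n_steps + 1):
        is_rare = (i % rare_every == 0)
        before = (u @ t_r) ** 2
        u = step(u, t_r if is_rare else t_f)
        after = (u @ t_r) ** 2
        history.append(("rare" if is_rare else "frequent", after - before, after))
    return history

history = run()
rare_gain = sum(delta for kind, delta, _ in history if kind == "rare")
frequent_loss = sum(delta for kind, delta, _ in history if kind == "frequent")
print(f"total change from rare steps:     {rare_gain:+.3f}")
print(f"total change from frequent steps: {frequent_loss:+.3f}")
```

Every rare observation raises the rare-task alignment and every frequent observation lowers it, so with a single direction the gains made on rare steps are eroded during the gaps between them. A wider encoder that can dedicate a separate direction to the rare task would not face this trade-off.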



Why Do Larger Models Learn More: The Effects of Capacity, Interference, and Rare-Task Retention
