4달 전

Shayne Longpre Sneha Kudugunta Niklas Muennighoff I-Hung Hsu Isaac Caswell Alex Pentland Sercan Arik Chen-Yu Lee Sayna Ebrahimi

초록

스케일링 법칙 연구는 오랫동안 영어에 집중되어 왔으나, 가장 주목받는 인공지능 모델들은 실제로 수십억 명의 국제 사용자를 직접 서비스하고 있다. 본 연구에서는 지금까지 가장 규모가 큰 다국어 스케일링 법칙 연구를 수행하였으며, 총 774개의 다국어 학습 실험을 진행하여 모델 파라미터 수는 10M에서 8B까지, 학습 언어는 400개 이상, 평가 언어는 48개에 이르렀다. 우리는 단일 언어 및 다국어 사전학습 모두에 적용 가능한 적응형 전이 스케일링 법칙(Adaptive Transfer Scaling Law, ATLAS)을 제안하였으며, 기존 스케일링 법칙보다 샘플 외 일반화 성능을 평균 0.3 이상의 R² 향상으로 뛰어넘었다. 실험 분석을 통해 다국어 학습의 동역학, 언어 간 전이 특성, 그리고 다국어화가 초래하는 ‘역설적 과부하’(curse of multilinguality)에 대한 통찰을 도출하였다. 첫째, 38개 언어 간 쌍(38 × 38 = 1,444쌍)에 대한 상호 이익 점수를 실증적으로 측정하는 다국어 전이 행렬을 도출하였다. 둘째, 언어에 의존하지 않는 스케일링 법칙을 제시하여, 새로운 언어를 추가할 때 성능 훼손 없이 모델 크기와 데이터 양을 어떻게 최적화할 수 있는지를 밝혔다. 셋째, 새로운 사전학습을 시작할지, 다국어 체크포인트로부터 미세조정(finetune)할지를 결정하는 계산적 전환점(computational crossover points)을 규명하였다. 이러한 발견들이 언어 간 스케일링 법칙의 민주화를 위한 과학적 기반을 마련하고, 영어 중심의 AI에 그치지 않고, 다양한 언어 환경에서도 효율적으로 모델을 확장할 수 있도록 실무자들에게 도움이 되기를 기대한다.

One-sentence Summary

Researchers from MIT, Stanford, Google Cloud AI, and Google DeepMind introduce ATLAS, a novel scaling law for multilingual models that outperforms prior methods by 0.3+ R², enabling efficient cross-lingual scaling and transfer while addressing the curse of multilinguality for global AI democratization.

Key Contributions

We introduce ATLAS, a novel adaptive transfer scaling law for monolingual and multilingual pretraining that significantly improves out-of-sample generalization over prior scaling laws, often by more than 0.3 R², based on 774 experiments across 400+ languages and model sizes from 10M to 8B parameters.
We construct the first large-scale 38×38 cross-lingual transfer matrix, empirically quantifying mutual benefit and interference across 1444 language pairs, offering a foundational resource for understanding multilingual learning dynamics and transfer efficiency.
We derive practical scaling guidelines for the “curse of multilinguality” and identify computational crossover points between pretraining from scratch and finetuning from multilingual checkpoints, enabling efficient scaling decisions when expanding language coverage.

Introduction

The authors leverage large-scale multilingual experiments to address the lack of scaling laws beyond English, which matters because most real-world AI models serve global users but are optimized using English-centric frameworks. Prior work either focused narrowly on small models, limited language pairs, or ignored how adding languages degrades performance due to capacity constraints—known as the “curse of multilinguality.” Their main contribution is ATLAS, an adaptive transfer scaling law that improves out-of-sample generalization by over 0.3 R², alongside a 38x38 cross-lingual transfer matrix, a scaling law for managing multilingual capacity trade-offs, and a practical formula to decide when pretraining from scratch beats fine-tuning from multilingual checkpoints.

Dataset

The authors use the MADLAD-400 dataset — a CommonCrawl-derived corpus covering 400+ languages — as the primary pretraining source, with curated multilingual filtering and preprocessing. For evaluation, they select 50 languages that overlap between MADLAD-400 and Flores-200, chosen to span diverse families, scripts, and resource levels, using a stratified sampling method with manual adjustments to avoid typological redundancy.

Key subset details:

MADLAD-400 Test Set: For each of the 50 languages, they sample 20M tokens or 20% of total tokens (whichever is smaller), totaling ~20.5M tokens per language, to ensure stable loss measurements using the vocabulary-insensitive metric from Tao et al. (2024).
Flores-101: Used as a secondary test set; monolingual sequences are extracted from the parallel corpus to compute comparable sequence-level losses.

Training setup:

Models range from 10M to 8B parameters, using a 64K SentencePiece vocabulary.
They train 280 monolingual, 240 bilingual, and 120 multilingual models (including Unimax-style mixtures), plus 130 finetuned models — over 750 total runs.
Experiments vary model size (N), training tokens (D), and language mixtures (M), with core focus on English, French, Chinese, Hindi, Swahili, Russian, Spanish, Portuguese, and Japanese.
Training hyperparameters follow Kudugunta et al. (2024), with refinements based on recent work.

Evaluation strategy:

Scaling laws are evaluated via R² on held-out test sets, separately for four dimensions: largest model sizes (N), largest token counts (D), highest compute (C), and unseen language mixtures (M).
R² is averaged across languages: [EN, FR, RU, ZH, HI, SW] for monolingual, plus [ES, DE] for multilingual settings.
The vocabulary-insensitive loss ensures fair cross-lingual comparison across all test sets.

Method

The authors introduce the Adaptive Transfer Scaling Law (ATLAS), a unified framework designed to model scaling behavior in both monolingual and multilingual settings while accounting for data repetition and cross-lingual transfer effects. ATLAS builds upon the standard scaling law form, which expresses loss as a function of model size $N$ and effective data exposure $\mathcal{D}_{\text{eff}}$ , given by:

\mathcal{L}(N, \mathcal{D}_{\text{eff}}) = E + \frac{A}{N^{\alpha}} + \frac{B}{\mathcal{D}_{\text{eff}}^{\beta}}

The core innovation lies in the definition of $\mathcal{D}_{\text{eff}}$ , which decomposes total data exposure into three distinct components: a monolingual term for the target language, a transfer language term capturing cross-lingual influence, and a remainder term for all other languages. This decomposition is formalized as:

\mathcal{D}_{\text{eff}} = \underbrace{\mathcal{S}_{\lambda}(D_t; U_t)}_{\text{Monolingual}} + \underbrace{\sum_{i \in \mathcal{K}} \tau_i \, \mathcal{S}_{\lambda}(D_i; U_i)}_{\text{Transfer Languages}} + \underbrace{\tau_{\text{other}} \, \mathcal{S}_{\lambda}(D_{\text{other}}; U_{\text{other}})}_{\text{Other Languages}}

Each component uses a saturation function $\mathcal{S}_{\lambda}(D; U)$ to model the diminishing returns of data repetition beyond one epoch. This function is defined as:

\mathcal{S}_{\lambda}(D; U) = \begin{cases} D, & D \leq U \quad (\leq 1 \text{ epoch}) \\ U \left[1 + \frac{1 - \exp(-\lambda (D/U - 1))}{\lambda} \right], & D > U \quad (> 1 \text{ epoch}) \end{cases}

The parameter $\lambda$ governs the rate of saturation and is shared across all data sources, ensuring consistency in how repetition affects effective data exposure. The transfer weights $\tau_i$ and $\tau_{\text{other}}$ are learned parameters that quantify the relative contribution of each language group to the target language's performance. These weights are initialized using language transfer scores derived from prior analysis, enabling ATLAS to adaptively model cross-lingual transfer dynamics.

The framework is designed to be robust across varying data collection constraints and training regimes. By explicitly separating monolingual and transfer contributions, ATLAS avoids the limitations of prior approaches that either ignore data repetition or fail to distinguish between different types of cross-lingual effects. This modular design allows the model to generalize effectively to out-of-sample dimensions, including larger model sizes, extended data ranges, and unseen language mixtures. The inclusion of a single, shared repetition parameter $\lambda$ and a flexible transfer weight structure ensures that ATLAS remains computationally efficient while capturing complex scaling behaviors in multilingual settings.

Experiment

Constructed a 30×30 (and 38×38) cross-lingual transfer matrix using Bilingual Transfer Score (BTS), revealing that English is the top source language for 19/30 targets, followed by French (16/30), Spanish (13/30), and Hebrew (11/30); low-resource languages like Urdu and Pashto show consistent negative transfer.
Found strong correlation between positive transfer and shared script (mean BTS: -0.23) or language family (p < 0.001), with script having larger effect size than family; transfer is symmetric only within shared script/family pairs (e.g., French-Spanish), but asymmetric otherwise (e.g., Chinese-Farsi).
Quantified the “curse of multilinguality”: adding languages (K) increases target loss, mitigated more effectively by scaling model size (N) than data (D); scaling law L(K,N,D) = L∞ + A·K^φ/N^α + B·K^ψ/D^β fitted with φ=0.11, ψ=-0.04 (R²≥0.87), confirming mild capacity penalty offset by positive transfer.
Derived compute-optimal scaling: to expand from K to r·K languages without loss degradation, scale model size by r^(φ/α) ≈ r^0.11/α and total tokens by r^(1+ψ/β) ≈ r^(1-0.04/β); e.g., 4× languages requires 1.4× N and 2.74× D_tot, enabling 32% less data per language via transfer.
For pretrain vs finetune: finetuning from Unimax checkpoint outperforms scratch pretraining under 144B–283B tokens; beyond that, scratch pretraining wins; compute threshold modeled as log(C) = 1.1M × N^1.65, enabling practitioners to choose based on budget and model size.
Validated BTS scalability: transfer scores improve with larger models and stabilize early in training; larger models mitigate negative transfer (e.g., en→zh), turning interference toward neutrality, while synergistic pairs (e.g., es→pt) show modest gains.

The authors use bilingual transfer scores to measure how training on a source language affects the performance of a target language, with positive scores indicating beneficial transfer and negative scores indicating interference. The results show that language similarity, particularly shared language families and scripts, strongly predicts transfer effectiveness, with English being the most beneficial source language for many targets, while low-resource languages often exhibit negative transfer.

The authors use a comprehensive set of experiments to measure cross-lingual transfer and the impact of multilingual training on model performance. Results show that language similarity, particularly shared language family and script, strongly predicts positive transfer, with English being the most beneficial source language for many targets. Additionally, larger models mitigate negative interference and improve transfer efficiency, while the curse of multilinguality is partially offset by positive cross-lingual transfer.

The authors use bilingual transfer scores to measure how training on a source language affects the performance of a target language, finding that language similarity—particularly shared language family and script—strongly predicts positive transfer. Results show that pairs sharing both family and script exhibit the most symmetric and positive transfer, while those with differing family and script show greater asymmetry and negative transfer, with script similarity having a stronger effect than family.

The authors use a range of model scales from 10 million to 8 billion parameters, with varying numbers of attention heads, layers, embedding sizes, and feedforward widths, to study multilingual learning dynamics. The table provides the specific architectural configurations for each scale, which are used in experiments to measure language transfer and model capacity effects.

The authors compare scaling laws for monolingual and multilingual models, finding that their proposed multilingual scaling law achieves strong performance across different evaluation metrics. The [Ours] ATLAS (Dt + Dother + ∑κt Di) model outperforms existing baselines, particularly in multilingual settings, with high R² values indicating robust fit and generalization.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

4달 전

Shayne Longpre Sneha Kudugunta Niklas Muennighoff I-Hung Hsu Isaac Caswell Alex Pentland Sercan Arik Chen-Yu Lee Sayna Ebrahimi

초록

One-sentence Summary

Key Contributions

We introduce ATLAS, a novel adaptive transfer scaling law for monolingual and multilingual pretraining that significantly improves out-of-sample generalization over prior scaling laws, often by more than 0.3 R², based on 774 experiments across 400+ languages and model sizes from 10M to 8B parameters.
We construct the first large-scale 38×38 cross-lingual transfer matrix, empirically quantifying mutual benefit and interference across 1444 language pairs, offering a foundational resource for understanding multilingual learning dynamics and transfer efficiency.
We derive practical scaling guidelines for the “curse of multilinguality” and identify computational crossover points between pretraining from scratch and finetuning from multilingual checkpoints, enabling efficient scaling decisions when expanding language coverage.

Introduction

Dataset

Key subset details:

MADLAD-400 Test Set: For each of the 50 languages, they sample 20M tokens or 20% of total tokens (whichever is smaller), totaling ~20.5M tokens per language, to ensure stable loss measurements using the vocabulary-insensitive metric from Tao et al. (2024).
Flores-101: Used as a secondary test set; monolingual sequences are extracted from the parallel corpus to compute comparable sequence-level losses.

Training setup:

Models range from 10M to 8B parameters, using a 64K SentencePiece vocabulary.
They train 280 monolingual, 240 bilingual, and 120 multilingual models (including Unimax-style mixtures), plus 130 finetuned models — over 750 total runs.
Experiments vary model size (N), training tokens (D), and language mixtures (M), with core focus on English, French, Chinese, Hindi, Swahili, Russian, Spanish, Portuguese, and Japanese.
Training hyperparameters follow Kudugunta et al. (2024), with refinements based on recent work.

Evaluation strategy:

Scaling laws are evaluated via R² on held-out test sets, separately for four dimensions: largest model sizes (N), largest token counts (D), highest compute (C), and unseen language mixtures (M).
R² is averaged across languages: [EN, FR, RU, ZH, HI, SW] for monolingual, plus [ES, DE] for multilingual settings.
The vocabulary-insensitive loss ensures fair cross-lingual comparison across all test sets.

Method

\mathcal{L}(N, \mathcal{D}_{\text{eff}}) = E + \frac{A}{N^{\alpha}} + \frac{B}{\mathcal{D}_{\text{eff}}^{\beta}}

\mathcal{D}_{\text{eff}} = \underbrace{\mathcal{S}_{\lambda}(D_t; U_t)}_{\text{Monolingual}} + \underbrace{\sum_{i \in \mathcal{K}} \tau_i \, \mathcal{S}_{\lambda}(D_i; U_i)}_{\text{Transfer Languages}} + \underbrace{\tau_{\text{other}} \, \mathcal{S}_{\lambda}(D_{\text{other}}; U_{\text{other}})}_{\text{Other Languages}}

Each component uses a saturation function $\mathcal{S}_{\lambda}(D; U)$ to model the diminishing returns of data repetition beyond one epoch. This function is defined as:

\mathcal{S}_{\lambda}(D; U) = \begin{cases} D, & D \leq U \quad (\leq 1 \text{ epoch}) \\ U \left[1 + \frac{1 - \exp(-\lambda (D/U - 1))}{\lambda} \right], & D > U \quad (> 1 \text{ epoch}) \end{cases}

Experiment

Constructed a 30×30 (and 38×38) cross-lingual transfer matrix using Bilingual Transfer Score (BTS), revealing that English is the top source language for 19/30 targets, followed by French (16/30), Spanish (13/30), and Hebrew (11/30); low-resource languages like Urdu and Pashto show consistent negative transfer.
Found strong correlation between positive transfer and shared script (mean BTS: -0.23) or language family (p < 0.001), with script having larger effect size than family; transfer is symmetric only within shared script/family pairs (e.g., French-Spanish), but asymmetric otherwise (e.g., Chinese-Farsi).
Quantified the “curse of multilinguality”: adding languages (K) increases target loss, mitigated more effectively by scaling model size (N) than data (D); scaling law L(K,N,D) = L∞ + A·K^φ/N^α + B·K^ψ/D^β fitted with φ=0.11, ψ=-0.04 (R²≥0.87), confirming mild capacity penalty offset by positive transfer.
Derived compute-optimal scaling: to expand from K to r·K languages without loss degradation, scale model size by r^(φ/α) ≈ r^0.11/α and total tokens by r^(1+ψ/β) ≈ r^(1-0.04/β); e.g., 4× languages requires 1.4× N and 2.74× D_tot, enabling 32% less data per language via transfer.
For pretrain vs finetune: finetuning from Unimax checkpoint outperforms scratch pretraining under 144B–283B tokens; beyond that, scratch pretraining wins; compute threshold modeled as log(C) = 1.1M × N^1.65, enabling practitioners to choose based on budget and model size.
Validated BTS scalability: transfer scores improve with larger models and stabilize early in training; larger models mitigate negative transfer (e.g., en→zh), turning interference toward neutrality, while synergistic pairs (e.g., es→pt) show modest gains.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

ATLAS: 다국어 사전학습, 미세조정, 해독을 위한 적응형 전이 스케일링 법칙 – 다국어화의 저주를 넘어서

Shayne Longpre Sneha Kudugunta Niklas Muennighoff I-Hung Hsu Isaac Caswell Alex Pentland Sercan Arik Chen-Yu Lee Sayna Ebrahimi

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

ATLAS: 다국어 사전학습, 미세조정, 해독을 위한 적응형 전이 스케일링 법칙 – 다국어화의 저주를 넘어서

Shayne Longpre Sneha Kudugunta Niklas Muennighoff I-Hung Hsu Isaac Caswell Alex Pentland Sercan Arik Chen-Yu Lee Sayna Ebrahimi

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

ATLAS: 다국어 사전학습, 미세조정, 해독을 위한 적응형 전이 스케일링 법칙 – 다국어화의 저주를 넘어서

Shayne Longpre Sneha Kudugunta Niklas Muennighoff I-Hung Hsu Isaac Caswell Alex Pentland Sercan Arik Chen-Yu Lee Sayna Ebrahimi

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters