3달 전

Shaobo Wang Xuan Ouyang Tianyi Xu Yuzheng Hu Jialin Liu Guo Chen Tianyu Zhang Junhao Zheng Kexin Yang Xingzhang Ren

초록

고품질 공개 텍스트의 고갈이 점점 심화되면서 ‘데이터 벽(Data Wall)’이라는 현상이 나타나고 있으며, 이에 따라 사전 훈련(pre-training) 전략은 단순히 더 많은 토큰을 사용하는 방식에서 더 나은 토큰을 선택하는 방식으로 전환되고 있다. 그러나 기존의 방법들은 훈련 동역학을 고려하지 않는 히우리스틱적 정적 필터에 의존하거나, 원시적 기울기(raw gradients)에 기반한 동적이지만 최적화기(optimizer)에 무관한 기준을 사용하는 한계를 지닌다. 본 연구에서는 최적화기 유도(update) 공간에서 유효성(utility)을 정의하는 동적 데이터 선택 프레임워크인 OPUS(Optimizer-induced Projected Utility Selection)를 제안한다. OPUS는 현대적인 최적화기의 영향을 받은 효과적인 업데이트를, 안정적인 내분포(proxy)에서 유도된 목표 방향에 투영함으로써 후보 데이터의 점수를 산정한다. 확장성 확보를 위해 계산 효율성을 높이기 위해 CountSketch 기반의 Ghost 기법을 활용하고, 데이터 다양성을 유지하기 위해 Boltzmann 샘플링을 도입하였으며, 이로 인한 추가 계산 부담은 단 4.7%에 그친다. OPUS는 다양한 코퍼스, 품질 수준, 최적화기 및 모델 규모에서 뛰어난 성능을 보였다. FineWeb 및 FineWeb-Edu에서 총 300억 토큰 규모로 GPT-2 Large/XL을 사전 훈련할 경우, OPUS는 산업 수준의 기준 모델을 뛰어넘으며, 2000억 토큰 전체 훈련 결과조차도 초월하였다. 또한 산업 수준의 정적 필터와 결합할 경우, 저품질 데이터에서도 사전 훈련 효율이 더욱 향상되는 결과를 보였다. 더불어 SciencePedia를 기반으로 한 Qwen3-8B-Base 모델의 지속적 사전 훈련에서, OPUS는 30억 토큰 전체 훈련 대비 5억 토큰만으로도 우수한 성능을 달성하여 전문 분야에서 뛰어난 데이터 효율성의 가능성을 입증하였다.

One-sentence Summary

Researchers from EPIC Lab, Qwen Team, UW-Madison, UIUC, and Mila propose OPUS, a dynamic data selection method that projects optimizer-shaped updates onto stable directions, boosting efficiency 8× and accuracy 2.2% over random selection, especially effective for specialized domains like SciencePedia with minimal tokens.

Key Contributions

OPUS introduces an optimizer-aware utility metric for dynamic data selection, scoring tokens based on their projected impact in the actual update space of adaptive optimizers like AdamW and Muon, addressing the misalignment between gradient-based scoring and modern training dynamics.
The method employs a stable, in-distribution proxy (BENCH-PROXY) derived from the training corpus and scales efficiently via Ghost technique with CountSketch projections, adding only 4.7% compute overhead while preserving data diversity through Boltzmann sampling.
Empirically, OPUS outperforms industrial baselines in pre-training GPT-2 Large/XL on FineWeb datasets using 30B tokens (vs 200B) and achieves superior results in continued pre-training of Qwen3-8B-Base on SciencePedia with just 0.5B tokens (vs 3B), demonstrating significant data efficiency across model scales and domains.

Introduction

The authors leverage the growing scarcity of high-quality pre-training data to reframe data selection as an optimizer-aware, dynamic process rather than a static preprocessing step. Prior methods either rely on fixed quality heuristics that ignore model evolution or score samples in raw gradient space, misaligned with modern adaptive optimizers like AdamW and Muon that reshape update directions. OPUS introduces optimizer-induced utility—a principled, scalable framework that scores data based on how effectively each sample moves parameters under the actual optimizer geometry, using efficient projections and a stable in-distribution proxy. It further preserves diversity via Boltzmann sampling and achieves empirical gains over static filters and prior dynamic selectors across multiple LLMs and datasets.

Dataset

The authors construct BENCH-PROXY, a small proxy dataset sampled from the pre-training corpus to approximate the target benchmark’s distribution, enabling efficient gradient computation during training.
They compute benchmark relevance scores for each pre-training document using cosine similarity between embeddings from Arctic-Embed-L v2 (Yu et al., 2024a) — comparing each document to all benchmark validation samples and taking the max similarity per document.
The proxy set 𝒟_proxy is built by sorting documents by relevance score and greedily selecting top-scoring ones until reaching a 30M-token budget, ensuring compactness and distributional alignment.
During training, mini-batches are repeatedly sampled from 𝒟_proxy to estimate the gradient direction for within-step ranking, maintaining stable, low-variance scoring while steering toward benchmark-aligned data.

Method

The authors leverage OPUS, a dynamic data selection framework that operates within the optimizer-induced update geometry to prioritize training samples that maximally reduce validation loss under the actual trajectory of modern optimizers. Unlike prior methods that score candidates using raw gradients—implicitly assuming an SGD-like update space—OPUS explicitly accounts for the state-dependent preconditioning applied by optimizers such as AdamW and Muon. This is critical because modern optimizers reshape gradient directions via momentum, adaptive scaling, or matrix orthogonalization, thereby altering the effective update path. As shown in the figure below, OPUS aligns selection with the actual optimizer-induced path (green curve), avoiding the misalignment gap (red dashed arrow) that arises when using raw-gradient-based selection (blue curve) under non-SGD optimizers.

At each training step $t$ , OPUS receives a candidate buffer $\mathcal{B}_t$ and selects a subset $\widehat{\mathcal{B}}_t$ of size $K = \lfloor \rho N \rfloor$ to form the update batch. The selection is guided by a utility function derived from the expected reduction in validation loss after one optimizer step. Specifically, the marginal utility of adding a candidate $z$ to the current selected subset $\widehat{\mathcal{B}}_t$ is approximated as:

U_z^{(t)} \approx \eta_t \left\langle \mathbf{u}_z^{(t)}, \mathbf{g}_{\mathrm{proxy}}^{(t)} \right\rangle - \eta_t^2 \left\langle \mathbf{u}_z^{(t)}, \mathbf{G}^{(t)} \right\rangle,

where $\mathbf{u}_z^{(t)} = \mathbf{P}_t \nabla_\theta \mathcal{L}(z; \theta_t)$ is the optimizer-induced effective update for sample $z$ , $\mathbf{g}_{\mathrm{proxy}}^{(t)}$ is the proxy gradient estimated from a stable, in-distribution validation proxy pool $\mathcal{D}_{\mathrm{proxy}}$ , and $\mathbf{G}^{(t)} = \sum_{z_j \in \widehat{\mathcal{B}}_t} \mathbf{u}_{z_j}^{(t)}$ is the accumulated effective direction of already selected samples. The first term encourages alignment with the proxy target direction, while the second term penalizes redundancy by discouraging selection of samples whose updates are geometrically aligned with those already chosen.

To construct the proxy direction, OPUS employs BENCH-Proxy: a retrieval-based method that embeds both benchmark validation data and pre-training documents using a frozen text encoder, then selects the top- $M$ most similar pre-training documents to form $\mathcal{D}_{\mathrm{proxy}}$ . This ensures the proxy remains within the pre-training manifold while being aligned with downstream task distributions, yielding a stable and task-relevant gradient signal.

To scale this utility computation to LLM-sized models, OPUS avoids materializing full per-sample gradients by leveraging the ghost technique. For each linear layer $r$ , the per-sample gradient $\nabla_{\mathbf{W}_r} \mathcal{L}(z; \theta_t)$ is factorized into the outer product of input activation $\mathbf{a}_r^{(z)}$ and output gradient $\mathbf{b}_r^{(z)}$ . The optimizer-induced effective update $\mathbf{P}_{t,r} (\mathbf{a}_r^{(z)} \otimes \mathbf{b}_r^{(z)})$ is then projected into a low-dimensional space $\mathbb{R}^m$ using a CountSketch operator $\Pi_r$ , enabling efficient inner product computation without materializing the full gradient. For diagonal preconditioners (e.g., AdamW), this projection is interleaved with preconditioning, preserving computational efficiency at $\mathcal{O}(d_{\text{in}} + d_{\text{out}})$ per layer. For dense preconditioners (e.g., Muon), the cost increases to $\mathcal{O}(d_{\text{in}} d_{\text{out}})$ , but remains tractable due to the sketch dimension $m \ll d$ .

Finally, to maintain data diversity and avoid overfitting to transient proxy noise, OPUS replaces deterministic greedy selection with Boltzmann sampling. Each candidate $z$ is sampled with probability proportional to $\exp(U_z^{(t)} / \tau)$ , where $\tau > 0$ is a temperature hyperparameter. This stochastic selection ensures high-utility samples are favored while preserving non-zero probability for complementary candidates, enhancing robustness to estimation noise and non-stationarity in the data stream.

Refer to the framework diagram for a complete overview of the OPUS pipeline, which integrates proxy construction, efficient gradient projection, iterative utility estimation, and diversity-preserving sampling within a single training loop.

The entire process is executed iteratively: at each step, OPUS computes the optimizer-induced preconditioner $\mathbf{P}_t$ , generates per-layer sketches for both proxy and candidate samples, estimates marginal utilities in the projected space, samples the next batch via Boltzmann distribution, and updates the model using the selected subset. This ensures that every training step is informed by the optimizer’s actual geometry, the proxy’s task-relevant direction, and the diversity of the selected data.

Experiment

OPUS significantly improves pre-training efficiency, achieving 2.2% average accuracy gain across 10 benchmarks and 8x compute reduction versus random selection on GPT-XL using FineWeb.
It outperforms static and dynamic baselines even when selecting from lower-quality data (FineWeb-Edu score 3), matching or exceeding methods trained on higher-quality data (scores 4–5).
Performance gains hold under both AdamW and Muon optimizers, validating that aligning data selection with preconditioned update trajectories enhances training signal quality.
OPUS generalizes beyond proxy-aligned benchmarks, showing superior performance on out-of-distribution reasoning and comprehension tasks.
In continued pre-training on SciencePedia, OPUS reaches top performance with just 0.5B tokens—6x more data-efficient than random selection trained on 3B tokens—while improving across scientific domains.
Ablations confirm that stochastic sampling and benchmark-matched proxies are critical; greedy selection and default proxies underperform.
OPUS maintains minimal computational overhead (4.7% slowdown) via CountSketch projections, outperforming static methods that incur higher selection costs.
Qualitatively, OPUS selects more diverse, broadly useful samples compared to static methods that over-concentrate on narrow or high-loss patterns.

The authors use OPUS to dynamically select training data during pre-training, aligning selection with optimizer-specific update directions. Results show OPUS consistently outperforms both static filtering and other dynamic methods across multiple model sizes and optimizers, achieving higher average benchmark scores while maintaining computational efficiency. The method also demonstrates strong generalization and faster convergence, often matching or exceeding models trained with twice the compute budget.

The authors evaluate OPUS under varying hyperparameters including buffer size, sampling temperature, and projection dimension, finding that larger buffers and moderate temperatures yield the best performance, while the 8192-dimensional projection consistently delivers optimal results. Results show that OPUS consistently outperforms random selection across all configurations, confirming its robustness to hyperparameter changes. The method’s effectiveness is maintained even when adjusting key components, indicating stable and reliable gains in model performance.

The authors use OPUS to dynamically select training data aligned with optimizer-specific update directions, achieving higher average benchmark performance than random, greedy, or standard proxy-based selection. Results show that incorporating stochastic sampling and benchmark-matched proxies improves generalization beyond narrow optimization signals. OPUS consistently outperforms baselines across diverse reasoning and knowledge tasks under the same compute budget.

The authors use OPUS to dynamically select training data during pre-training, aligning selection with optimizer-specific update directions. Results show OPUS consistently outperforms static and dynamic baselines across model sizes and datasets, even when selecting from lower-quality data subsets. The method achieves stronger generalization and faster convergence while maintaining minimal computational overhead.

The authors use OPUS to dynamically select training data during pre-training, aligning selection with the optimizer’s update direction. Results show OPUS consistently outperforms static and dynamic baselines—even when selecting from lower-quality data—while achieving faster convergence and better generalization across benchmarks. The method also maintains efficiency, adding minimal computational overhead compared to random sampling.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

3달 전

Shaobo Wang Xuan Ouyang Tianyi Xu Yuzheng Hu Jialin Liu Guo Chen Tianyu Zhang Junhao Zheng Kexin Yang Xingzhang Ren

초록

One-sentence Summary

Key Contributions

OPUS introduces an optimizer-aware utility metric for dynamic data selection, scoring tokens based on their projected impact in the actual update space of adaptive optimizers like AdamW and Muon, addressing the misalignment between gradient-based scoring and modern training dynamics.
The method employs a stable, in-distribution proxy (BENCH-PROXY) derived from the training corpus and scales efficiently via Ghost technique with CountSketch projections, adding only 4.7% compute overhead while preserving data diversity through Boltzmann sampling.
Empirically, OPUS outperforms industrial baselines in pre-training GPT-2 Large/XL on FineWeb datasets using 30B tokens (vs 200B) and achieves superior results in continued pre-training of Qwen3-8B-Base on SciencePedia with just 0.5B tokens (vs 3B), demonstrating significant data efficiency across model scales and domains.

Introduction

Dataset

The authors construct BENCH-PROXY, a small proxy dataset sampled from the pre-training corpus to approximate the target benchmark’s distribution, enabling efficient gradient computation during training.
They compute benchmark relevance scores for each pre-training document using cosine similarity between embeddings from Arctic-Embed-L v2 (Yu et al., 2024a) — comparing each document to all benchmark validation samples and taking the max similarity per document.
The proxy set 𝒟_proxy is built by sorting documents by relevance score and greedily selecting top-scoring ones until reaching a 30M-token budget, ensuring compactness and distributional alignment.
During training, mini-batches are repeatedly sampled from 𝒟_proxy to estimate the gradient direction for within-step ranking, maintaining stable, low-variance scoring while steering toward benchmark-aligned data.

Method

U_z^{(t)} \approx \eta_t \left\langle \mathbf{u}_z^{(t)}, \mathbf{g}_{\mathrm{proxy}}^{(t)} \right\rangle - \eta_t^2 \left\langle \mathbf{u}_z^{(t)}, \mathbf{G}^{(t)} \right\rangle,

Experiment

OPUS significantly improves pre-training efficiency, achieving 2.2% average accuracy gain across 10 benchmarks and 8x compute reduction versus random selection on GPT-XL using FineWeb.
It outperforms static and dynamic baselines even when selecting from lower-quality data (FineWeb-Edu score 3), matching or exceeding methods trained on higher-quality data (scores 4–5).
Performance gains hold under both AdamW and Muon optimizers, validating that aligning data selection with preconditioned update trajectories enhances training signal quality.
OPUS generalizes beyond proxy-aligned benchmarks, showing superior performance on out-of-distribution reasoning and comprehension tasks.
In continued pre-training on SciencePedia, OPUS reaches top performance with just 0.5B tokens—6x more data-efficient than random selection trained on 3B tokens—while improving across scientific domains.
Ablations confirm that stochastic sampling and benchmark-matched proxies are critical; greedy selection and default proxies underperform.
OPUS maintains minimal computational overhead (4.7% slowdown) via CountSketch projections, outperforming static methods that incur higher selection costs.
Qualitatively, OPUS selects more diverse, broadly useful samples compared to static methods that over-concentrate on narrow or high-loss patterns.

The authors use OPUS to dynamically select training data during pre-training, aligning selection with optimizer-specific update directions. Results show OPUS consistently outperforms static and dynamic baselines across model sizes and datasets, even when selecting from lower-quality data subsets. The method achieves stronger generalization and faster convergence while maintaining minimal computational overhead.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

OPUS: 대규모 언어 모델 사전 훈련에서 각 반복 단계마다 효율적이고 원칙적인 데이터 선택을 위한 방향성

Shaobo Wang Xuan Ouyang Tianyi Xu Yuzheng Hu Jialin Liu Guo Chen Tianyu Zhang Junhao Zheng Kexin Yang Xingzhang Ren2 more

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

OPUS: 대규모 언어 모델 사전 훈련에서 각 반복 단계마다 효율적이고 원칙적인 데이터 선택을 위한 방향성

Shaobo Wang Xuan Ouyang Tianyi Xu Yuzheng Hu Jialin Liu Guo Chen Tianyu Zhang Junhao Zheng Kexin Yang Xingzhang Ren2 more

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

OPUS: 대규모 언어 모델 사전 훈련에서 각 반복 단계마다 효율적이고 원칙적인 데이터 선택을 위한 방향성

Shaobo Wang Xuan Ouyang Tianyi Xu Yuzheng Hu Jialin Liu Guo Chen Tianyu Zhang Junhao Zheng Kexin Yang Xingzhang Ren2 more

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Shaobo Wang Xuan Ouyang Tianyi Xu Yuzheng Hu Jialin Liu Guo Chen Tianyu Zhang Junhao Zheng Kexin Yang Xingzhang Ren

Shaobo Wang Xuan Ouyang Tianyi Xu Yuzheng Hu Jialin Liu Guo Chen Tianyu Zhang Junhao Zheng Kexin Yang Xingzhang Ren

Shaobo Wang Xuan Ouyang Tianyi Xu Yuzheng Hu Jialin Liu Guo Chen Tianyu Zhang Junhao Zheng Kexin Yang Xingzhang Ren