Alchemist: Unleashing the Efficiency of Text-to-Image Model Training through Meta-Gradient Data Selection
Kaixin Ding Yang Zhou Xi Chen Miao Yang Jiarong Ou Rui Chen Xin Tao Hengshuang Zhao
Abstract
Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have brought remarkable progress in visual quality. However, their performance is fundamentally limited by the quality of the training data. Web-crawled or synthetically generated image datasets often contain low-quality or redundant samples, which degrade visual fidelity, destabilize training, and waste computation. Effective data selection is therefore crucial for improving data efficiency. Existing approaches to filtering text-image data rely on costly manual curation or on heuristic scores derived from single-dimensional features. While meta-learning-based methods have been explored for large language models (LLMs), no adaptation has been proposed for visual modalities. To this end, we propose Alchemist, a meta-gradient-based framework for selecting a valuable subset from large-scale text-image pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence from gradient information, enhanced by multi-granularity perception. We then apply the Shift-Gsample strategy to select informative subsets, enabling efficient model training. Alchemist is the first automated, scalable, meta-gradient-based framework for data selection in Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on 50% of the data selected by Alchemist outperforms training on the full dataset.
One-sentence Summary
Researchers from The University of Hong Kong, South China University of Technology, and Kuaishou Technology's Kling Team propose Alchemist, a meta-gradient-based framework for efficient Text-to-Image training that automatically selects high-impact data subsets. Unlike prior heuristic or manual methods, it employs a gradient-informed rater with multi-granularity perception and optimized sampling to identify informative samples, enabling models trained on just 50% of Alchemist-selected data to surpass full-dataset performance in visual fidelity and efficiency.
Key Contributions
- Text-to-Image models like Stable Diffusion face performance bottlenecks due to low-quality or redundant samples in web-crawled training data, which degrade visual fidelity and cause unstable training; existing data selection methods rely on costly manual curation or single-dimensional heuristics that fail to optimize for downstream model performance.
- Alchemist introduces a meta-gradient-based framework that automatically rates data samples using gradient-informed multi-granularity perception and employs a shift-Gaussian sampling strategy to prioritize mid-to-late scored samples, which exhibit more informative gradient dynamics and avoid overfitting from top-ranked plain samples.
- Validated on synthetic and web-crawled datasets, Alchemist-selected subsets (e.g., 50% of data) consistently outperform full-dataset training in visual quality and model performance, with empirical evidence showing optimal data lies in mid-to-late score ranges that balance learnability and diversity.
Introduction
The authors address data selection for text-to-image (T2I) model training, where efficiently identifying high-quality text-image pairs from large datasets is critical for reducing computational costs and improving model performance. Prior approaches typically use Top-K pruning—retaining only the highest-rated samples—but this often causes rapid overfitting due to uninformative, low-gradient samples in the top tier, while ignoring more dynamically valuable mid-to-late range data. The authors demonstrate that top-ranked samples exhibit minimal gradient changes during training, contributing little to learning, whereas mid-to-late range samples drive effective model updates but are discarded by conventional methods. Their key contribution is the pruning-based shift-Gaussian sampling (Shift-Gsample) strategy: it first discards the top n% of samples to avoid overfitting, then applies Gaussian sampling centered in the mid-to-late percentile range to balance data informativeness and diversity. This approach selectively retains detailed yet learnable samples, filters out plain or chaotic data, and achieves superior performance by aligning with human intuition for robust T2I training.
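As a rough illustration of this strategy, the NumPy sketch below implements a shift-Gaussian sampler of the kind described: it ranks samples by rater score, discards the top fraction, and draws the retained subset from a Gaussian over percentile ranks. The function name and the parameter values (`drop_top`, `mu`, `sigma`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def shift_gsample(scores, keep_ratio=0.5, drop_top=0.1, mu=0.6, sigma=0.2, seed=0):
    """Illustrative shift-Gaussian sampling over rater scores.

    scores:     per-sample ratings from the rater (higher = "better").
    keep_ratio: fraction of the dataset to retain.
    drop_top:   fraction of top-ranked samples discarded to avoid overfitting.
    mu, sigma:  center/width of the Gaussian over percentile ranks, placed
                in the mid-to-late region of the score distribution.
    """
    rng = np.random.default_rng(seed)
    n = len(scores)
    order = np.argsort(-scores)            # indices sorted by descending score
    percentile = np.arange(n) / n          # 0.0 = top-rated, 1.0 = lowest-rated

    # Step 1: discard the top drop_top fraction (plain, low-gradient samples).
    eligible = percentile >= drop_top

    # Step 2: Gaussian weighting centered in the mid-to-late percentile range.
    weights = np.exp(-0.5 * ((percentile - mu) / sigma) ** 2) * eligible
    weights /= weights.sum()

    k = int(keep_ratio * n)
    chosen_ranks = rng.choice(n, size=k, replace=False, p=weights)
    return order[chosen_ranks]             # original dataset indices to keep

# Example: keep 50% of 1M rated samples.
# idx = shift_gsample(np.random.rand(1_000_000), keep_ratio=0.5)
```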
Method
The authors leverage a meta-gradient-based framework called Alchemist to enable data-efficient training of Text-to-Image (T2I) models by automatically selecting high-value subsets from large-scale text-image pairs. The overall pipeline consists of two principal stages: data rating and data pruning, which together form a scalable, model-aware data curation system. Refer to the framework diagram for a high-level overview of the workflow.

In the data rating stage, a lightweight rater network parameterized by $\mu$ is trained to assign a continuous weight $W_{x_i}(\mu) \in [0, 1]$ to each training sample $x_i$. This weight reflects the sample's influence on the downstream model's validation performance. The rater is optimized via a bilevel formulation: the inner loop updates the proxy T2I model $\theta$ using a weighted loss over the training set, while the outer loop adjusts $\mu$ to minimize the validation loss. To avoid the computational burden of full inner-loop optimization, the authors adopt a meta-gradient approximation. During training, a reference proxy model $\hat{\theta}$ is warmed up using standard training data, while the primary model $\theta$ is updated using a combination of validation gradients and weighted training gradients:
$$\theta_{k+1} = \theta_k - \beta_k \big( g_{\mathrm{val}}(\theta_k) + g_{\mathrm{train}}(\theta_k, \mu_k) \big), \qquad g_{\mathrm{train}}(\theta_k, \mu_k) = \sum_{x_i \in D_{\mathrm{train}}} W_{x_i}(\mu_k) \, \nabla_\theta \mathcal{L}(\theta_k; x_i).$$

The rater's parameters are then updated using an approximate gradient derived from the difference in loss between the primary and reference models:

$$\mu_{k+1} = \mu_k - \alpha_k \, \mathcal{L}(\theta_k; x_i) \, \nabla_\mu W_{x_i}(\mu_k).$$
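The following is a minimal PyTorch sketch of one such meta-step under simplifying assumptions: `loss_fn`, `feats`, and the optimizer objects are illustrative placeholders, and the warmed-up reference model $\hat{\theta}$ is omitted for brevity, so the per-sample loss serves directly as the scalar meta-signal in the rater update.

```python
import torch

def meta_step(model, rater, feats, x_train, x_val, loss_fn,
              opt_model, opt_rater):
    """One simplified Alchemist-style meta-step (sketch, not the exact algorithm).

    model:   primary proxy T2I model (theta); rater: weighting network (mu).
    feats:   per-sample features fed to the rater; x_train/x_val: data batches.
    loss_fn: assumed helper returning a scalar loss for (model, sample).
    """
    # Per-batch softmax-normalized weights W_{x_i}(mu) (see normalization below).
    w = torch.softmax(rater(feats), dim=0)

    # Inner update: theta_{k+1} = theta_k - beta_k * (g_val + g_train).
    per_sample = torch.stack([loss_fn(model, xi) for xi in x_train])
    train_loss = (w.detach() * per_sample).sum()   # weights frozen for theta step
    val_loss = loss_fn(model, x_val)
    opt_model.zero_grad()
    (val_loss + train_loss).backward()
    opt_model.step()

    # Outer update: mu_{k+1} = mu_k - alpha_k * L(theta_k; x_i) * grad_mu W_{x_i}.
    coeff = per_sample.detach()                    # loss as a scalar meta-signal
    opt_rater.zero_grad()
    (coeff * w).sum().backward()                   # grads: sum_i coeff_i * dW_i/dmu
    opt_rater.step()
```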
To stabilize training, weights are normalized per batch via a softmax:

$$W_{x_i} = \frac{\exp(\hat{W}_{x_i})}{\sum_j \exp(\hat{W}_{x_j})}.$$

To account for batch-level variability and enhance robustness, the rater incorporates multi-granularity perception. It includes two parallel MLP modules: an Instance MLP that processes individual sample features, and a Group MLP that computes a batch-level weight from pooled statistics (mean and variance) of the batch. The final weight for each sample is the product of its instance weight and batch weight, enabling the rater to capture both local distinctiveness and global context.
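A compact sketch of such a two-branch rater is shown below; the layer sizes, the sigmoid squashing of each branch, and the feature dimension are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiGranularityRater(nn.Module):
    """Sketch of a two-branch (instance + group) rater; dims are assumptions."""

    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        # Instance MLP: scores each sample from its own features.
        self.instance_mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Group MLP: scores the whole batch from pooled statistics
        # (per-dimension mean and variance -> 2 * feat_dim inputs).
        self.group_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feats):                      # feats: (B, feat_dim), B > 1
        inst_w = torch.sigmoid(self.instance_mlp(feats).squeeze(-1))  # (B,)
        stats = torch.cat([feats.mean(dim=0), feats.var(dim=0)])      # (2*feat_dim,)
        group_w = torch.sigmoid(self.group_mlp(stats))                # scalar weight
        raw = inst_w * group_w          # product of instance and batch weights
        return torch.softmax(raw, dim=0)  # per-batch softmax normalization

# Example: weights = MultiGranularityRater(512)(torch.randn(32, 512))
```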
In the data pruning stage, the authors introduce the Shift-Gsample strategy to select a subset of the rated data. This strategy prioritizes samples from the middle-to-late region of the rating distribution, those that are neither too easy (low gradient impact) nor too hard (outliers or noise), yet remain sufficiently informative and learnable. As shown in the figure below, this approach yields lower downstream FID than random sampling, top-K selection, and block-based methods at every retained sample count.

The selected dataset is then used to train the target T2I model, achieving comparable or superior performance with significantly fewer training samples—often as little as 50% of the original corpus—while accelerating convergence and improving visual fidelity.
Experiment
- Alchemist data selection: 50% subset matched full dataset performance on MJHQ-30K and GenEval benchmarks, surpassing random sampling
- 20% Alchemist-selected data matched 50% random data performance, demonstrating significant data efficiency gains
- Achieved 2.33× faster training at 20% retention and 5× faster at 50% retention while matching random sampling results
- Consistently outperformed baselines across STAR (from-scratch) and FLUX-mini (LoRA fine-tuning) models
- Generalized to HPDv3-2M and Flux-reason-6M datasets, surpassing random selection at 20% and 50% retention rates
The authors use a Shift-Gsample pruning strategy with a Group-MLP to select informative data, achieving the lowest FID and highest CLIP-Score among compared methods on 6M image-text pairs. Results show that incorporating group-level information further improves performance over sample-level selection alone.

The authors use Alchemist to select subsets of HPDv3-2M and Flux-reason-6M datasets, achieving lower FID and higher CLIP-Score than random sampling at both 20% and 50% retention. Results show that even with half the data, Alchemist-selected subsets outperform randomly sampled ones, confirming its effectiveness across diverse data domains.

The authors use Alchemist to select a 50% subset of the LAION dataset, achieving better FID and CLIP-Score than full-dataset training under a matched training-time budget. Results show that even a smaller 20% subset (Ours-small), trained in less than half the time, still outperforms several heuristic-based selection methods on GenEval. Alchemist's selected data consistently improves efficiency and performance compared to random sampling and selection based on other image-quality metrics.

The authors use Alchemist to select training data for STAR and FLUX-mini models, showing consistent performance gains over random sampling across model scales and data sizes. Results show that using 6M Alchemist-selected images improves FID and CLIP-Score compared to both smaller and larger random subsets, and similar gains hold for FLUX-mini with 3B parameters. The method demonstrates scalability, as larger models and different architectures benefit from the same selected data without additional rater training.
