
DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Abstract

Current unified multimodal models for image generation and editing typically rely on massive parameter counts (e.g., >10B), incurring prohibitive training costs and heavy deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B-parameter unified model that achieves comprehensive capabilities competitive with, or superior to, much larger models. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple layers of a VLM (Vision-Language Model) and fuses them with learnable "think tokens", providing the generative core with structured, reasoning-rich guidance. We also design a data-centric training strategy organized into three progressive stages: (1) alignment pre-training on large-scale image-text pairs and editing triplets to synchronize the representations of the VLM and the DiT (Diffusion Transformer); (2) joint supervised fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to build omni capabilities; (3) reinforcement learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, yielding substantial gains in generation quality and alignment with human preferences while maintaining stable training and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves state-of-the-art performance across diverse benchmarks, surpassing HunyuanImage (80B) by 28% on WISE and Qwen-Image-Edit (27B) by 37% on UniREditBench. By releasing our training code, weights, and datasets, we offer an efficient and performant alternative for democratizing unified multimodal research.

One-sentence Summary

Researchers from Shanghai Innovation Institute, Fudan, USTC, and others propose DeepGen 1.0, a 5B lightweight multimodal model using Stacked Channel Bridging and staged training to outperform larger models in generation and editing, achieving state-of-the-art results while democratizing access via open-sourced code and datasets.

Key Contributions

  • DeepGen 1.0 introduces a lightweight 5B-parameter unified model for image generation and editing, challenging the assumption that large-scale models (>10B) are necessary for high performance, and achieves state-of-the-art results despite training on only ~50M samples.
  • It proposes Stacked Channel Bridging (SCB), a novel alignment framework that fuses hierarchical VLM features with learnable “think tokens” to deliver structured, reasoning-rich guidance to the DiT backbone, enhancing semantic understanding and fine-grained control without increasing model size.
  • Through a three-stage data-centric training pipeline—including alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO—DeepGen 1.0 outperforms larger models like 80B HunyuanImage (by 28% on WISE) and 27B Qwen-Image-Edit (by 37% on UniREditBench), while avoiding visual artifacts and maintaining training stability.

Introduction

The authors leverage a unified VLM-DiT architecture to build DeepGen 1.0, a 5B-parameter model that handles image generation, editing, and reasoning tasks—challenging the assumption that only massive models (10B+) can deliver high-quality, semantically accurate visual outputs. Prior work relied on expensive, multi-model setups or large-scale training data, while smaller models failed to match performance due to weak cross-modal alignment and limited reasoning support. DeepGen 1.0 overcomes these limits with Stacked Channel Bridging (SCB), which fuses multi-layer VLM features and learnable “think tokens” to guide the DiT with structured, hierarchical semantics, plus a three-stage training pipeline that emphasizes data efficiency and human preference alignment via MR-GRPO. The result is a compact model that outperforms much larger counterparts—including 80B and 27B baselines—on reasoning and editing benchmarks, while being trained on just 50M samples and fully open-sourced for broader adoption.

Dataset

The authors use a diverse, multi-source training dataset combining real-world, synthetic, and curated open-source data to support general generation, editing, reasoning, text rendering, and application-specific tasks.

  • General Generation:

    • Sources: Text-to-image-2M, LAION-Aesthetic-6M, Megalith-10M, RedCaps-5M, CC-12M.
    • Instruction fine-tuning: BLIP-3o (60k), ShareGPT-4o-Image (45k), Echo-4o-Image (100k), OpenGPT4o-Image (40k), plus 10M in-house samples (3:1 long-to-short prompt ratio).
    • Augmented with ~50k synthetic photorealistic images via Nano Banana, paired with fine-grained prompts in both Chinese and English.
  • General Editing:

    • Sources: NHR-Edit (720k), GPT-Image-Edit (1.5M), ShareGPT-4o-Image-Edit (50k), OpenGPT4o-Image-Edit (40k), Nano-banana-consist (150k), Pico-Banana (250k), X2I2 (1.6M), Uniworld-Edit (1.2M), plus 1.1M in-house editing samples (Chinese and English).
  • Reasoning-based Generation and Editing:

    • Sources: UniReason (150k generation, 100k editing samples) covering cultural commonsense, natural science, spatial, temporal, and logical reasoning.
  • Text Rendering & Application Scenarios:

    • Sources: Multimodal QA datasets for captions; Gemini 2.5 Pro generates diverse rendering attributes (font, layout, color); Qwen-Image synthesizes 500k text-rendering images.
    • Extended with 60k application samples (e.g., Chinese poetry, poster design).
  • Processing & Usage:

    • Datasets are mixed per training stage as detailed in Appendix A, Table 8.
    • No explicit cropping strategy mentioned; metadata is constructed via prompt engineering and synthetic image generation pipelines.
    • All subsets are aligned to support multilingual (Chinese/English) and multimodal instruction following.

Method

The authors leverage a VLM-DiT architecture to unify multimodal understanding with high-fidelity image generation, as illustrated in the framework diagram. The system begins with a pretrained vision-language model (VLM), specifically Qwen-2.5-VL (3B), which processes interleaved visual and textual inputs—comprising system prompts, reference images, and user instructions—through its transformer blocks. To enhance reasoning, a fixed set of learnable “think tokens” is injected into the input sequence, enabling implicit Chain-of-Thought behavior across all VLM layers via self-attention. These tokens progressively summarize hidden representations, enriching the model’s ability to extract and retain knowledge.
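As a rough sketch of this mechanism, the snippet below appends a fixed number of learnable think tokens to the VLM input embeddings so they can attend to the multimodal context in every self-attention layer; the token count, hidden width, and placement at the end of the sequence are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class ThinkTokenInjector(nn.Module):
    """Appends a fixed set of learnable 'think tokens' to the VLM input
    embeddings so they can attend to (and progressively summarize) the
    multimodal context through every self-attention layer.
    Sizes are illustrative, not the paper's exact configuration."""

    def __init__(self, num_think_tokens: int = 64, hidden_size: int = 2048):
        super().__init__()
        # One shared, learnable embedding per think token (assumed shape).
        self.think_tokens = nn.Parameter(
            torch.randn(num_think_tokens, hidden_size) * 0.02
        )

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_size) produced by the VLM's
        # text/vision embedding layers. Think tokens are concatenated at the end.
        batch = input_embeds.size(0)
        think = self.think_tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([input_embeds, think], dim=1)

# Usage: embeds = ThinkTokenInjector()(vlm_input_embeds)
```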

Rather than relying on a single-layer VLM output, the authors introduce the Stacked Channel Bridging (SCB) framework to aggregate features from multiple layers. Six hidden states are uniformly sampled across low-, mid-, and high-level VLM blocks, preserving both fine-grained visual details and semantic abstractions. These selected states, including the think token representations, are stacked along the channel dimension and projected via a lightweight MLP to match the DiT’s input width. A Transformer encoder then fuses these multi-layer features into a robust conditional input $c \in \mathbb{R}^{L \times d_{\text{DiT}}}$, formalized as:

$$c = \operatorname{Encoder}\big(\operatorname{MLP}(\operatorname{Concat}_{\mathrm{ch}}(x_1, \ldots, x_n))\big).$$
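A minimal sketch of this fusion, assuming six hidden states of equal width; the layer widths, encoder depth, and head count are illustrative placeholders rather than the released architecture.

```python
import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    """Fuses hidden states from several VLM layers (including think-token
    positions) into a single DiT conditioning sequence, following
    c = Encoder(MLP(Concat_ch(x_1, ..., x_n))). Dimensions are assumed."""

    def __init__(self, n_layers: int = 6, d_vlm: int = 2048,
                 d_dit: int = 1536, n_encoder_layers: int = 6):
        super().__init__()
        # Channel-wise concat of n_layers states -> project to the DiT width.
        self.mlp = nn.Sequential(
            nn.Linear(n_layers * d_vlm, d_dit),
            nn.GELU(),
            nn.Linear(d_dit, d_dit),
        )
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_dit, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_encoder_layers)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: list of (batch, L, d_vlm) tensors sampled uniformly
        # from low-, mid-, and high-level VLM blocks.
        x = torch.cat(hidden_states, dim=-1)   # (batch, L, n_layers * d_vlm)
        c = self.mlp(x)                        # (batch, L, d_dit)
        return self.encoder(c)                 # conditional input c
```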

This conditional signal is fed into the DiT decoder—initialized from SD3.5-Medium (2B)—which generates images through a sequence of DiT blocks conditioned on both the multimodal context and a noisy latent input. The DiT is further guided by a VAE encoder for latent space alignment and a noisy refiner for iterative refinement. The entire pipeline is connected via a streamlined connector module based on SigLIP and six transformer layers, maintaining a compact 5B parameter footprint.
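To make the data flow concrete, here is a hypothetical inference loop wiring these components together; the module interfaces (`dit(latent, t, context=c)`, `vae.decode`, `vae.latent_channels`) and the Euler flow-matching update are assumptions, not the released pipeline.

```python
import torch

def generate(vlm, bridge, dit, vae, prompt_inputs, num_steps: int = 50):
    """Illustrative inference wiring: the VLM plus SCB bridge produce the
    condition c, and the DiT iteratively denoises a latent that the VAE
    decodes into an image. Interfaces and the Euler update are assumed."""
    hidden_states = vlm(prompt_inputs)        # multi-layer VLM hidden states
    c = bridge(hidden_states)                 # SCB condition, shape (B, L, d_dit)

    latent = torch.randn(1, vae.latent_channels, 64, 64)   # noisy latent
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        v = dit(latent, t, context=c)         # predicted velocity
        latent = latent + (t_next - t) * v    # Euler step toward the data
    return vae.decode(latent)
```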

Following the Stage 1 alignment pre-training, training proceeds in two further stages. In Stage 2, the authors perform joint supervised fine-tuning over 400,000 iterations on a diverse multi-task dataset encompassing general text-to-image generation, reasoning-based generation, text rendering, and image editing. To preserve the VLM’s pretrained capabilities, they apply LoRA for parameter-efficient adaptation. Images are processed at 512×512 resolution with dynamic aspect ratio preservation, and optimization uses a learning rate of $5 \times 10^{-5}$ with 20,000 warm-up steps.
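A hedged sketch of this parameter-efficient setup using the Hugging Face `peft` library; the LoRA rank, alpha, and target modules are assumed, since the text only states that LoRA is applied to the VLM.

```python
import torch
from peft import LoraConfig, get_peft_model

def build_stage2_optim(vlm, dit):
    """Hypothetical Stage-2 setup: LoRA on the VLM plus AdamW with a linear
    warm-up. Rank/alpha/target modules are illustrative; only the 5e-5
    learning rate and the 20,000 warm-up steps come from the text."""
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    vlm = get_peft_model(vlm, lora_config)  # adapt the VLM without full tuning

    optimizer = torch.optim.AdamW(
        list(vlm.parameters()) + list(dit.parameters()), lr=5e-5
    )
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1e-3, end_factor=1.0, total_iters=20_000
    )  # 20,000 warm-up steps before the constant-rate phase
    return vlm, optimizer, warmup
```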

In Stage 3, reinforcement learning is applied via the MR-GRPO framework to align outputs with human preferences. The model samples a group of $G = 8$ images per prompt using a noise-preserving stochastic sampler that maintains scheduler-consistent noise levels, ensuring stable reward signals. Each generated image is evaluated by three complementary reward functions: a VLM-based pairwise preference model for semantic and visual quality, an OCR reward for text rendering accuracy, and a CLIP similarity score for overall alignment. Rewards are normalized per group and aggregated with category-specific weights—prioritizing OCR for text-rendering prompts and preference rewards for general generation.
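The per-group normalization and category-dependent weighting might look roughly as follows; the weight values are placeholders chosen only to reflect the stated priorities (OCR dominates for text-rendering prompts, the preference reward dominates for general generation).

```python
import torch

def aggregate_rewards(pref: torch.Tensor, ocr: torch.Tensor,
                      clip: torch.Tensor, category: str) -> torch.Tensor:
    """Combine the three reward signals for one prompt's group of G samples.
    Each input has shape (G,). The weights below are illustrative."""
    def group_normalize(r: torch.Tensor) -> torch.Tensor:
        # Normalize each reward within the group so scales are comparable.
        return (r - r.mean()) / (r.std() + 1e-6)

    rewards = {
        "pref": group_normalize(pref),
        "ocr": group_normalize(ocr),
        "clip": group_normalize(clip),
    }
    if category == "text_rendering":
        weights = {"pref": 0.3, "ocr": 0.6, "clip": 0.1}   # OCR prioritized
    else:  # general generation
        weights = {"pref": 0.7, "ocr": 0.0, "clip": 0.3}   # preference prioritized
    return sum(w * rewards[k] for k, w in weights.items())

# advantages = aggregate_rewards(pref, ocr, clip, "text_rendering")  # shape (G,)
```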

The policy optimization objective combines GRPO with an auxiliary supervised diffusion loss to prevent capability degradation:

$$\mathcal{L}_{\mathrm{total}} = (1 - \lambda)\,\mathcal{L}_{\mathrm{GRPO}} + \lambda\,\mathcal{L}_{\mathrm{SFT}},$$

where $\mathcal{L}_{\mathrm{GRPO}}$ includes clipped advantage terms and KL regularization computed in velocity space:

$$D_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}) = \big\| \hat{v}_{\theta}(x_t, t) - \hat{v}_{\mathrm{ref}}(x_t, t) \big\|^2.$$

Training runs for 1,500 steps with a learning rate of $2 \times 10^{-6}$, using 50 denoising steps per sample and batch-wise advantage normalization to preserve multi-reward granularity.
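Putting the objective together, a minimal sketch assuming trajectory log-likelihoods are supplied by the stochastic sampler and placeholder values for λ, the clipping range, and the KL weight; the paper's exact surrogate may differ.

```python
import torch

def mr_grpo_loss(logp_new, logp_old, advantages,
                 v_theta, v_ref, v_sft_pred, v_sft_target,
                 lam: float = 0.2, clip_eps: float = 0.2, kl_coef: float = 0.01):
    """Illustrative MR-GRPO objective: a clipped-advantage policy term,
    velocity-space KL to the frozen reference model, and an auxiliary SFT
    flow-matching loss on separate supervised data. logp_* are per-sample
    log-likelihoods of the sampled denoising trajectories (shape (G,));
    advantages has shape (G,). lam, clip_eps, kl_coef are assumed values."""
    # Clipped policy-gradient surrogate (PPO/GRPO style).
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # KL regularization computed in velocity space (see the equation above).
    kl = ((v_theta - v_ref) ** 2).mean()

    # Auxiliary supervised diffusion loss to prevent capability drift.
    sft_loss = ((v_sft_pred - v_sft_target) ** 2).mean()

    grpo_loss = policy_loss + kl_coef * kl
    return (1 - lam) * grpo_loss + lam * sft_loss
```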

Experiment

  • Alignment pre-training successfully bridges VLM and DiT using only connector and think tokens, enabling foundational text-to-image and editing capabilities without full model tuning.
  • DeepGen 1.0 excels across general generation and editing benchmarks, matching or surpassing larger models despite its compact 5B parameter size, demonstrating strong semantic alignment and instruction following.
  • The model achieves top-tier reasoning performance on world-knowledge tasks, outperforming open-source peers and narrowing gaps with closed-source systems across cultural, scientific, and logical domains.
  • Reasoning-based editing shows robustness in real and game-world scenarios, leading in key benchmarks and exceeding some closed-source models in overall performance.
  • Text rendering improves significantly with RL training, enhancing word accuracy and legibility while preserving semantic alignment, validating the effectiveness of the reinforcement learning framework.
  • Ablation studies confirm that stacked channel bridging, think tokens, and VLM activation are critical for performance, particularly in reasoning tasks, by enriching multimodal conditioning and knowledge distillation.
  • RL training stability and effectiveness rely on auxiliary SFT loss, KL regularization, and reward-wise advantage normalization, which collectively prevent capability drift and ensure balanced multi-objective optimization.

The authors evaluate DeepGen 1.0 on reasoning-based text-to-image generation using the T2I-CoREBench benchmark, which covers eight distinct reasoning categories. Results show that DeepGen 1.0, with only 5B parameters, matches or slightly surpasses larger open-source models across most reasoning types and achieves a competitive overall score, demonstrating broad reasoning capability despite its compact size.

The authors use a multi-objective reward framework to balance text rendering and general image generation, assigning higher preference weight to general T2I tasks while relying on OCR accuracy to specifically guide text synthesis. Results show that this approach prioritizes overall image quality during generation while still enabling precise text rendering through targeted signal alignment.

The authors evaluate DeepGen 1.0’s architectural components by ablating key elements and observe consistent performance drops across generation, editing, and reasoning benchmarks when removing stacked channel bridging, think tokens, or VLM activation. Results show that think tokens contribute most significantly to reasoning tasks, while stacked channel bridging and VLM activation support broader multimodal alignment. These findings confirm that each component plays a distinct and necessary role in maintaining the model’s overall capability.

The authors evaluate the impact of key reinforcement learning components in DeepGen 1.0 by ablating auxiliary SFT loss, KL regularization, and reward-wise advantage normalization. Results show that removing any of these components leads to measurable performance drops across generation and editing benchmarks, with the auxiliary SFT loss being especially critical for maintaining stability and preventing capability degradation during training. The full RL configuration consistently outperforms ablated variants, confirming that these components work synergistically to optimize multi-objective learning.

The authors evaluate DeepGen 1.0 on text rendering using the CVTG-2K benchmark, comparing it against both closed-source and open-source models. Results show that DeepGen 1.0, despite its compact 5B parameter size, achieves competitive CLIPScore and significantly improves Word Accuracy after RL training, outperforming several larger open-source models while maintaining strong semantic alignment.

