HyperAI

DeepGen 1.0: A Lightweight Unified Multimodal Model Advancing Image Generation and Editing

Abstract

Current multimodal models that unify image generation and editing typically rely on massive parameter counts (e.g., more than 10B), incurring prohibitive training costs and deployment footprints. This work proposes DeepGen 1.0, a lightweight 5B-parameter unified model that achieves comprehensive performance comparable to or exceeding far larger models. To overcome the limitations of compact models in semantic understanding and fine-grained control, the authors introduce Stacked Channel Bridging (SCB), a deep-fusion framework that extracts hierarchical features from multiple VLM (vision-language model) layers and fuses them with learnable "think tokens," providing the generation backbone with structured, reasoning-rich guidance. They further design a three-stage data-centric training strategy: (1) alignment pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT (Diffusion Transformer) representations; (2) joint supervised fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to strengthen comprehensive capability; and (3) reinforcement learning with MR-GRPO, which leverages a mixture of multiple reward functions and supervision signals to substantially improve generation quality and alignment with human preferences while maintaining stable training and avoiding visual artifacts. Despite training on only ~50M samples, DeepGen 1.0 achieves state-of-the-art performance across diverse benchmarks, surpassing the 80B-parameter HunyuanImage by 28% on WISE and the 27B-parameter Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing the training code, model weights, and datasets, this work offers an efficient, high-performance alternative that democratizes unified multimodal research.

One-sentence Summary

Researchers from Shanghai Innovation Institute, Fudan, USTC, and others propose DeepGen 1.0, a 5B lightweight multimodal model using Stacked Channel Bridging and staged training to outperform larger models in generation and editing, achieving state-of-the-art results while democratizing access via open-sourced code and datasets.

Key Contributions

  • DeepGen 1.0 introduces a lightweight 5B-parameter unified model for image generation and editing, challenging the assumption that large-scale models (>10B) are necessary for high performance, and achieves state-of-the-art results despite training on only ~50M samples.
  • It proposes Stacked Channel Bridging (SCB), a novel alignment framework that fuses hierarchical VLM features with learnable “think tokens” to deliver structured, reasoning-rich guidance to the DiT backbone, enhancing semantic understanding and fine-grained control without increasing model size.
  • Through a three-stage data-centric training pipeline—including alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO—DeepGen 1.0 outperforms larger models like 80B HunyuanImage (by 28% on WISE) and 27B Qwen-Image-Edit (by 37% on UniREditBench), while avoiding visual artifacts and maintaining training stability.

Introduction

The authors leverage a unified VLM-DiT architecture to build DeepGen 1.0, a 5B-parameter model that handles image generation, editing, and reasoning tasks—challenging the assumption that only massive models (10B+) can deliver high-quality, semantically accurate visual outputs. Prior work relied on expensive, multi-model setups or large-scale training data, while smaller models failed to match performance due to weak cross-modal alignment and limited reasoning support. DeepGen 1.0 overcomes these limits with Stacked Channel Bridging (SCB), which fuses multi-layer VLM features and learnable “think tokens” to guide the DiT with structured, hierarchical semantics, plus a three-stage training pipeline that emphasizes data efficiency and human preference alignment via MR-GRPO. The result is a compact model that outperforms much larger counterparts—including 80B and 27B baselines—on reasoning and editing benchmarks, while being trained on just 50M samples and fully open-sourced for broader adoption.

Dataset

The authors use a diverse, multi-source training dataset combining real-world, synthetic, and curated open-source data to support general generation, editing, reasoning, text rendering, and application-specific tasks.

  • General Generation:

    • Sources: Text-to-image-2M, LAION-Aesthetic-6M, Megalith-10M, RedCaps-5M, CC-12M.
    • Instruction fine-tuning: BLIP-3o (60k), ShareGPT-4o-Image (45k), Echo-4o-Image (100k), OpenGPT4o-Image (40k), plus 10M in-house samples (3:1 long-to-short prompt ratio).
    • Augmented with ~50k synthetic photorealistic images via Nano Banana, paired with fine-grained prompts in both Chinese and English.
  • General Editing:

    • Sources: NHR-Edit (720k), GPT-Image-Edit (1.5M), ShareGPT-4o-Image-Edit (50k), OpenGPT4o-Image-Edit (40k), Nano-banana-consist (150k), Pico-Banana (250k), X2I2 (1.6M), Uniworld-Edit (1.2M), plus 1.1M in-house editing samples (Chinese and English).
  • Reasoning-based Generation and Editing:

    • Sources: UniReason (150k generation, 100k editing samples) covering cultural commonsense, natural science, spatial, temporal, and logical reasoning.
  • Text Rendering & Application Scenarios:

    • Sources: Multimodal QA datasets for captions; Gemini 2.5 Pro generates diverse rendering attributes (font, layout, color); Qwen-Image synthesizes 500k text-rendering images.
    • Extended with 60k application samples (e.g., Chinese poetry, poster design).
  • Processing & Usage:

    • Datasets are mixed per training stage as detailed in Appendix A, Table 8.
    • No explicit cropping strategy mentioned; metadata is constructed via prompt engineering and synthetic image generation pipelines.
    • All subsets are aligned to support multilingual (Chinese/English) and multimodal instruction following.

Method

The authors leverage a VLM-DiT architecture to unify multimodal understanding with high-fidelity image generation, as illustrated in the framework diagram. The system begins with a pretrained vision-language model (VLM), specifically Qwen-2.5-VL (3B), which processes interleaved visual and textual inputs—comprising system prompts, reference images, and user instructions—through its transformer blocks. To enhance reasoning, a fixed set of learnable “think tokens” is injected into the input sequence, enabling implicit Chain-of-Thought behavior across all VLM layers via self-attention. These tokens progressively summarize hidden representations, enriching the model’s ability to extract and retain knowledge.
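The think-token injection step can be illustrated with a minimal NumPy sketch. All dimensions and the token count below are hypothetical placeholders (the actual VLM is Qwen-2.5-VL 3B); the point is only the mechanics: a fixed set of learnable embeddings is appended to the multimodal input sequence so that self-attention at every layer can write summary information into those positions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64   # hidden width (hypothetical; far smaller than the real VLM)
n_think = 8    # number of learnable think tokens (hypothetical count)

# Learnable think-token embeddings, shared across all prompts.
think_tokens = rng.normal(scale=0.02, size=(n_think, d_model))

def inject_think_tokens(input_embeds: np.ndarray) -> np.ndarray:
    """Append think tokens to a (seq_len, d_model) embedding sequence.

    Inside the VLM, self-attention then lets these extra positions attend
    to the full multimodal context and accumulate a summary across layers.
    """
    return np.concatenate([input_embeds, think_tokens], axis=0)

prompt_embeds = rng.normal(size=(32, d_model))  # system prompt + instruction
extended = inject_think_tokens(prompt_embeds)
print(extended.shape)  # (40, 64): 32 input tokens + 8 think tokens
```

In the actual model these embeddings are trained jointly with the rest of the pipeline, so the summaries they accumulate become the reasoning-rich signal later consumed by the bridge.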

Rather than relying on a single-layer VLM output, the authors introduce the Stacked Channel Bridging (SCB) framework to aggregate features from multiple layers. Six hidden states are uniformly sampled across low-, mid-, and high-level VLM blocks, preserving both fine-grained visual details and semantic abstractions. These selected states, including the think token representations, are stacked along the channel dimension and projected via a lightweight MLP to match the DiT’s input width. A Transformer encoder then fuses these multi-layer features into a robust conditional input $c \in \mathbb{R}^{L \times d_{\text{DiT}}}$, formalized as:

$$c = \operatorname{Encoder}(\operatorname{MLP}(\operatorname{Concat}_{\mathrm{ch}}(x_1, \ldots, x_n))).$$
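The channel-stacking and projection steps can be sketched in a few lines of NumPy. The sizes below are hypothetical, and the final Transformer encoder is elided (a single linear map stands in for the lightweight MLP); the sketch only demonstrates the shape logic of $\operatorname{Concat}_{\mathrm{ch}}$ followed by projection to the DiT width.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_vlm, d_dit = 40, 64, 96  # hypothetical sizes
n_layers = 6                         # six uniformly sampled hidden states

# Hidden states x_1..x_6 drawn from low-, mid-, and high-level VLM blocks
# (here random stand-ins), each of shape (seq_len, d_vlm).
hidden_states = [rng.normal(size=(seq_len, d_vlm)) for _ in range(n_layers)]

# Concat_ch: stack along the channel (feature) dimension.
stacked = np.concatenate(hidden_states, axis=-1)  # (seq_len, 6 * d_vlm)

# Project down to the DiT input width. One linear layer shown for brevity;
# the paper uses a lightweight MLP followed by a Transformer encoder.
W = rng.normal(scale=0.02, size=(n_layers * d_vlm, d_dit))
b = np.zeros(d_dit)
c = stacked @ W + b                               # (seq_len, d_dit)
print(c.shape)  # (40, 96)
```

Because the concatenation happens per token position, the sequence length $L$ is preserved and only the channel width grows before projection, which keeps the bridge cheap relative to the VLM and DiT it connects.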

This conditional signal is fed into the DiT decoder—initialized from SD3.5-Medium (2B)—which generates images through a sequence of DiT blocks conditioned on both the multimodal context and a noisy latent input. The DiT is further guided by a VAE encoder for latent space alignment and a noisy refiner for iterative refinement. The entire pipeline is connected via a streamlined connector module based on SigLIP and six transformer layers, maintaining a compact 5B parameter footprint.

Following alignment pre-training (Stage 1), training proceeds in two further stages. In Stage 2, the authors perform joint supervised fine-tuning over 400,000 iterations on a diverse multi-task dataset encompassing general text-to-image generation, reasoning-based generation, text rendering, and image editing. To preserve the VLM’s pretrained capabilities, they apply LoRA for parameter-efficient adaptation. Images are processed at 512×512 resolution with dynamic aspect ratio preservation, and optimization uses a learning rate of $5 \times 10^{-5}$ with 20,000 warm-up steps.
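The LoRA adaptation used to protect the VLM's pretrained weights can be sketched as the standard low-rank update; the rank, scaling, and dimensions below are hypothetical, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank, alpha = 64, 64, 8, 16  # hypothetical LoRA hyperparameters

W = rng.normal(size=(d_in, d_out))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d_in, rank))   # trainable down-projection
B = np.zeros((rank, d_out))                     # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = x W + (alpha / rank) * x A B; only A and B receive gradients."""
    return x @ W + (alpha / rank) * (x @ A) @ B

x = rng.normal(size=(4, d_in))
# With B zero-initialized, the adapted layer initially matches the frozen one,
# so fine-tuning starts exactly from the pretrained behavior.
assert np.allclose(lora_forward(x), x @ W)
```

Zero-initializing `B` is the standard LoRA trick that makes the update a no-op at step zero, which is precisely why it suits the goal here of not disturbing the VLM's existing capabilities.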

In Stage 3, reinforcement learning is applied via the MR-GRPO framework to align outputs with human preferences. The model samples a group of $G = 8$ images per prompt using a noise-preserving stochastic sampler that maintains scheduler-consistent noise levels, ensuring stable reward signals. Each generated image is evaluated by three complementary reward functions: a VLM-based pairwise preference model for semantic and visual quality, an OCR reward for text rendering accuracy, and a CLIP similarity score for overall alignment. Rewards are normalized per group and aggregated with category-specific weights—prioritizing OCR for text-rendering prompts and preference rewards for general generation.
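The per-group normalization and weighted aggregation can be sketched as follows. The reward values and category weights are invented for illustration; the sketch shows why normalizing each reward separately (before mixing) preserves the granularity of every signal, rather than letting one reward's scale dominate.

```python
import numpy as np

def mr_grpo_advantages(rewards: dict, weights: dict) -> np.ndarray:
    """Normalize each reward per group, then combine with category weights.

    rewards: name -> array of shape (G,), the G sampled images for one prompt.
    Normalizing reward-wise (rather than after summing) keeps each signal's
    relative ranking intact, mirroring reward-wise advantage normalization.
    """
    total = np.zeros_like(next(iter(rewards.values())), dtype=float)
    for name, r in rewards.items():
        r = np.asarray(r, dtype=float)
        total += weights.get(name, 0.0) * (r - r.mean()) / (r.std() + 1e-8)
    return total

# Hypothetical rewards for G = 8 images of a text-rendering prompt:
rewards = {
    "preference": np.array([0.2, 0.5, 0.4, 0.9, 0.1, 0.6, 0.3, 0.7]),
    "ocr":        np.array([0.0, 1.0, 0.5, 1.0, 0.0, 0.5, 0.0, 1.0]),
    "clip":       np.array([0.30, 0.33, 0.31, 0.35, 0.29, 0.32, 0.30, 0.34]),
}
weights = {"preference": 0.3, "ocr": 0.5, "clip": 0.2}  # OCR prioritized here
adv = mr_grpo_advantages(rewards, weights)
print(adv.round(2))
```

Because each normalized reward has zero mean within the group, the combined advantages also sum to roughly zero, so the policy gradient pushes probability mass from below-average samples toward above-average ones.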

The policy optimization objective combines GRPO with an auxiliary supervised diffusion loss to prevent capability degradation:

$$\mathcal{L}_{\mathrm{total}} = (1 - \lambda)\,\mathcal{L}_{\mathrm{GRPO}} + \lambda\,\mathcal{L}_{\mathrm{SFT}},$$

where $\mathcal{L}_{\mathrm{GRPO}}$ includes clipped advantage terms and KL regularization computed in velocity space:

$$D_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}) = \left\| \hat{v}_{\theta}(x_t, t) - \hat{v}_{\mathrm{ref}}(x_t, t) \right\|^2.$$

Training runs for 1,500 steps with a learning rate of $2 \times 10^{-6}$, using 50 denoising steps per sample and batch-wise advantage normalization to preserve multi-reward granularity.
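The loss combination and the velocity-space KL surrogate above reduce to a few lines of arithmetic. This is a sketch under assumptions: the $\lambda$ value and the velocity vectors are placeholders, and the clipped-advantage term inside $\mathcal{L}_{\mathrm{GRPO}}$ is treated as an opaque scalar.

```python
import numpy as np

def total_loss(l_grpo: float, l_sft: float, lam: float = 0.1) -> float:
    """L_total = (1 - lambda) * L_GRPO + lambda * L_SFT (lambda is hypothetical)."""
    return (1.0 - lam) * l_grpo + lam * l_sft

def kl_velocity(v_theta: np.ndarray, v_ref: np.ndarray) -> float:
    """KL surrogate in velocity space: squared L2 distance between the
    policy's and the frozen reference model's predicted velocities."""
    return float(np.sum((v_theta - v_ref) ** 2))

rng = np.random.default_rng(0)
v_theta = rng.normal(size=(16,))
v_ref = v_theta + 0.01 * rng.normal(size=(16,))  # policy close to reference
print(kl_velocity(v_theta, v_ref))               # small penalty
print(total_loss(l_grpo=0.8, l_sft=0.4))
```

Keeping $\lambda > 0$ means every RL step still pays a supervised diffusion loss, which is the mechanism the paper credits for preventing capability degradation during preference optimization.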

Experiment

  • Alignment pre-training successfully bridges VLM and DiT using only connector and think tokens, enabling foundational text-to-image and editing capabilities without full model tuning.
  • DeepGen 1.0 excels across general generation and editing benchmarks, matching or surpassing larger models despite its compact 5B parameter size, demonstrating strong semantic alignment and instruction following.
  • The model achieves top-tier reasoning performance on world-knowledge tasks, outperforming open-source peers and narrowing gaps with closed-source systems across cultural, scientific, and logical domains.
  • Reasoning-based editing shows robustness in real and game-world scenarios, leading in key benchmarks and exceeding some closed-source models in overall performance.
  • Text rendering improves significantly with RL training, enhancing word accuracy and legibility while preserving semantic alignment, validating the effectiveness of the reinforcement learning framework.
  • Ablation studies confirm that stacked channel bridging, think tokens, and VLM activation are critical for performance, particularly in reasoning tasks, by enriching multimodal conditioning and knowledge distillation.
  • RL training stability and effectiveness rely on auxiliary SFT loss, KL regularization, and reward-wise advantage normalization, which collectively prevent capability drift and ensure balanced multi-objective optimization.

The authors evaluate DeepGen 1.0 on reasoning-based text-to-image generation using the T2I-CoREBench benchmark, which covers eight distinct reasoning categories. Results show that DeepGen 1.0, with only 5B parameters, matches or slightly surpasses larger open-source models across most reasoning types and achieves a competitive overall score, demonstrating broad reasoning capability despite its compact size.

The authors use a multi-objective reward framework to balance text rendering and general image generation, assigning higher preference weight to general T2I tasks while relying on OCR accuracy to specifically guide text synthesis. Results show that this approach prioritizes overall image quality during generation while still enabling precise text rendering through targeted signal alignment.

The authors evaluate DeepGen 1.0’s architectural components by ablating key elements and observe consistent performance drops across generation, editing, and reasoning benchmarks when removing stacked channel bridging, think tokens, or VLM activation. Results show that think tokens contribute most significantly to reasoning tasks, while stacked channel bridging and VLM activation support broader multimodal alignment. These findings confirm that each component plays a distinct and necessary role in maintaining the model’s overall capability.

The authors evaluate the impact of key reinforcement learning components in DeepGen 1.0 by ablating auxiliary SFT loss, KL regularization, and reward-wise advantage normalization. Results show that removing any of these components leads to measurable performance drops across generation and editing benchmarks, with the auxiliary SFT loss being especially critical for maintaining stability and preventing capability degradation during training. The full RL configuration consistently outperforms ablated variants, confirming that these components work synergistically to optimize multi-objective learning.

The authors evaluate DeepGen 1.0 on text rendering using the CVTG-2K benchmark, comparing it against both closed-source and open-source models. Results show that DeepGen 1.0, despite its compact 5B parameter size, achieves competitive CLIPScore and significantly improves Word Accuracy after RL training, outperforming several larger open-source models while maintaining strong semantic alignment.

