
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation, and Editing

Abstract

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face an inherent trade-off between preserving advanced semantic understanding and acquiring strong generative capability. This report presents InternVL-U, a lightweight 4-billion-parameter UMM that democratizes these capabilities within a single framework. Built on the principles of unified contextual modeling and a modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art multimodal large language model (MLLM) with a specialized MMDiT-based visual generation head. Furthermore, to bridge the gap between aesthetic generation and higher-order intelligence, the authors construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks such as text rendering and scientific reasoning, which, under a reasoning-centric paradigm, leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments show that InternVL-U achieves an excellent balance of performance and efficiency: despite using only 4 billion parameters, it consistently outperforms unified baselines more than three times its size, such as BAGEL (14B), across a range of generation and editing tasks, while maintaining strong multimodal understanding and reasoning capabilities.

One-sentence Summary

Researchers from Shanghai AI Laboratory and multiple universities introduce InternVL-U, a lightweight 4B-parameter unified multimodal model that uniquely combines an MLLM with an MMDiT generation head. By leveraging a reasoning-centric data pipeline, it outperforms larger baselines in high-fidelity image generation and editing while maintaining strong semantic understanding.

Key Contributions

  • Unified multimodal models often struggle to balance strong semantic comprehension with powerful generation capabilities, creating a trade-off that limits their effectiveness in complex tasks.
  • InternVL-U addresses this by integrating a state-of-the-art Multimodal Large Language Model with a specialized MMDiT-based visual generation head, guided by a reasoning-centric data synthesis pipeline that leverages Chain-of-Thought to align abstract intent with fine-grained visual details.
  • Despite using only 4B parameters, the model consistently outperforms unified baselines with over three times the scale, such as BAGEL (14B), on various generation and editing tasks while retaining robust multimodal understanding and reasoning abilities.

Introduction

Unified multimodal models aim to integrate visual understanding, reasoning, generation, and editing within a single framework to advance toward Artificial General Intelligence, yet they struggle with inherent trade-offs between semantic comprehension and high-fidelity visual output. Prior approaches either require prohibitively expensive training from scratch or rely on fragmented pipelines that fail to align generation heads cleanly with the hidden states of large language models, often resulting in poor text rendering and weak logical consistency. The authors leverage a lightweight 4B-parameter architecture that combines a state-of-the-art Multimodal Large Language Model with a specialized MMDiT-based visual generation head to achieve superior efficiency and performance. They further introduce a comprehensive data synthesis pipeline driven by Chain-of-Thought reasoning to bridge the gap between abstract user intent and fine-grained visual details, enabling the model to outperform significantly larger baselines on complex tasks like scientific diagram generation and precise text editing.

Dataset

InternVL-U Dataset Overview

The authors construct a large-scale training corpus for InternVL-U by combining high-quality open-source datasets with specialized synthetic data pipelines. This approach targets diverse multimodal generation and editing tasks, with a specific focus on long-tail domains like human portraits, text-rich imagery, and scientific reasoning.

  • Dataset Composition and Sources

    • The initial data pool consists of publicly available image generation and editing datasets.
    • Specialized subsets are augmented to address long-tail cases in human portraits and text-rich scenarios.
    • Synthetic data is generated across five core domains: general, text-centric, science-centric, spatial-centric, and humor-centric.
  • Key Details for Each Subset

    • General Data: Includes diverse visual domains such as portraits, posters, and natural scenes. It utilizes a dual-branch expansion workflow combining retrieval-based searches for long-tail concepts and synthesis-based generation for manifold densification.
    • Text-Centric Data: Covers three types: semantically relevant text on natural images, text on solid-color backgrounds, and text editing within existing images (e.g., license plates, signboards).
    • Science-Centric Data: Spans physics, chemistry, biology, and computer science. Physics data uses an SVG-based pipeline for high-quality image pairs, while computer science data focuses on data structures like trees, graphs, and finite state machines.
    • Spatial-Centric Data: Derived from solid geometry (using GeoGebra), multi-view CAD (using the ABC dataset), and 3D object rotation (using Objaverse).
    • Humor-Centric Data: Synthesized from internet memes to train the model on abstract intent, sarcasm, and visual-textual contrast.
  • Data Usage and Processing Strategies

    • Preprocessing: The authors apply a rigorous multi-dimensional filtering protocol to exclude low-quality samples based on aesthetic scores, resolution, safety standards, and watermark detection. Near-duplicates are removed using perceptual hashing (p-hash).
    • Captioning: A pre-trained MLLM (Qwen2.5-VL) generates captions at varying granularities, including concise, dense, and human-centric descriptions.
    • Bilingual Support: An English-to-Chinese translation pipeline is applied across the dataset to ensure bilingual proficiency.
    • Reasoning Enhancement: A reasoning-centric module converts abstract user instructions into structured, actionable specifications using Chain-of-Thought (CoT) reasoning. This process enriches prompts with detailed visual descriptions, spatial relationships, and domain-specific constraints.
    • Synthesis Pipelines:
      • Physics: Uses PaddleOCR to extract images from documents, followed by an SVG-based generation pipeline to create input-output pairs, reducing costs significantly compared to raster editing.
      • Computer Science: Employs Python libraries (matplotlib, Graphviz) with fixed anchor points to ensure spatial consistency in data structure visualizations.
      • Spatial Rotation: Utilizes an "Object-First" strategy for context integration and a "Background-First" strategy for strict background preservation during object rotation.
  • Evaluation Benchmark

    • The authors introduce TextEdit, a human-curated benchmark for text-centric image editing.
    • It covers 18 sub-classes across virtual and real-world scenes.
    • Evaluation relies on manually annotated ground truth and a hybrid protocol combining OCR metrics, image fidelity measures, and multimodal LLM-based assessments.
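The near-duplicate removal step in the preprocessing protocol can be sketched with a toy perceptual hash. This is an illustrative average-hash on 8x8 grayscale grids only; a real pipeline would first resize images (e.g. with Pillow), and the authors' exact hash function and distance threshold are not specified in the summary:

```python
def average_hash(pixels):
    """Compute a 64-bit average hash from an 8x8 grayscale grid.

    `pixels` is an 8x8 list of lists of values in 0-255. This is a toy
    stand-in for the p-hash filter described in the dataset section.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        # One bit per pixel: 1 if brighter than the mean, else 0.
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def dedup(hashes, threshold=5):
    """Keep only hashes farther than `threshold` bits from every kept one."""
    kept = []
    for h in hashes:
        if all(hamming(h, k) > threshold for k in kept):
            kept.append(h)
    return kept
```

In practice the threshold trades recall for precision: a larger value removes more borderline near-duplicates at the risk of discarding distinct but visually similar samples.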

Method

The authors propose InternVL-U, an efficient Unified Multimodal Model (UMM) designed to seamlessly integrate generative capabilities into a strong understanding backbone. The architecture is driven by three core design principles: Unified Contextual Modeling with Modality-Adaptive Generation, Structural Efficiency via Modality-Specific Modular Design, and Decoupled Visual Representations for Understanding and Generation.

Refer to the framework diagram for the high-level architectural design. The model addresses the dichotomy between multimodal understanding and generation by employing a unified autoregressive paradigm for contextualization while diverging for prediction targets. Text is modeled via a categorical distribution using cross-entropy loss, whereas visual signals are modeled in a continuous multivariate probability space using Flow Matching. To ensure structural efficiency, the model initializes its backbone with an encoder-based architecture (leveraging a pre-trained ViT) rather than a monolithic design, introducing an inductive bias that efficiently aggregates visual information. Furthermore, a dedicated generation head based on the Multimodal Diffusion Transformer (MMDiT) architecture is extended from the pre-trained MLLM. This hierarchical design allows the backbone to focus on semantic reasoning while the specialized stems and heads handle modality-specific translation. Crucially, the model adopts an asymmetric representation strategy: high-level semantic features from a ViT are used for understanding, while a separate Variational Autoencoder (VAE) compresses images into a latent space suitable for synthesis, avoiding the optimization trade-off between abstraction and pixel details.
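The Flow Matching objective described above can be illustrated with a minimal sketch, using scalars in place of VAE image latents. The straight-line (rectified-flow) interpolation path and the function names are assumptions; the summary does not give the authors' exact noise schedule or conditioning:

```python
import random

def flow_matching_sample(x1, t, rng=random):
    """Build a training pair under the velocity parameterization.

    Interpolates between Gaussian noise x0 and a data point x1 along a
    straight line, and returns the interpolant plus the target velocity
    (x1 - x0) that the generation head must regress. Scalars stand in
    for image latents.
    """
    x0 = rng.gauss(0.0, 1.0)          # noise sample
    xt = (1.0 - t) * x0 + t * x1      # point on the straight-line path
    v_target = x1 - x0                # velocity transporting noise to data
    return xt, v_target

def fm_loss(v_pred, v_target):
    """Per-sample squared error on the regressed velocity."""
    return (v_pred - v_target) ** 2
```

At t = 0 the interpolant equals the noise sample and following the velocity for unit time recovers the data point, which is the property the solver exploits at inference.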

The detailed architecture of the visual generation head is illustrated below.

The head employs Dual Projectors to map multimodal hidden states and VAE image latents into the conditioning space. To address scale mismatch, an additional normalization layer is introduced on the VLM branch. The core component is the Dual-Stream MMDiT Block, which utilizes a fully Dual-Stream architecture where context and target streams interact via joint self-attention but use disentangled parameters for QKVO projections and Feed-Forward Networks (FFNs). An element-wise Gating Mechanism is integrated into the attention block to enhance non-linearity and mitigate attention-sink phenomena. Additionally, the model employs Multimodal Scalable RoPE (MSRoPE) to encode positional information with unified 3D embeddings (temporal, height, width) for both generative targets and context visual tokens, ensuring rigorous preservation of spatial structures.
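The element-wise gating mechanism in the attention block can be sketched as follows. Shapes and the projection producing the gate logits are simplified; in the actual block the gate would be predicted from the hidden state, and the exact placement within the attention computation is not specified in this summary:

```python
import math

def gated_output(attn_out, gate_logits):
    """Element-wise gate on the attention output: out_i = sigmoid(g_i) * a_i.

    Letting the model drive a gate toward zero suppresses uninformative
    positions, which is one way such a gate can mitigate attention-sink
    behavior. Inputs are flat lists of floats for illustration.
    """
    return [a * (1.0 / (1.0 + math.exp(-g)))
            for a, g in zip(attn_out, gate_logits)]
```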

The training process is formulated as a joint optimization objective. For the textual component, the model minimizes the negative log-likelihood of target tokens using the standard Next-Token Prediction (NTP) objective. For the visual component, the Flow Matching framework with velocity parameterization is adopted to model the continuous distribution of image latents. The model regresses the velocity vector field that transports the probability density from a Gaussian noise distribution to the data distribution. The final training objective is a weighted sum of the discrete and continuous losses, with coefficients dynamically adjusted across different training stages.
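A minimal sketch of the joint objective, assuming a simple mean negative log-likelihood for the text branch and an externally computed flow-matching error for the visual branch; the weighting coefficient and its per-stage values are illustrative, as the summary only states that they are adjusted across stages:

```python
import math

def ntp_loss(probs, target_ids):
    """Mean negative log-likelihood of the target tokens (next-token
    prediction). `probs[i]` is the predicted distribution over the
    vocabulary at step i."""
    nll = -sum(math.log(p[t]) for p, t in zip(probs, target_ids))
    return nll / len(target_ids)

def joint_loss(text_nll, fm_mse, lam=1.0):
    """Final objective: weighted sum of the discrete (NTP) and continuous
    (flow-matching) losses. `lam` stands in for the stage-dependent
    coefficient mentioned in the text."""
    return text_nll + lam * fm_mse
```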

A three-stage curriculum is designed to progressively unlock visual synthesis skills. In the first stage, Generation Head Pre-training, the MLLM is frozen while the generation head and projectors are trained on a mixture of text-to-image and image editing datasets. The second stage, Any-resolution Continued Pre-training, involves variable-resolution training (512 to 1024 pixels) with a frozen backbone to handle diverse aspect ratios. The final stage, Unified Supervised Finetuning, unfreezes the entire model to enable end-to-end optimization, mixing Chain-of-Thought reasoning data with image generation and editing data.
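The freeze/unfreeze schedule of the three-stage curriculum can be written down as a small configuration; the module names here are illustrative labels, not the authors' actual identifiers:

```python
# Which module groups are trainable in each stage of the curriculum.
# Stage 1 and 2 keep the MLLM backbone frozen; stage 3 unfreezes everything.
STAGES = {
    "head_pretrain":   {"mllm": False, "gen_head": True, "projectors": True},
    "anyres_pretrain": {"mllm": False, "gen_head": True, "projectors": True},
    "unified_sft":     {"mllm": True,  "gen_head": True, "projectors": True},
}

def trainable_params(stage):
    """Return the module groups unfrozen in a given curriculum stage."""
    return [name for name, train in STAGES[stage].items() if train]
```

Keeping the backbone frozen in the first two stages protects the pretrained understanding ability while the generation head learns to consume its hidden states; only the final stage optimizes end to end.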

To support high-semantic-density tasks, comprehensive data synthesis pipelines are constructed. For image editing, a multi-agent framework generates instruction-edit pairs categorized into Global, Object, Attribute, and Compositional levels.

For text-to-image data, an automatic pipeline renders text on natural images and pure-color backgrounds with adaptive layout design.

For text-aware image editing, a three-stage pipeline employs OCR tools, MLLM-based instruction agents, and text-editing agents to generate high-quality paired samples.

During inference, Flow-DPM-Solver is adopted with 20 inference steps. Classifier-free guidance is used for both image and text conditions, with specific scales set for dropping the entire condition or text condition only.
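One common way to apply classifier-free guidance over two conditions (image and text) is the nested formulation below: push the prediction from unconditional toward image-conditioned, then toward fully conditioned. The paper's exact combination rule and scale values are not given in this summary, so treat this as an assumed sketch:

```python
def dual_cfg(v_uncond, v_img_only, v_full, s_img, s_txt):
    """Two-level classifier-free guidance on a predicted velocity.

    v_uncond:   prediction with the entire condition dropped
    v_img_only: prediction with the text condition dropped
    v_full:     prediction with both image and text conditions
    s_img/s_txt are the respective guidance scales (values assumed).
    """
    return (v_uncond
            + s_img * (v_img_only - v_uncond)
            + s_txt * (v_full - v_img_only))
```

With both scales set to 1 this reduces to the fully conditioned prediction; larger scales amplify the respective condition's influence at each of the 20 solver steps.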

Experiment

  • Multimodal understanding and reasoning benchmarks validate that the unified training strategy retains strong visual-language comprehension while achieving a superior balance between understanding and generation, matching larger models despite a compact architecture.
  • General image generation experiments confirm the model's ability to render intricate textures, nuanced lighting, and precise semantic alignment, outperforming other unified models with significantly fewer parameters.
  • Text-centric generation and editing evaluations demonstrate state-of-the-art capabilities in rendering legible multilingual text and accurately modifying specific text regions while preserving background integrity and visual aesthetics.
  • Knowledge-informed generation and reasoning-based editing tests show that integrating explicit reasoning steps significantly enhances the model's ability to execute complex logical constraints, scientific concepts, and multi-step instructions.
  • Qualitative results across all domains highlight the model's robust controllability, high visual fidelity, and effectiveness in handling diverse tasks ranging from humor-centric memes to specialized scientific diagrams.
