InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation, and Editing
Abstract
Unified multimodal models (UMMs), which integrate understanding, reasoning, generation, and editing, face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generative capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and a modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further close the gap between aesthetic generation and higher intelligence, we develop a comprehensive data synthesis pipeline targeting tasks with high semantic density, such as text rendering and scientific reasoning. This pipeline follows a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with the fine-grained details of visual generation. Extensive experiments show that InternVL-U achieves a superior balance between performance and efficiency. Despite using only 4B parameters, it consistently outperforms unified baselines of more than three times its scale, such as BAGEL (14B), across a range of generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.
One-sentence Summary
Researchers from Shanghai AI Laboratory and multiple universities introduce InternVL-U, a lightweight 4B-parameter unified multimodal model that uniquely combines an MLLM with an MMDiT generation head. By leveraging a reasoning-centric data pipeline, it outperforms larger baselines in high-fidelity image generation and editing while maintaining strong semantic understanding.
Key Contributions
- Unified multimodal models often struggle to balance strong semantic comprehension with powerful generation capabilities, creating a trade-off that limits their effectiveness in complex tasks.
- InternVL-U addresses this by integrating a state-of-the-art Multimodal Large Language Model with a specialized MMDiT-based visual generation head, guided by a reasoning-centric data synthesis pipeline that leverages Chain-of-Thought to align abstract intent with fine-grained visual details.
- Despite using only 4B parameters, the model consistently outperforms unified baselines with over three times the scale, such as BAGEL (14B), on various generation and editing tasks while retaining robust multimodal understanding and reasoning abilities.
Introduction
Unified multimodal models aim to integrate visual understanding, reasoning, generation, and editing within a single framework to advance toward Artificial General Intelligence, yet they struggle with inherent trade-offs between semantic comprehension and high-fidelity visual output. Prior approaches either require prohibitively expensive training from scratch or rely on fragmented pipelines that fail to align generation heads cleanly with the hidden states of large language models, often resulting in poor text rendering and weak logical consistency. The authors leverage a lightweight 4B-parameter architecture that combines a state-of-the-art Multimodal Large Language Model with a specialized MMDiT-based visual generation head to achieve superior efficiency and performance. They further introduce a comprehensive data synthesis pipeline driven by Chain-of-Thought reasoning to bridge the gap between abstract user intent and fine-grained visual details, enabling the model to outperform significantly larger baselines on complex tasks like scientific diagram generation and precise text editing.
Dataset
InternVL-U Dataset Overview
The authors construct a large-scale training corpus for InternVL-U by combining high-quality open-source datasets with specialized synthetic data pipelines. This approach targets diverse multimodal generation and editing tasks, with a specific focus on long-tail domains like human portraits, text-rich imagery, and scientific reasoning.
Dataset Composition and Sources
- The initial data pool consists of publicly available image generation and editing datasets.
- Specialized subsets are augmented to address long-tail cases in human portraits and text-rich scenarios.
- Synthetic data is generated across five core domains: general, text-centric, science-centric, spatial-centric, and humor-centric.
Key Details for Each Subset
- General Data: Includes diverse visual domains such as portraits, posters, and natural scenes. It utilizes a dual-branch expansion workflow combining retrieval-based searches for long-tail concepts and synthesis-based generation for manifold densification.
- Text-Centric Data: Covers three types: semantically relevant text on natural images, text on solid-color backgrounds, and text editing within existing images (e.g., license plates, signboards).
- Science-Centric Data: Spans physics, chemistry, biology, and computer science. Physics data uses an SVG-based pipeline for high-quality image pairs, while computer science data focuses on data structures like trees, graphs, and finite state machines.
- Spatial-Centric Data: Derived from solid geometry (using GeoGebra), multi-view CAD (using the ABC dataset), and 3D object rotation (using Objaverse).
- Humor-Centric Data: Synthesized from internet memes to train the model on abstract intent, sarcasm, and visual-textual contrast.
Data Usage and Processing Strategies
- Preprocessing: The authors apply a rigorous multi-dimensional filtering protocol to exclude low-quality samples based on aesthetic scores, resolution, safety standards, and watermark detection. Near-duplicates are removed using perceptual hashing (p-hash).
- Captioning: A pre-trained MLLM (Qwen2.5-VL) generates captions at varying granularities, including concise, dense, and human-centric descriptions.
- Bilingual Support: An English-to-Chinese translation pipeline is applied across the dataset to ensure bilingual proficiency.
- Reasoning Enhancement: A reasoning-centric module converts abstract user instructions into structured, actionable specifications using Chain-of-Thought (CoT) reasoning. This process enriches prompts with detailed visual descriptions, spatial relationships, and domain-specific constraints.
- Synthesis Pipelines:
- Physics: Uses PaddleOCR to extract images from documents, followed by an SVG-based generation pipeline to create input-output pairs, reducing costs significantly compared to raster editing.
- Computer Science: Employs Python libraries (matplotlib, Graphviz) with fixed anchor points to ensure spatial consistency in data structure visualizations.
- Spatial Rotation: Utilizes an "Object-First" strategy for context integration and a "Background-First" strategy for strict background preservation during object rotation.
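The near-duplicate removal step described above can be sketched with a simplified perceptual hash. The snippet below uses a difference hash (dHash) as a stand-in for the p-hash the authors mention; the function names, hash size, and distance threshold are illustrative assumptions, not details from the report.

```python
import numpy as np

def dhash(gray, hash_size=8):
    """Difference hash: box-downsample a grayscale image, then compare adjacent columns."""
    h, w = gray.shape
    ys = np.linspace(0, h, hash_size + 1, dtype=int)
    xs = np.linspace(0, w, hash_size + 2, dtype=int)
    small = np.array([[gray[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                       for j in range(hash_size + 1)]
                      for i in range(hash_size)])
    return (small[:, 1:] > small[:, :-1]).flatten()  # 64 bits for hash_size=8

def hamming(a, b):
    """Number of differing hash bits; small means near-duplicate."""
    return int(np.count_nonzero(a != b))

def deduplicate(images, threshold=6):
    """Keep an image only if its hash is far from every previously kept hash."""
    kept, hashes = [], []
    for idx, img in enumerate(images):
        h = dhash(img)
        if all(hamming(h, other) <= threshold for other in hashes) is False or not hashes:
            if all(hamming(h, other) > threshold for other in hashes):
                kept.append(idx)
                hashes.append(h)
    return kept
```

In a real pipeline the hash would be computed once per image and stored, with candidate pairs checked against the threshold before the more expensive quality filters run.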
Evaluation Benchmark
- The authors introduce TextEdit, a human-curated benchmark for text-centric image editing.
- It covers 18 sub-classes across virtual and real-world scenes.
- Evaluation relies on manually annotated ground truth and a hybrid protocol combining OCR metrics, image fidelity measures, and multimodal LLM-based assessments.
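The OCR component of such a hybrid protocol can be illustrated with a character-level accuracy score. The normalized edit distance below is a common choice for text-rendering evaluation; the benchmark's exact metrics are not specified here, so treat this as a representative sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ocr_accuracy(predicted: str, target: str) -> float:
    """1 - normalized edit distance; 1.0 means the rendered text matches exactly."""
    if not target:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, target) / len(target))
```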
Method
The authors propose InternVL-U, an efficient Unified Multimodal Model (UMM) designed to seamlessly integrate generative capabilities into a strong understanding backbone. The architecture is driven by three core design principles: Unified Contextual Modeling with Modality-Adaptive Generation, Structural Efficiency via Modality-Specific Modular Design, and Decoupled Visual Representations for Understanding and Generation.
Refer to the framework diagram for the high-level architectural design. The model addresses the dichotomy between multimodal understanding and generation by employing a unified autoregressive paradigm for contextualization while diverging for prediction targets. Text is modeled via a categorical distribution using cross-entropy loss, whereas visual signals are modeled in a continuous multivariate probability space using Flow Matching. To ensure structural efficiency, the model initializes its backbone with an encoder-based architecture (leveraging a pre-trained ViT) rather than a monolithic design, introducing an inductive bias that efficiently aggregates visual information. Furthermore, a dedicated generation head based on the Multimodal Diffusion Transformer (MMDiT) architecture is extended from the pre-trained MLLM. This hierarchical design allows the backbone to focus on semantic reasoning while the specialized stems and heads handle modality-specific translation. Crucially, the model adopts an asymmetric representation strategy: high-level semantic features from a ViT are used for understanding, while a separate Variational Autoencoder (VAE) compresses images into a latent space suitable for synthesis, avoiding the optimization trade-off between abstraction and pixel details.
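The two prediction targets can be made concrete: text tokens receive a categorical cross-entropy, while the visual branch regresses a velocity field. Under the common linear-interpolation form of Flow Matching (an assumption here, since the report's exact schedule is not quoted), one training pair is built as follows.

```python
import numpy as np

def flow_matching_pair(x1, t, rng):
    """Build the noisy latent x_t and its velocity target for one training step.

    x1: clean VAE latent, e.g. shape (C, H, W); t: scalar time in [0, 1].
    Assumes the linear path x_t = (1 - t) * x0 + t * x1, whose target
    velocity is the constant x1 - x0 along the whole path.
    """
    x0 = rng.standard_normal(x1.shape)  # sample from the Gaussian source
    xt = (1 - t) * x0 + t * x1          # point on the probability path
    v_target = x1 - x0                  # velocity the head must regress
    return xt, v_target
```

The generation head is then trained to predict `v_target` from `(xt, t, conditioning)`, and sampling integrates the learned velocity field from noise toward the data distribution.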
The detailed architecture of the visual generation head is illustrated below.

The head employs Dual Projectors to map multimodal hidden states and VAE image latents into the conditioning space. To address scale mismatch, an additional normalization layer is introduced on the VLM branch. The core component is the Dual-Stream MMDiT Block, which utilizes a fully Dual-Stream architecture where context and target streams interact via joint self-attention but use disentangled parameters for QKVO projections and Feed-Forward Networks (FFNs). An element-wise Gating Mechanism is integrated into the attention block to enhance non-linearity and mitigate attention-sink phenomena. Additionally, the model employs Multimodal Scalable RoPE (MSRoPE) to encode positional information with unified 3D embeddings (temporal, height, width) for both generative targets and context visual tokens, ensuring rigorous preservation of spatial structures.
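A minimal single-head sketch of the Dual-Stream idea: context and target tokens share one joint self-attention, but each stream keeps its own QKVO weights and an element-wise sigmoid gate on the attention output. The weight shapes, the gate placement, and the omission of MSRoPE and the FFN are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dual_stream_attention(ctx, tgt, p_ctx, p_tgt):
    """ctx: (Nc, d) context tokens; tgt: (Nt, d) generative target tokens.
    p_ctx / p_tgt: per-stream weight dicts with keys 'q','k','v','o','g', each (d, d)."""
    def proj(x, p):
        return x @ p["q"], x @ p["k"], x @ p["v"]
    qc, kc, vc = proj(ctx, p_ctx)
    qt, kt, vt = proj(tgt, p_tgt)
    # joint self-attention over the concatenated sequence
    q = np.vstack([qc, qt]); k = np.vstack([kc, kt]); v = np.vstack([vc, vt])
    out = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    oc, ot = out[:len(ctx)], out[len(ctx):]
    # disentangled output projections with element-wise sigmoid gates
    gate = lambda x, p: 1.0 / (1.0 + np.exp(-(x @ p["g"])))
    return gate(ctx, p_ctx) * (oc @ p_ctx["o"]), gate(tgt, p_tgt) * (ot @ p_tgt["o"])
```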
The training process is formulated as a joint optimization objective. For the textual component, the model minimizes the negative log-likelihood of target tokens using the standard Next-Token Prediction (NTP) objective. For the visual component, the Flow Matching framework with velocity parameterization is adopted to model the continuous distribution of image latents. The model regresses the velocity vector field that transports the probability density from a Gaussian noise distribution to the data distribution. The final training objective is a weighted sum of the discrete and continuous losses, with coefficients dynamically adjusted across different training stages.
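Written out, with symbols chosen here for illustration, the joint objective combines next-token prediction with the Flow Matching velocity regression:

```latex
\mathcal{L} \;=\; \lambda_{\text{txt}}
  \underbrace{\Big(-\sum_{i}\log p_\theta(y_i \mid y_{<i}, c)\Big)}_{\text{next-token prediction}}
\;+\; \lambda_{\text{img}}\,
  \underbrace{\mathbb{E}_{t,\,x_0,\,x_1}\big\| v_\theta(x_t, t, c) - (x_1 - x_0)\big\|^2}_{\text{flow matching}},
\qquad x_t = (1 - t)\,x_0 + t\,x_1,
```

where $x_0 \sim \mathcal{N}(0, I)$, $x_1$ is the VAE latent of the target image, $c$ is the multimodal context, and $\lambda_{\text{txt}}, \lambda_{\text{img}}$ are the stage-dependent weights.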
A three-stage curriculum is designed to progressively unlock visual synthesis skills. In the first stage, Generation Head Pre-training, the MLLM is frozen while the generation head and projectors are trained on a mixture of text-to-image and image editing datasets. The second stage, Any-resolution Continued Pre-training, involves variable-resolution training (512 to 1024 pixels) with a frozen backbone to handle diverse aspect ratios. The final stage, Unified Supervised Finetuning, unfreezes the entire model to enable end-to-end optimization, mixing Chain-of-Thought reasoning data with image generation and editing data.
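The three-stage schedule can be summarized as a configuration sketch; the module names and freezing granularity below are assumptions based on the description above, not the authors' actual config.

```python
# Hypothetical stage configuration mirroring the three-stage curriculum.
STAGES = [
    {"name": "generation_head_pretraining",
     "trainable": ["gen_head", "projectors"],          # MLLM backbone frozen
     "resolution": 512,
     "data": ["text_to_image", "image_editing"]},
    {"name": "any_resolution_continued_pretraining",
     "trainable": ["gen_head", "projectors"],          # backbone still frozen
     "resolution": (512, 1024),                        # variable resolutions / aspect ratios
     "data": ["text_to_image", "image_editing"]},
    {"name": "unified_supervised_finetuning",
     "trainable": ["mllm", "gen_head", "projectors"],  # end-to-end optimization
     "resolution": (512, 1024),
     "data": ["cot_reasoning", "text_to_image", "image_editing"]},
]

def trainable_modules(stage_name):
    """Look up which modules are unfrozen in a given stage."""
    return next(s["trainable"] for s in STAGES if s["name"] == stage_name)
```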
To support high-semantic-density tasks, comprehensive data synthesis pipelines are constructed. For image editing, a multi-agent framework generates instruction-edit pairs categorized into Global, Object, Attribute, and Compositional levels.

For text-to-image data, an automatic pipeline renders text on natural images and pure-color backgrounds with adaptive layout design.

For text-aware image editing, a three-stage pipeline employs OCR tools, MLLM-based instruction agents, and text-editing agents to generate high-quality paired samples.

During inference, Flow-DPM-Solver is adopted with 20 inference steps. Classifier-free guidance is used for both image and text conditions, with specific scales set for dropping the entire condition or text condition only.
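A hedged sketch of the two-condition guidance rule: the head is evaluated with the full condition, with the text dropped, and fully unconditional, and the three velocities are blended. The nested form and scale names below are a common multi-condition CFG pattern, not the report's exact formulation.

```python
def guided_velocity(v_full, v_no_text, v_uncond, s_all, s_text):
    """Classifier-free guidance over image and text conditions (illustrative).

    v_full: velocity with both image context and text prompt;
    v_no_text: velocity with the text condition dropped;
    v_uncond: velocity with the entire condition dropped.
    s_all guides away from the unconditional branch; s_text additionally
    amplifies the text condition. Both scales are hypothetical.
    """
    return (v_uncond
            + s_all * (v_no_text - v_uncond)
            + s_text * (v_full - v_no_text))
```

With both scales set to 1.0 the rule reduces to the fully conditional velocity, so guidance strength is controlled entirely by how far the scales exceed 1.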

Experiment
- Multimodal understanding and reasoning benchmarks validate that the unified training strategy retains strong visual-language comprehension while achieving a superior balance between understanding and generation, matching larger models despite a compact architecture.
- General image generation experiments confirm the model's ability to render intricate textures, nuanced lighting, and precise semantic alignment, outperforming other unified models with significantly fewer parameters.
- Text-centric generation and editing evaluations demonstrate state-of-the-art capabilities in rendering legible multilingual text and accurately modifying specific text regions while preserving background integrity and visual aesthetics.
- Knowledge-informed generation and reasoning-based editing tests show that integrating explicit reasoning steps significantly enhances the model's ability to execute complex logical constraints, scientific concepts, and multi-step instructions.
- Qualitative results across all domains highlight the model's robust controllability, high visual fidelity, and effectiveness in handling diverse tasks ranging from humor-centric memes to specialized scientific diagrams.