HyperAIHyperAI

Command Palette

Search for a command to run...

Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Abstract

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.

One-sentence Summary

Researchers from Tsinghua University, Xi'an Jiaotong University, and the University of Chinese Academy of Sciences present CHEERS, a unified multimodal model that decouples patch-level details from semantic representations to stabilize understanding and enhance generation fidelity. This approach outperforms Tar-1.5B on key benchmarks while requiring only 20% of the training cost.

Key Contributions

  • CHEERS addresses the optimization conflict in unified multimodal models by decoupling patch-level details from semantic representations to stabilize understanding and improve generation fidelity.
  • The model introduces a hybrid architecture featuring a unified vision tokenizer for efficient token compression, an LLM-based Transformer for mixed autoregressive and diffusion decoding, and a cascaded flow matching head that injects gated detail residuals.
  • Extensive experiments show that CHEERS outperforms the Tar-1.5B model on GenEval and MMBench benchmarks while requiring only 20% of the training cost and achieving a 4x token compression rate.

Introduction

Unified multimodal models aim to integrate visual comprehension and high-fidelity image generation within a single architecture, yet they struggle because these tasks require fundamentally different decoding mechanisms and visual representations. Prior approaches often force a compromise by using a single shared feature space, which leads to optimization conflicts where semantic understanding suffers from quantization errors or generative fidelity loses high-frequency details. The authors leverage a novel decoupling strategy in CHEERS that separates patch-level details from semantic representations to resolve this tension. Their approach utilizes a unified vision tokenizer for efficient semantic compression and a cascaded flow matching head that first synthesizes global structure before injecting semantically gated detail residuals, achieving superior performance in both understanding and generation with significantly reduced training costs.

Dataset

  • Dataset Composition and Sources: The authors construct a multi-stage training pipeline using a diverse mix of image-caption pairs, multimodal samples, and pure text data sourced from LLaVA-UHD-v3, ImageNet, Infinity-MM, TextAtlas5M, BLIP-3o, DiffusionDB, Objects365, Nemotron-Cascade, Echo-4o-Image, MoviePosters, and ShareGPT-4o-Image.

  • Key Details for Each Subset:

    • Stage I (Vision-Language Alignment): Utilizes 4.5M image-caption pairs from LLaVA-UHD-v3 and 1.3M ImageNet samples re-annotated by Qwen2.5-VL-3B, with the ImageNet dataset repeated 10 times to establish generative capability.
    • Stage II (General Pre-Training): Optimizes on 30M multimodal samples with a 3:6:1 ratio of understanding, generation, and text data, incorporating captions from Infinity-MM and LLaVA-UHD-v3, generation data from BLIP-3o and synthetic FLUX.2 outputs, and text from LLaVA-UHD-v3.
    • Stage III (Refined Pre-Training): Expands to 33M samples maintaining the 3:6:1 ratio, combining LLaVA-UHD-v3 instruction data, synthetic images generated via FLUX.2-klein-9B using DiffusionDB and LLaVA-OneVision-1.5 prompts, 466K compositional reasoning instructions based on Objects365, and text from Nemotron-Cascade.
    • Stage IV (Supervised Fine-Tuning): Focuses on 3.8M curated samples including high-quality subsets from Stage III, Echo-4o-Image, MoviePosters, and ShareGPT-4o-Image.
  • Model Usage and Training Strategy: The authors employ a four-stage progressive training approach where image resolution is fixed at 512x512. Stage I trains only randomly initialized modules, Stage II and III optimize all parameters except the VAE, and Stage IV performs supervised fine-tuning with a 1:1 batch ratio between understanding and generation tasks.

  • Processing and Metadata Details: Synthetic data is generated using FLUX.2-klein-9B with prompts from DiffusionDB and LLaVA-OneVision-1.5 to enhance generation and reasoning capabilities. Specific subsets like Objects365 are used to synthesize 466K instructions targeting compositional reasoning skills such as counting, color recognition, and spatial understanding.

Method

The authors present the CHEERS framework, which is designed to unify multimodal understanding and image generation through a specialized architecture. The design philosophy prioritizes the separation of feature granularities to handle the trade-off between semantic reasoning and pixel-perfect reconstruction. Refer to the framework diagram below which contrasts CHEERS with prior approaches such as Janus, BAGEL, and RAE. Unlike methods that process latent features directly or rely on a single encoder, the proposed architecture employs a dedicated Semantic Encoder for high-level features while maintaining a separate path for low-level latent features, ensuring that fine-grained details are not discarded during the encoding process.

The core of the system relies on a Unified Vision Tokenizer, a Unified LLM-based Transformer, and a Cascaded Flow Matching Head. As shown in the figure below, the Unified Vision Tokenizer bridges latent representations and unified semantic visual embeddings. It comprises a VAE decoder and a semantic encoder, specifically SigLIP2-ViT. Given an input image X\mathbf{X}X, the VAE encoder yields latent states z1\mathbf{z}_1z1. To unify diverse tasks, a task-dependent latent zt\mathbf{z}_tzt is formulated as zt=tz1+(1t)z0\mathbf{z}_t = t\mathbf{z}_1 + (1 - t)\mathbf{z}_0zt=tz1+(1t)z0, where z0\mathbf{z}_0z0 represents latent noise. For visual understanding, ttt is fixed at 1, while for image generation, ttt is sampled from (0,1)(0, 1)(0,1). Instead of processing these latents directly, zt\mathbf{z}_tzt is passed through a VAE decoder to reconstruct the pixel-level image, which is then encoded by the ViT backbone to extract high-level semantic tokens zs(t)\mathbf{z}_s^{(t)}zs(t). This reconstruction step is crucial as it circumvents the loss of fine-grained features often associated with direct latent processing.

These semantic tokens are subsequently processed by the Unified LLM-based Transformer. The authors utilize an autoregressive backbone where semantic visual tokens and text embeddings are concatenated into a unified input sequence. A bidirectional attention mask is applied to the visual tokens to capture global context, whereas a causal mask is employed for text tokens to enable autoregressive decoding. Depending on the task, the LLM outputs are routed to different decoding paradigms. For image generation, the continuous visual hidden states are decoded via the Cascaded Flow Matching Head.

This head explicitly decouples high-frequency visual details from low-frequency semantic features. It consists of two cascaded stages. The first stage performs low-resolution semantic generation, followed by a PixelShuffle module that up-samples the feature maps. In the second stage, high-frequency patch details are introduced to update the decoded features. As shown in the figure below, the intensity of high-frequency injection dynamically increases as the denoising step progresses, mirroring the hierarchical nature of human drawing. The injection is controlled by a gating network G()G(\cdot)G(), updating the features according to the equation: Zs(t)G(Zs(t))S(D(zt))+Zs(t)\mathbf { Z } _ { s } ^ { \prime ( t ) } \gets G ( \mathbf { Z } _ { s } ^ { \prime ( t ) } ) \odot S ( D ( \mathbf { z } _ { t } ) ) + \mathbf { Z } _ { s } ^ { \prime ( t ) }Zs(t)G(Zs(t))S(D(zt))+Zs(t) where G(Zs(t))G(\mathbf{Z}_{s}^{'(t)})G(Zs(t)) denotes a scalar map and \odot represents element-wise multiplication.

The framework is trained end-to-end using a unified optimization objective. For text and understanding tasks, a standard cross-entropy loss is used. For image generation, a flow matching loss is applied to predict the velocity field Vt\mathbf{V}_tVt. During inference, the latent trajectory evolves from Gaussian noise z0\mathbf{z}_0z0 to the terminal state z1\mathbf{z}_1z1 via numerical integration of the predicted velocity field, allowing for continuous-time flow-based sampling.

Experiment

  • Comprehensive benchmarking on multimodal understanding and visual generation tasks validates that CHEERS achieves competitive performance across general, OCR, spatial, and knowledge-focused domains while demonstrating superior data efficiency.
  • Analysis of the training pipeline confirms that generation capabilities improve progressively, with synthetic instruction-oriented data driving significant gains in fidelity and alignment compared to real-world data.
  • Visualization of high-frequency injection reveals a dynamic, coarse-to-fine mechanism where high-frequency signals are sparsely used for initial contours but intensify in later stages to refine textures and local details.
  • Ablation studies demonstrate that joint training for understanding and generation does not compromise comprehension and that high-frequency injection is essential for producing sharp, detailed images.
  • Experiments on architectural design prove that reconstructing pixels before semantic encoding is necessary to preserve fine-grained visual details required for OCR tasks.
  • Evaluation of emergent abilities shows that the unified visual tokenizer enables strong generalization to untrained tasks like image editing and multi-image composition through shared feature representation.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing

HyperAI Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp