HyperAIHyperAI

Command Palette

Search for a command to run...

Cheers: 패치 세부 사항과 의미 표현을 분리하여 통합된 멀티모달 이해 및 생성 가능하게 함

초록

최근 멀티모달 모델링 분야에서 시각적 이해와 생성을 단일 모델 내에서 통합하는 것이 최첨단 주제로 부상하고 있습니다. 그러나 두 작업은 불일치하는 디코딩 체계와 시각적 표현을 요구하므로, 공유된 특징 공간 내에서 동시 최적화를 수행하는 것은 쉽지 않습니다. 본 논문에서는 'Cheers'라는 통합 멀티모달 모델을 제시합니다. Cheers 는 패치 수준의 세부 정보를 의미적 표현과 분리함으로써 멀티모달 이해를 위한 의미 안정성을 확보하고, 게이트형 세부 잔차 (gated detail residuals) 를 통해 이미지 생성의 충실도를 향상시킵니다. Cheers 는 세 가지 핵심 구성 요소로 이루어져 있습니다: (i) 이미지 잠재 상태를 의미 토큰으로 인코딩 및 압축하여 대형 언어 모델 (LLM) 의 조건부 학습을 효율화하는 통합 비전 토크나이저, (ii) 텍스트 생성을 위한 자기회귀적 디코딩과 이미지 생성을 위한 확산 디코딩을 통합하는 LLM 기반 트랜스포머, (iii) 먼저 시각적 의미를 디코딩한 후 비전 토크나이저로부터 의미적으로 게이트 처리된 세부 잔차를 주입하여 고주파 성분을 정제하는 캐스케이드 흐름 매칭 헤드. 널리 사용되는 벤치마크에서의 실험 결과, Cheers 는 시각적 이해와 생성 모두에서 기존 최첨단 통합 멀티모달 모델 (UMMs) 과 동등하거나 더 우수한 성능을 보였습니다. 또한 Cheers 는 4 배의 토큰 압축률을 달성하여 고해상도 이미지 인코딩 및 생성의 효율성을 크게 향상시켰습니다. 특히, Cheers 는 GenEval 및 MMBench 같은 주요 벤치마크에서 Tar-1.5B 보다 우수한 성능을 보이면서도 학습 비용은 20% 만 소요하여, 효율적이고 효과적인 (즉, 4 배 토큰 압축을 통한) 통합 멀티모달 모델링의 가능성을 입증했습니다. 향후 연구를 위해 모든 코드와 데이터를 공개할 예정입니다.

One-sentence Summary

Researchers from Tsinghua University, Xi'an Jiaotong University, and the University of Chinese Academy of Sciences present CHEERS, a unified multimodal model that decouples patch-level details from semantic representations to stabilize understanding and enhance generation fidelity. This approach outperforms Tar-1.5B on key benchmarks while requiring only 20% of the training cost.

Key Contributions

  • CHEERS addresses the optimization conflict in unified multimodal models by decoupling patch-level details from semantic representations to stabilize understanding and improve generation fidelity.
  • The model introduces a hybrid architecture featuring a unified vision tokenizer for efficient token compression, an LLM-based Transformer for mixed autoregressive and diffusion decoding, and a cascaded flow matching head that injects gated detail residuals.
  • Extensive experiments show that CHEERS outperforms the Tar-1.5B model on GenEval and MMBench benchmarks while requiring only 20% of the training cost and achieving a 4x token compression rate.

Introduction

Unified multimodal models aim to integrate visual comprehension and high-fidelity image generation within a single architecture, yet they struggle because these tasks require fundamentally different decoding mechanisms and visual representations. Prior approaches often force a compromise by using a single shared feature space, which leads to optimization conflicts where semantic understanding suffers from quantization errors or generative fidelity loses high-frequency details. The authors leverage a novel decoupling strategy in CHEERS that separates patch-level details from semantic representations to resolve this tension. Their approach utilizes a unified vision tokenizer for efficient semantic compression and a cascaded flow matching head that first synthesizes global structure before injecting semantically gated detail residuals, achieving superior performance in both understanding and generation with significantly reduced training costs.

Dataset

  • Dataset Composition and Sources: The authors construct a multi-stage training pipeline using a diverse mix of image-caption pairs, multimodal samples, and pure text data sourced from LLaVA-UHD-v3, ImageNet, Infinity-MM, TextAtlas5M, BLIP-3o, DiffusionDB, Objects365, Nemotron-Cascade, Echo-4o-Image, MoviePosters, and ShareGPT-4o-Image.

  • Key Details for Each Subset:

    • Stage I (Vision-Language Alignment): Utilizes 4.5M image-caption pairs from LLaVA-UHD-v3 and 1.3M ImageNet samples re-annotated by Qwen2.5-VL-3B, with the ImageNet dataset repeated 10 times to establish generative capability.
    • Stage II (General Pre-Training): Optimizes on 30M multimodal samples with a 3:6:1 ratio of understanding, generation, and text data, incorporating captions from Infinity-MM and LLaVA-UHD-v3, generation data from BLIP-3o and synthetic FLUX.2 outputs, and text from LLaVA-UHD-v3.
    • Stage III (Refined Pre-Training): Expands to 33M samples maintaining the 3:6:1 ratio, combining LLaVA-UHD-v3 instruction data, synthetic images generated via FLUX.2-klein-9B using DiffusionDB and LLaVA-OneVision-1.5 prompts, 466K compositional reasoning instructions based on Objects365, and text from Nemotron-Cascade.
    • Stage IV (Supervised Fine-Tuning): Focuses on 3.8M curated samples including high-quality subsets from Stage III, Echo-4o-Image, MoviePosters, and ShareGPT-4o-Image.
  • Model Usage and Training Strategy: The authors employ a four-stage progressive training approach where image resolution is fixed at 512x512. Stage I trains only randomly initialized modules, Stage II and III optimize all parameters except the VAE, and Stage IV performs supervised fine-tuning with a 1:1 batch ratio between understanding and generation tasks.

  • Processing and Metadata Details: Synthetic data is generated using FLUX.2-klein-9B with prompts from DiffusionDB and LLaVA-OneVision-1.5 to enhance generation and reasoning capabilities. Specific subsets like Objects365 are used to synthesize 466K instructions targeting compositional reasoning skills such as counting, color recognition, and spatial understanding.

Method

The authors present the CHEERS framework, which is designed to unify multimodal understanding and image generation through a specialized architecture. The design philosophy prioritizes the separation of feature granularities to handle the trade-off between semantic reasoning and pixel-perfect reconstruction. Refer to the framework diagram below which contrasts CHEERS with prior approaches such as Janus, BAGEL, and RAE. Unlike methods that process latent features directly or rely on a single encoder, the proposed architecture employs a dedicated Semantic Encoder for high-level features while maintaining a separate path for low-level latent features, ensuring that fine-grained details are not discarded during the encoding process.

The core of the system relies on a Unified Vision Tokenizer, a Unified LLM-based Transformer, and a Cascaded Flow Matching Head. As shown in the figure below, the Unified Vision Tokenizer bridges latent representations and unified semantic visual embeddings. It comprises a VAE decoder and a semantic encoder, specifically SigLIP2-ViT. Given an input image X\mathbf{X}X, the VAE encoder yields latent states z1\mathbf{z}_1z1. To unify diverse tasks, a task-dependent latent zt\mathbf{z}_tzt is formulated as zt=tz1+(1t)z0\mathbf{z}_t = t\mathbf{z}_1 + (1 - t)\mathbf{z}_0zt=tz1+(1t)z0, where z0\mathbf{z}_0z0 represents latent noise. For visual understanding, ttt is fixed at 1, while for image generation, ttt is sampled from (0,1)(0, 1)(0,1). Instead of processing these latents directly, zt\mathbf{z}_tzt is passed through a VAE decoder to reconstruct the pixel-level image, which is then encoded by the ViT backbone to extract high-level semantic tokens zs(t)\mathbf{z}_s^{(t)}zs(t). This reconstruction step is crucial as it circumvents the loss of fine-grained features often associated with direct latent processing.

These semantic tokens are subsequently processed by the Unified LLM-based Transformer. The authors utilize an autoregressive backbone where semantic visual tokens and text embeddings are concatenated into a unified input sequence. A bidirectional attention mask is applied to the visual tokens to capture global context, whereas a causal mask is employed for text tokens to enable autoregressive decoding. Depending on the task, the LLM outputs are routed to different decoding paradigms. For image generation, the continuous visual hidden states are decoded via the Cascaded Flow Matching Head.

This head explicitly decouples high-frequency visual details from low-frequency semantic features. It consists of two cascaded stages. The first stage performs low-resolution semantic generation, followed by a PixelShuffle module that up-samples the feature maps. In the second stage, high-frequency patch details are introduced to update the decoded features. As shown in the figure below, the intensity of high-frequency injection dynamically increases as the denoising step progresses, mirroring the hierarchical nature of human drawing. The injection is controlled by a gating network G()G(\cdot)G(), updating the features according to the equation: Zs(t)G(Zs(t))S(D(zt))+Zs(t)\mathbf { Z } _ { s } ^ { \prime ( t ) } \gets G ( \mathbf { Z } _ { s } ^ { \prime ( t ) } ) \odot S ( D ( \mathbf { z } _ { t } ) ) + \mathbf { Z } _ { s } ^ { \prime ( t ) }Zs(t)G(Zs(t))S(D(zt))+Zs(t) where G(Zs(t))G(\mathbf{Z}_{s}^{'(t)})G(Zs(t)) denotes a scalar map and \odot represents element-wise multiplication.

The framework is trained end-to-end using a unified optimization objective. For text and understanding tasks, a standard cross-entropy loss is used. For image generation, a flow matching loss is applied to predict the velocity field Vt\mathbf{V}_tVt. During inference, the latent trajectory evolves from Gaussian noise z0\mathbf{z}_0z0 to the terminal state z1\mathbf{z}_1z1 via numerical integration of the predicted velocity field, allowing for continuous-time flow-based sampling.

Experiment

  • Comprehensive benchmarking on multimodal understanding and visual generation tasks validates that CHEERS achieves competitive performance across general, OCR, spatial, and knowledge-focused domains while demonstrating superior data efficiency.
  • Analysis of the training pipeline confirms that generation capabilities improve progressively, with synthetic instruction-oriented data driving significant gains in fidelity and alignment compared to real-world data.
  • Visualization of high-frequency injection reveals a dynamic, coarse-to-fine mechanism where high-frequency signals are sparsely used for initial contours but intensify in later stages to refine textures and local details.
  • Ablation studies demonstrate that joint training for understanding and generation does not compromise comprehension and that high-frequency injection is essential for producing sharp, detailed images.
  • Experiments on architectural design prove that reconstructing pixels before semantic encoding is necessary to preserve fine-grained visual details required for OCR tasks.
  • Evaluation of emergent abilities shows that the unified visual tokenizer enables strong generalization to untrained tasks like image editing and multi-image composition through shared feature representation.

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp