HyperAIHyperAI

Command Palette

Search for a command to run...

Repräsentationszwang für Bottleneck-freie einheitliche multimodale Modelle

Zusammenfassung

Unified multimodale Modelle (UMMs) zielen darauf ab, Wahrnehmung und Generierung in einem einzigen Modell zu verarbeiten. Dennoch stützen sich bestehende UMMs nach wie vor auf ein eingefrorenes, separat vortrainiertes VAE für die Bildgenerierung, wodurch ein struktureller Flaschenhals entsteht. Ein naives Entfernen dieses Moduls führt zu einer Qualitätslücke, da das Modell sowohl hochrangige Strukturen als auch niedrigstufige Details direkt aus den Rohpixeln erlernen muss. In dieser Arbeit schlagen wir Representation Forcing (RF) vor, eine Technik, die diese Lücke schließt, indem sie die Vorhersage von Repräsentationen zu einer nativen Fähigkeit des Modells macht. Konkret zwingt RF den Decoder, visuelle Repräsentationen autoregressiv als intermediate tokens vor den Pixeln vorherzusagen; diese tokens verbleiben anschließend im Kontext, um die Pixel-Diffusion innerhalb desselben Backbones zu steuern. Durch die Umwandlung von Repräsentationen aus Wahrnehmungsausgaben in Generierungsziele eliminiert RF die Notwendigkeit eines externen generativen latenten Raums. Unsere Ergebnisse zeigen, dass RF sowohl dem Verständnis als auch der Generierung zugutekommt. Im Bereich der Bildgenerierung entspricht unser im Pixelraum arbeitendes Modell mit RF dem Stand der Technik bei VAE-basierten vereinheitlichten Modellen. Im Bereich des Bildverständnisses übertrifft RF im Pixelraum allgemein seine VAE-basierte Variante. Zusammen bieten diese Ergebnisse einen wirksamen Schritt hin zu end-to-end, flaschenhalsfreien UMMs.

One-sentence Summary

Representation Forcing eliminates structural bottlenecks in unified multimodal models by autoregressively predicting visual representations as intermediate tokens to guide pixel diffusion within a single backbone, yielding a pixel-space architecture that matches state-of-the-art VAE-based unified models in generation while generally outperforming them in image understanding.

Key Contributions

  • This work introduces Representation Forcing, a technique that eliminates reliance on frozen external variational autoencoders by training the decoder to autoregressively predict visual representations as intermediate tokens. These tokens remain in the sequence to provide structural guidance for pixel-space diffusion within a shared transformer backbone.
  • Controlled experiments demonstrate that the pixel-space model matches state-of-the-art VAE-based unified models on standard text-to-image generation benchmarks while preserving fine textures. The approach also outperforms its VAE-based counterpart on image understanding tasks, indicating superior compatibility with joint multimodal modeling.
  • By repurposing perception outputs as generation targets, the method establishes a single end-to-end representation space that learns both high-level structure and low-level details directly from raw pixels. This design advances fully unified multimodal modeling by enabling native capabilities without relying on independently pretrained latent spaces.

Introduction

Unified multimodal models aim to handle both visual perception and image synthesis within a single architecture, a foundational requirement for building general-purpose multimodal intelligence. Prior systems, however, depend on a frozen, separately pretrained VAE to compress images into latents before applying diffusion, which creates a structural bottleneck that caps generation quality and breaks end-to-end training. Removing this external component forces the network to simultaneously learn high-level semantics and low-level textures from raw pixels, causing a severe quality drop. The authors leverage Representation Forcing to solve this by training the decoder to autoregressively predict visual representations as intermediate tokens. These tokens stay in the attention context to guide pixel-space diffusion, successfully eliminating the VAE dependency while boosting both generation fidelity and multimodal understanding.

Method

The authors leverage a unified multimodal model architecture designed to enable end-to-end training of pixel-space generation without reliance on external latent spaces. The core framework, referred to as Representation Forcing, operates within a shared transformer backbone that processes a unified sequence of text tokens, representation tokens, and pixel patches. This sequence is structured to support both understanding and generation tasks, with the model jointly trained to predict text, generate visual representations, and render images. The architecture is built upon the Mixture-of-Transformers (MoT) design, where all tokens share the same self-attention layers but are routed to modality-specific feed-forward experts based on their type. These experts are dedicated to multimodal understanding, representation token prediction, and pixel generation, allowing the model to specialize its processing while maintaining a common attention mechanism.

The framework integrates an image encoder, specifically DINOv3 ViT-H+/16 with NaViT-style variable-resolution support, which is jointly trained with the rest of the model. During training, the encoder extracts continuous visual features from the input image, which are then discretized into a sequence of representation tokens via online vector quantization. This process uses an exponential moving average (EMA) copy of the encoder to stabilize the feature representations, ensuring that the codebook updates adapt gradually to evolving features. The codebook consists of K=16,384K=16,384K=16,384 learnable prototype embeddings, and each patch-level feature is assigned to the nearest prototype based on cosine similarity. This discretization produces a sequence of representation tokens that mirror the spatial layout of the image and are embedded with learnable 2D spatial and identity embeddings. The resulting representation tokens are integrated into the sequence as in-context conditioning for pixel-space generation.

The model is trained end-to-end with a combined objective that includes three components: a language modeling loss LLM\mathcal{L}_{\mathrm{LM}}LLM for text prediction, a representation prediction loss LRep\mathcal{L}_{\mathrm{Rep}}LRep for predicting the discretized visual representations, and a flow-matching loss LFM\mathcal{L}_{\mathrm{FM}}LFM for pixel generation. The flow-matching loss is formulated using xxx-prediction with velocity loss, where the decoder predicts the clean image patches from noisy inputs at time ttt, given by zt=tx+(1t)ϵ\mathbf{z}_t = t \mathbf{x} + (1-t) \boldsymbol{\epsilon}zt=tx+(1t)ϵ. The loss is defined as LFM=Evθv2\mathcal{L}_{\mathrm{FM}} = \mathbb{E} \| \mathbf{v}_{\theta} - \mathbf{v} \|^2LFM=Evθv2, with v=xϵ\mathbf{v} = \mathbf{x} - \boldsymbol{\epsilon}v=xϵ and vθ=(xθzt)/(1t)\mathbf{v}_{\theta} = (\mathbf{x}_{\theta} - \mathbf{z}_t)/(1-t)vθ=(xθzt)/(1t). During training, the representation tokens are provided as ground-truth targets, while at inference, the decoder generates them autoregressively from the text prompt alone. These predicted representation tokens remain in the sequence and act as in-context conditioning for the pixel-space diffusion process. The attention mechanism ensures that the high-level structure encoded in the representation tokens influences pixel generation through standard self-attention, without requiring additional cross-attention modules. The training pipeline supports classifier-free guidance by independently dropping the text conditioning and representation token sequence with a probability of 0.1 during training.

Experiment

The evaluation compares a novel pixel-space model enhanced with Representation Forcing against established VAE-based and unified multimodal architectures across standard text-to-image and visual understanding benchmarks. Generation tests demonstrate that this approach successfully bridges the quality gap between pixel-space and VAE-based methods, achieving competitive performance in compositional generation and dense prompt following. Understanding evaluations reveal that Representation Forcing consistently boosts high-level visual comprehension, with the pixel-space variant ultimately outperforming VAE-based counterparts by eliminating latent bottlenecks and enabling tighter alignment between multimodal representations. Collectively, these results validate the effectiveness of end-to-end pixel-space unified models in delivering both strong generative quality and robust visual understanding.

The authors evaluate a pixel-space model on text-to-image benchmarks, comparing its performance against existing models that rely on pretrained generative components. Results show that the model achieves competitive generation quality and improves understanding across multiple benchmarks, with greater gains observed in pixel-space settings compared to VAE-based approaches. Representation Forcing improves both generation and understanding performance across different model architectures. Pixel-space models with Representation Forcing outperform VAE-based models on most visual understanding benchmarks. Representation Forcing leads to significant improvements in high-level visual understanding tasks but shows minor reductions in document-oriented comprehension.

The authors evaluate their pixel-space model on text-to-image benchmarks and compare its performance with existing models. Results show that the model achieves competitive scores on GenEval, with improvements when using a language model rewriter, and demonstrates strong performance on image understanding tasks, particularly in general visual understanding, where it outperforms VAE-based approaches. The model achieves comparable or better performance than state-of-the-art unified models on compositional generation tasks. Representation Forcing improves understanding across multiple benchmarks, especially in general visual understanding. Pixel-space generation with Representation Forcing outperforms VAE-based generation on most evaluation tasks, indicating benefits from a unified representation space.

The authors evaluate the impact of Representation Forcing (RF) on image understanding performance across different generation pathways, comparing VAE-based and pixel-space models with and without RF. Results show that RF improves understanding in both settings, with greater gains on general visual understanding tasks and consistent improvements in pixel-space models over VAE-based ones. The authors attribute the superior performance of pixel-space models to tighter alignment between understanding and generation due to the absence of an external VAE latent space. RF improves understanding performance across both VAE-based and pixel-space generation pathways, with larger gains on general visual understanding tasks. Pixel-space models with RF outperform VAE-based models on most benchmarks, indicating better alignment between understanding and generation. Improvements are concentrated on tasks requiring high-level visual comprehension, while performance on document-oriented tasks shows only minor reductions.

The authors evaluate their pixel-space model, RF-Pixel, on text-to-image benchmarks and compare it to existing generation-only and unified multimodal models. Results show that RF-Pixel achieves competitive performance on compositional generation and dense prompt following, matching or exceeding state-of-the-art models without relying on separately pretrained generative components. The model also demonstrates strong understanding capabilities across various visual tasks, particularly benefiting from representation forcing in pixel-space generation. RF-Pixel achieves competitive performance on compositional generation and dense prompt following, matching state-of-the-art unified models without using pretrained generative modules. Representation Forcing consistently improves understanding across benchmarks, with larger gains observed in pixel-space generation compared to VAE-based generation. Pixel-space generation with Representation Forcing outperforms VAE-based generation on most benchmarks, especially in general visual understanding tasks.

The authors evaluate a pixel-space model with Representation Forcing (RF) on image generation and understanding tasks, comparing it against VAE-based and pixel-space models with and without RF. Results show that RF improves understanding performance across both pathways, with greater gains in pixel-space models, particularly on general visual understanding benchmarks, while slightly reducing performance on document-oriented tasks. The pixel-space model with RF outperforms the VAE-based model with RF on most benchmarks, indicating tighter integration between understanding and generation in a unified representation space. RF improves understanding performance on both VAE-based and pixel-space models, with larger gains on general visual understanding benchmarks. Pixel-space models with RF outperform VAE-based models with RF on six out of eight benchmarks, especially in general visual understanding. RF leads to slight reductions in performance on document and diagram understanding tasks, suggesting limitations in text and layout comprehension.

The experiments evaluate a pixel-space model enhanced with Representation Forcing on text-to-image generation and visual understanding benchmarks, validating its performance against VAE-based architectures and pretrained unified models. Results demonstrate that Representation Forcing consistently improves visual comprehension across both generation pathways, with pixel-space implementations yielding superior understanding due to tighter alignment between generation and representation. While the model achieves competitive generation quality without relying on pretrained components, it shows minor trade-offs in document-oriented and layout comprehension tasks. Overall, the findings confirm that unifying generation and understanding within a single pixel-space representation effectively bridges visual reasoning and compositional capabilities.


KI mit KI entwickeln

Von der Idee bis zum Launch – beschleunigen Sie Ihre KI-Entwicklung mit kostenlosem KI-Co-Coding, sofort einsatzbereiter Umgebung und bestem GPU-Preis.

KI-gestütztes kollaboratives Programmieren
Sofort einsatzbereite GPUs
Die besten Preise

HyperAI Newsletters

Abonnieren Sie unsere neuesten Updates
Wir werden die neuesten Updates der Woche in Ihren Posteingang liefern um neun Uhr jeden Montagmorgen
Unterstützt von MailChimp