Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang

Abstract
Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
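To make the recipe concrete, the sketch below illustrates one RecA-style update as described in the abstract: the image is encoded by the model's own understanding encoder, the resulting embeddings replace the caption as conditioning, and the generator is updated with a self-supervised reconstruction loss. This is a minimal illustration under stated assumptions, not the paper's implementation: all module and function names are hypothetical placeholders, and the plain pixel-space MSE loss stands in for whichever generative objective (autoregressive, masked-autoregressive, or diffusion) a particular UMM actually uses.

```python
# Hypothetical sketch of one RecA-style post-training step (PyTorch-like).
# `understanding_encoder` and `umm_generator` are placeholder modules,
# not interfaces from the paper's released code.
import torch
import torch.nn.functional as F


def reca_step(image, understanding_encoder, umm_generator, optimizer):
    """One self-supervised reconstruction-alignment update."""
    # 1) Encode the input image with the UMM's own visual understanding
    #    encoder. The dense embeddings act as a "text prompt" that carries
    #    far more fine-grained detail than a written caption.
    with torch.no_grad():
        cond_embeddings = understanding_encoder(image)

    # 2) Condition the generator on those embeddings and reconstruct the
    #    original image. The real conditioning pathway depends on the UMM
    #    family (autoregressive, masked-autoregressive, or diffusion).
    reconstruction = umm_generator(condition=cond_embeddings)

    # 3) Self-supervised reconstruction loss: no captions are required.
    #    MSE is used here only as a stand-in generative objective.
    loss = F.mse_loss(reconstruction, image)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```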