
Mobile-O: Unified Multimodal Perception and Generation on Mobile Devices

Abstract

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, are data-hungry and too heavy to deploy on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to mobile devices. Its core module, the Mobile Conditioning Projector (MCP), couples vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This architecture enables efficient cross-modal conditioning at minimal computational cost. Mobile-O is trained on only a few million samples and then post-trained in a novel quadruplet format (generation prompt, image, question, answer), jointly improving both visual understanding and generation. Despite its efficiency, Mobile-O achieves competitive or superior performance compared to other unified models: it scores 74% on GenEval, surpassing Show-O and JanusFlow by 5% and 11%, respectively, while running 6x and 11x faster. For visual understanding, Mobile-O outperforms these models by 15.3% and 5.1% on average across seven benchmarks. On an iPhone, it needs only about 3 seconds per 512×512 image, making Mobile-O the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O facilitates future research on real-time, on-device unified multimodal intelligence without cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/.

One-sentence Summary

Researchers from MBZUAI, CMU, and Linköping University propose Mobile-O, a compact vision-language-diffusion model with a Mobile Conditioning Projector that enables real-time unified multimodal understanding and generation on mobile devices, outperforming prior models in speed and accuracy while requiring minimal data and no cloud dependency.

Key Contributions

  • Mobile-O introduces a compact vision-language-diffusion model optimized for edge devices, addressing the inefficiency and data hunger of existing unified multimodal models through a lightweight architecture and training on just a few million samples.
  • Its core innovation, the Mobile Conditioning Projector (MCP), uses depthwise-separable convolutions and layerwise alignment to fuse vision-language features with diffusion generation efficiently, while a novel quadruplet training format (prompt, image, question, answer) jointly improves understanding and generation.
  • Mobile-O achieves state-of-the-art results with 74% on GenEval, outperforming Show-O and JanusFlow by 5% and 11% respectively, and runs up to 11x faster, while also surpassing them by 15.3% and 5.1% on average across seven understanding benchmarks, all while generating 512x512 images in ~3s on an iPhone.

Introduction

The authors leverage unified multimodal models that combine visual understanding and image generation in a single architecture, aiming to bring this capability to edge devices like smartphones where compute and memory are limited. Prior models are too large and data-hungry, often requiring billions of parameters and hundreds of millions of training samples, making them impractical for on-device deployment. Mobile-O addresses this by introducing a lightweight Mobile Conditioning Projector that fuses vision-language features with a diffusion generator using depthwise-separable convolutions, enabling efficient cross-modal conditioning. It also introduces a novel quadruplet training format (prompt, image, question, answer) to jointly optimize understanding and generation with only 105k samples. As a result, Mobile-O achieves state-of-the-art performance on both tasks while running up to 11x faster than prior unified models and completing 512x512 image generation in ~3 seconds on an iPhone.

Method

The authors leverage a unified mobile-optimized architecture that jointly supports multimodal visual understanding and text-to-image generation. The framework integrates a vision-language encoder-decoder for understanding tasks with a diffusion-based image decoder for generation, connected via a lightweight Mobile Conditioning Projector (MCP) that enables efficient cross-modal alignment without introducing additional token sequences.

At the core, the visual understanding pathway begins with an image encoder and an autoregressive language model. Given an image and a question, the model generates an answer through an I2T (image-to-text) loss computed over the language model’s output tokens. For generation, a text prompt is processed by the same language model, whose hidden states are projected via the MCP into conditioning features for a DiT-style diffusion decoder. The generated image is then reconstructed via a VAE decoder, with training guided by a T2I (text-to-image) loss based on flow matching.

Refer to the framework diagram, which illustrates the dual pathways and their shared components. The Mobile Conditioning Projector, shown in detail on the right, fuses the final K layers of the VLM using temperature-scaled softmax weights, compresses the fused representation via a linear layer, and refines it using depthwise-separable 1D convolutions and lightweight channel attention. The output is projected to match the diffusion decoder’s conditioning dimension, enabling end-to-end alignment without query tokens.
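The MCP pipeline described above (layer fusion, compression, depthwise refinement, channel attention, output projection) can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: all weight shapes and names are assumptions, and the learnable depthwise kernel is simplified to a fixed same-padded averaging filter.

```python
import numpy as np

def mcp_forward(layer_feats, w_logits, tau, W_compress, W_out):
    """Illustrative sketch of the Mobile Conditioning Projector (MCP).

    layer_feats: (K, T, d_vlm) hidden states from the final K VLM layers
    w_logits:    (K,)          learnable fusion logits
    tau:         temperature for the softmax over layers
    W_compress:  (d_vlm, d_h)  linear compression
    W_out:       (d_h, d_cond) projection to the DiT conditioning dimension
    """
    # Temperature-scaled softmax weights over the K layers, then fuse.
    w = np.exp(w_logits / tau)
    w /= w.sum()
    fused = np.tensordot(w, layer_feats, axes=(0, 0))    # (T, d_vlm)

    # Linear compression to the hidden width d_h.
    h = fused @ W_compress                               # (T, d_h)

    # Depthwise 1D refinement along the token axis (simplified here to a
    # fixed kernel-3 averaging filter; the real block learns per-channel
    # kernels plus a pointwise conv).
    pad = np.pad(h, ((1, 1), (0, 0)))
    h = (pad[:-2] + pad[1:-1] + pad[2:]) / 3.0

    # Lightweight channel attention: sigmoid gate from per-channel means.
    gate = 1.0 / (1.0 + np.exp(-h.mean(axis=0)))
    h = h * gate

    # Project to the diffusion decoder's conditioning dimension.
    return h @ W_out                                     # (T, d_cond)
```

Note that the output has one conditioning vector per existing token, which is why no extra query tokens are introduced.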

The training proceeds in three stages. In Stage 1, the visual encoder and language model are frozen while the DiT and MCP are trained on large-scale image-text pairs to establish cross-modal alignment. Stage 2 performs supervised fine-tuning on curated prompt-image pairs to address domain-specific weaknesses. Stage 3 introduces a unified multimodal post-training phase, where quadruplet samples (generation prompt, image, question, answer) are used to jointly optimize both I2T and T2I objectives. This stage enables bidirectional learning: the same embedding layer and language model support both understanding and generation, with the MCP serving as the shared conditioning interface.

As shown in the training dynamics diagram, components are selectively frozen or updated across stages. During Stages 1 and 2, only the generation components (DiT and MCP) are learnable, while the understanding backbone remains frozen. In Stage 3, the unified components—including the MCP and parts of the language model—are fine-tuned via LoRA, enabling joint optimization without destabilizing the pretrained representations.
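The stage-wise freezing schedule can be sketched as a simple flag-toggling routine. Component names here are illustrative placeholders, not the paper's actual module names, and `requires_grad` mimics the PyTorch convention.

```python
# Hypothetical sketch of the stage-wise freezing schedule described above.
def trainable_components(stage):
    if stage in (1, 2):            # alignment + supervised fine-tuning:
        return {"dit", "mcp"}      # only the generation path is updated
    if stage == 3:                 # unified post-training:
        return {"dit", "mcp", "llm_lora"}  # LLM updated via LoRA adapters
    raise ValueError(f"unknown stage: {stage}")

def set_requires_grad(model_parts, stage):
    """model_parts: dict of name -> list of parameter-like objects that
    carry a mutable `requires_grad` attribute (as in PyTorch)."""
    active = trainable_components(stage)
    for name, params in model_parts.items():
        for p in params:
            p.requires_grad = name in active
```

For example, calling `set_requires_grad(parts, 1)` leaves the vision encoder and base LLM frozen while enabling gradients for the DiT and MCP.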

The unified loss function combines the I2T language modeling loss and the T2I flow-matching loss with learnable weights:

$$\mathcal{L}_{\mathrm{unified}} = \lambda_{\mathrm{lang}} \mathcal{L}_{\mathrm{lang}} + \lambda_{\mathrm{diff}} \mathcal{L}_{\mathrm{diff}}$$

The I2T loss is standard cross-entropy over answer tokens conditioned on image and question. The T2I loss minimizes the weighted MSE between the predicted velocity field and the ground-truth flow, enabling stable and efficient training compared to traditional noise prediction objectives. The MCP’s design ensures minimal computational overhead, with complexity dominated by $\mathcal{O}(k\,d_h + d_h^2)$ per token, avoiding expensive 2D convolutions or attention over new tokens.
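A minimal numpy sketch of the two objectives and their weighted combination follows. It assumes a straight-line (rectified) flow target `x1 - x0`; the paper's exact timestep schedule and weighting may differ, and all function names are illustrative.

```python
import numpy as np

def i2t_loss(logits, answer_ids):
    """Cross-entropy over answer tokens. logits: (T, V), answer_ids: (T,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(answer_ids)), answer_ids].mean()

def t2i_flow_matching_loss(v_pred, x0, x1, weights=None):
    """Flow matching: regress the predicted velocity toward the
    straight-line target x1 - x0 (noise -> data), with optional
    per-sample weights."""
    v_target = x1 - x0
    sq_err = ((v_pred - v_target) ** 2).mean(axis=tuple(range(1, x0.ndim)))
    if weights is not None:
        sq_err = weights * sq_err
    return sq_err.mean()

def unified_loss(l_lang, l_diff, lam_lang, lam_diff):
    # L_unified = lambda_lang * L_lang + lambda_diff * L_diff
    return lam_lang * l_lang + lam_diff * l_diff
```

A perfect velocity prediction (`v_pred = x1 - x0`) drives the T2I term to zero, which is what makes the velocity parameterization a stable regression target.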

Experiment

  • Mobile-O validates efficient unified multimodal understanding and generation on mobile devices, outperforming prior models like Show-O and Janus in both speed and quality while fitting under 2GB memory.
  • It achieves superior text-to-image generation and visual understanding via a shared Mobile Conditioning Projector and multi-task post-training, improving both capabilities without architectural changes.
  • Image editing is enabled with minimal fine-tuning data, preserving scene structure while applying localized edits, demonstrating emergent capability from unified design.
  • Ablation studies confirm the MCP’s effectiveness in cross-modal alignment and the value of joint quadruplet training for boosting both generation and understanding.
  • Scaling to larger backbones (Mobile-O-1.5B) shows the framework generalizes, enhancing both understanding and generation beyond component baselines.
  • Edge deployment on iPhone, Jetson, and MacBook proves real-time performance, with 3-second image generation and sub-0.5s text responses, enabling offline, privacy-preserving mobile AI.
  • Qualitative results highlight superior fine detail rendering, prompt fidelity, and robust OCR/scene understanding, even under challenging real-world conditions.
  • Limitations include constrained text representation due to shared lightweight LLM, trading expressiveness for deployability on resource-limited devices.

The authors evaluate different configurations of the Mobile Conditioning Projector (MCP) and find that using four layers with learnable fusion and a refinement block achieves the highest accuracy at 70.4% while keeping parameters under 2.4M. Results show that aggregating multiple VLM layers improves cross-modal alignment, and adding learnable weights and a refinement block further enhances performance without requiring excessive computational overhead.

The authors use a unified post-training strategy with quadruplet samples to jointly optimize both understanding and generation tasks. Results show that this approach improves average understanding accuracy by 1.6 percentage points and generation accuracy by 0.8 percentage points compared to using only image-text pairs or standard supervised fine-tuning. This demonstrates that multi-objective post-training enhances cross-modal alignment without requiring large-scale pre-training.

Mobile-O-0.5B achieves significantly faster inference speeds across all tested edge devices compared to prior unified models, with latency reductions of up to 46x for image generation on MacBook M2 Pro and under 3 seconds on iPhone 17 Pro. The model maintains efficient memory usage below 2GB while delivering real-time performance suitable for on-device deployment. Results confirm that architectural optimizations enable high-speed multimodal operations without compromising functional capabilities.

The authors evaluate Mobile-O-0.5B against other unified and generation-only models on text-to-image generation benchmarks, showing it achieves the highest overall score among models under 2B parameters. Results indicate Mobile-O-0.5B excels in generating images with accurate object counts, precise color attributes, and correct spatial positioning, outperforming comparable unified models like Show-O and JanusFlow. This performance is achieved while maintaining a compact parameter count and efficient inference, supporting its suitability for edge deployment.

The authors use a larger backbone configuration to scale Mobile-O to 1.5B parameters and evaluate its performance after supervised fine-tuning and post-training. Results show that post-training improves both understanding accuracy and generation quality, with the final model outperforming its individual component backbones. This confirms that the Mobile-O framework effectively enhances unified multimodal capabilities even when scaled to larger architectures.
