HyperAIHyperAI

Command Palette

Search for a command to run...

Forçage de modalité pour une génération spatiale évolutive

Bardienus Pieter Duisterhof Deva Ramanan Jeffrey Ichnowski Justin Johnson Keunhong Park

Résumé

Les modèles de texte-à-image (T2I) intègrent des prioris spatiaux riches. La synthèse de scènes photoréalistes et encombrées nécessite une compréhension de la géométrie, incluant la perspective et l’échelle relative. Les travaux antérieurs ont adapté les modèles T2I pour exploiter ce prior en vue de la prédiction de la profondeur, mais ils reposent sur des données de densité de profondeur denses et impliquent des recettes complexes. Nous proposons le Modality Forcing, une méthode post-entraînement simple et évolutive pour la génération conjointe d’images et de cartes de profondeur, entraînée sur des données de profondeur clairsemées au moyen d’un seul DiT. Le Modality Forcing permet la génération conditionnelle et conjointe d’images et de profondeur, dans n’importe quelle permutation, en attribuant des niveaux de bruit distincts par modalité. Les décodeurs spécifiques à chaque modalité nous permettent d’entraîner les modèles sur des données de profondeur réelles et clairsemées, tout en obtenant des performances de prédiction de profondeur généralisables. Nous démontrons en outre que le Modality Forcing hérite de l’évolutivité du pré-entraînement des T2I : en entraînant à partir de zéro un ensemble de modèles T2I (de 370 millions à 3,3 milliards de paramètres), nous constatons que les modèles plus grands, entraînés sur de plus grandes quantités de données image, produisent des estimations de profondeur plus précises. Notre modèle le plus performant est concurrentiel par rapport aux estimateurs de profondeur monoculaire de pointe et réduit la métrique AbsRel de 57 % par rapport aux modèles génératifs existants combinant image et profondeur. Ces résultats apportent une preuve solide que la génération d’images constitue un objectif de pré-entraînement évolutive pour la perception spatiale.

One-sentence Summary

The authors propose Modality Forcing, a scalable post-training recipe for conditional and joint image-depth generation using a single DiT trained on sparse depth data that assigns separate noise levels per modality, demonstrating through training T2I models from scratch across 370M to 3.3B parameters that larger models produce more accurate depth and the strongest model reduces AbsRel by 57% relative to existing joint image-depth generative models, providing strong evidence that image generation is a scalable pre-training objective for spatial perception.

Key Contributions

  • This work introduces Modality Forcing, a post-training recipe that unifies monocular depth estimation, depth-to-image, and joint image-depth generation within a single DiT model. The method enables conditional and joint generation in any permutation by assigning separate noise levels per modality and utilizing per-modality decoders trained on sparse depth data.
  • A controlled scaling study reveals that depth prediction accuracy increases as T2I model parameters grow from 370M to 3.3B and training data expands to 1.92B images. These findings provide evidence that image generation serves as a scalable pre-training objective for spatial perception.
  • The strongest model competes with state-of-the-art monocular depth estimators and reduces AbsRel error by 57% relative to existing joint image-depth generative models. Performance benchmarks on FLUX.2-klein-9B demonstrate significant improvements over prior baselines without requiring dense supervision.

Introduction

Text-to-image models hold rich spatial priors for synthesizing photorealistic scenes, but adapting them for geometry tasks remains difficult. Prior approaches often rely on complex adapters or dense synthetic depth data, limiting scalability and excluding sparse real-world annotations. The authors introduce Modality Forcing, a streamlined post-training recipe that unifies image and depth generation within a single Diffusion Transformer. By assigning separate noise levels to each modality, the method enables flexible conditional and joint generation using sparse data. Their controlled scaling study further reveals that depth prediction accuracy improves with larger T2I models, confirming image generation as a scalable objective for spatial perception.

Method

The authors introduce Modality Forcing, a framework designed to unify joint RGB and depth generation, image-to-depth, and depth-to-image tasks within a single model. The core methodology involves post-training a pretrained text-to-image Diffusion Transformer (DiT) to model the joint distribution pθ(x,dc)p_{\theta}(\mathbf{x}, \mathbf{d} \mid \mathbf{c})pθ(x,dc). This approach assigns independent noise levels to each modality, allowing the model to support various generation permutations by fixing specific noise levels during inference.

Refer to the framework diagram for a visual overview of the training and inference pipeline:

The architecture processes three distinct input streams. Text prompts are encoded into text tokens via a frozen text embedder. For the visual modalities, the model accepts noisy latents for the RGB stream and noisy depth maps for the depth stream. A key design choice is the tokenization strategy. The RGB stream utilizes a pretrained VAE latent space, where noisy latents are projected into image tokens via an MLP. In contrast, the depth stream operates directly in pixel space to accommodate sparse real-world annotations. Noisy depth maps are tokenized by a dedicated depth tokenizer, and missing pixels are filled with isotropic Gaussian noise to signal unavailability.

These token streams are concatenated and fed into the DiT backbone. To handle the independent noise levels, the model employs Adaptive Layer Normalization (AdaLN) with per-modality timestep conditioning. Separate timestep embedders are used for the RGB and depth streams. The RGB stream reuses the pretrained timestep embedder, while the depth stream uses a freshly initialized one. Furthermore, a lightweight cross-stream mixing module allows each stream to observe the other modality's timestep, enabling the model to learn the coupling between RGB and depth noise schedules.

The output heads are also modality-specific. The RGB branch uses an MLP to predict denoised latents, which are then decoded into the final image by the frozen VAE decoder. The depth branch utilizes a depth detokenizer consisting of self-attention layers and a final linear projection to map depth tokens back to pixel space.

Training is conducted by sampling per-modality noise levels trgbt_{rgb}trgb and tdeptht_{depth}tdepth from [0,1][0, 1][0,1]. For joint generation, both are sampled freely. For image-to-depth, trgbt_{rgb}trgb is fixed at 0 while tdeptht_{depth}tdepth is sampled. Conversely, for depth-to-image, tdeptht_{depth}tdepth is fixed at 0. To prevent catastrophic forgetting of the rich priors learned during the initial text-to-image pretraining, the authors employ a self-distillation loss. This loss penalizes the student model for drifting from the original frozen T2I model's predicted velocity, with the penalty strength weighted based on the depth noise level to account for the informational value of the depth condition.

Experiment

This work evaluates Modality Forcing by training T2I models from scratch and applying the technique to FLUX.2-klein-9B to benchmark performance against specialist depth models. Controlled scaling experiments validate that depth generation quality improves reliably with increased T2I model capacity and pre-training data, highlighting the transfer of spatial priors from image generation. Qualitative comparisons demonstrate that the method produces robust depth maps and consistent point clouds that outperform existing joint generators while remaining competitive with top-tier depth estimation models.

The authors evaluate their Modality Forcing method against discriminative, generative, and joint depth estimation models across five benchmarks. The results demonstrate that their approach achieves state-of-the-art performance on multiple datasets, outperforming existing joint and generative baselines while remaining competitive with specialized discriminative models. The proposed method achieves the best results on NYUv2, ETH3D, and ScanNet benchmarks. It significantly outperforms other joint image-depth generation models and generative depth estimators. Performance is competitive with top discriminative models like MoGe-2, though slightly lower on KITTI and DIODE.

The the the table evaluates depth-to-image generation capabilities on the OpenImages 6K dataset, comparing the proposed method against existing baselines. The results indicate that the proposed approach achieves the highest image quality with the lowest FID score among all methods. While it outperforms others in image fidelity, it shows slightly lower depth consistency compared to the strongest baseline, JointDiT. The proposed method achieves the best image quality scores, outperforming all baselines in FID. JointDiT demonstrates superior depth consistency, achieving the lowest error rate in absolute relative depth estimation. The proposed method matches the depth accuracy of UniCon while providing significantly better image generation quality.

The authors perform a scaling experiment using a suite of T2I models to investigate whether depth generation quality improves with model size. The the the table shows that network depth is kept constant while token sizes and feed-forward dimensions increase alongside the parameter count. Results demonstrate that depth performance scales reliably with these increases in model capability and training data. The study compares models across a range of parameter counts to analyze scaling trends. Architectural depth remains fixed while internal dimensions like token size and FFN dimensions expand. Larger models in the suite correspond to increased token sizes and feed-forward dimensions.

The authors evaluate their Modality Forcing method across five benchmarks, demonstrating state-of-the-art depth estimation performance that outperforms joint and generative baselines while remaining competitive with specialized discriminative models. Additional experiments on depth-to-image generation on the OpenImages dataset reveal superior image fidelity, and scaling studies confirm that depth generation quality reliably improves with increased model parameters and training data. Although depth consistency is slightly lower than the strongest baseline, the method effectively balances high-quality image synthesis with accurate depth estimation.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp