
3DreamBooth: A 3D Model for High-Fidelity Subject-Driven Video Generation

Hyun-kyu Ko Jihyeon Park Younghyun Kim Dongheok Park Eunbyung Park

Abstract

Generating dynamic, view-consistent videos of personalized subjects is in high demand for a range of emerging applications, including immersive VR/AR experiences, virtual production, and next-generation e-commerce. Despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity via visual features from single images or textual prompts. Since real-world subjects are inherently three-dimensional, applying these 2D-centric approaches to 3D object customization exposes a fundamental limitation: they lack the comprehensive spatial priors required to reconstruct 3D geometry. Consequently, when synthesizing novel views, they must generate plausible but arbitrary details for unseen regions instead of preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one could attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To address these issues, we present a novel framework for 3D-aware video customization consisting of 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively embeds a robust 3D prior into the model without requiring exhaustive video-based training. To improve fine-grained textures and accelerate convergence, we integrate 3Dapter, a visual conditioning module.
After single-image pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetric conditioning strategy. This design enables the module to act as a dynamic selective router that queries view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/

One-sentence Summary

Researchers from Yonsei University and Sungkyunkwan University propose 3DreamBooth and 3Dapter, a framework that decouples spatial geometry from temporal motion via 1-frame optimization to generate high-fidelity, view-consistent videos of customized 3D subjects for immersive VR and virtual production.

Key Contributions

  • The paper introduces 3DreamBooth, a 1-frame optimization strategy that integrates subject-specific 3D identity into video diffusion models by restricting updates to spatial representations, thereby avoiding the need for multi-view video datasets while preventing temporal overfitting.
  • A multi-view conditioning module called 3Dapter is presented to enhance fine-grained textures and accelerate convergence through a two-stage pipeline that employs an asymmetrical conditioning strategy to query view-specific geometric hints from a minimal reference set.
  • The work establishes 3D-CustomBench, a curated evaluation suite for 3D-consistent video customization, with experiments demonstrating that the combined framework outperforms existing single-reference baselines in generating high-fidelity, identity-preserving videos.

Introduction

The demand for immersive VR/AR experiences and virtual production requires generative systems that can place customized subjects into dynamic environments while maintaining strict visual consistency across different viewpoints. Current subject-driven video generation methods largely treat objects as 2D entities, relying on single-view references or text prompts that fail to capture the underlying 3D geometry. This limitation forces models to hallucinate arbitrary details for unseen angles rather than preserving true spatial identity, while attempts to train on limited multi-view video data often result in temporal overfitting. To resolve these issues, the authors introduce 3DreamBooth, a framework that decouples spatial geometry from temporal motion through a 1-frame optimization paradigm to bake robust 3D priors without exhaustive video training. They further enhance this approach with 3Dapter, a multi-view conditioning module that uses an asymmetrical strategy to inject fine-grained geometric hints, enabling high-fidelity and view-consistent video generation with improved computational efficiency.

Dataset

  • Dataset Composition and Sources: The authors utilize two primary datasets: the Subjects200K dataset for single-view pre-training and the newly introduced 3D-CustomBench for evaluation. 3D-CustomBench combines objects from the MVIgNet dataset with custom-captured 3D objects to ensure complete 360° orbital coverage.

  • Key Details for Each Subset:

    • Subjects200K: Contains over 30,000 distinct subject descriptions generated by GPT-4o, which are used to synthesize paired images via the FLUX.1 model. These pairs share identical subject identities but feature varied poses, lighting, and backgrounds to prevent overfitting.
    • 3D-CustomBench: Comprises 30 distinct objects selected for complex 3D structures, non-trivial topologies, and high texture resolution. Each object includes a full multi-view sequence of approximately 30 images.
  • Data Usage and Training Strategy:

    • Pre-training: The model leverages Subjects200K to train 3Dapter using Low-Rank Adaptation (LoRA) on key image-processing modules. Training runs for 100,000 iterations with a global batch size of 4 and a learning rate of 1 × 10⁻⁴ using the AdamW optimizer.
    • Evaluation: For 3D-CustomBench, the full multi-view sequence of each object is used for 3DreamBooth optimization. The authors sample N_c = 4 conditioning views for 3Dapter that maximize angular coverage while minimizing visual overlap.
  • Processing and Metadata Construction:

    • Automated Curation: GPT-4o generates subject descriptions and acts as an automated evaluator to discard misaligned samples during the construction of Subjects200K.
    • Prompt Generation: GPT-4o automatically creates one challenging validation prompt per object in 3D-CustomBench, featuring diverse backgrounds and complex dynamics like human-object interactions.
    • View Selection: Conditioning views are strategically selected to ensure optimal angular distribution rather than random sampling.
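
The coverage-maximizing view selection described above can be sketched as a greedy farthest-point pick over camera azimuths. This is an illustrative reconstruction, not the authors' exact procedure; the angle-based heuristic is an assumption.

```python
import numpy as np

def select_conditioning_views(azimuths_deg, n_views=4):
    """Greedily pick conditioning views that maximize angular coverage.

    Hypothetical sketch: start from the view nearest 0 degrees, then
    repeatedly add the view farthest (in circular angle) from all views
    chosen so far, instead of sampling at random.
    """
    az = np.asarray(azimuths_deg, dtype=float) % 360.0
    chosen = [int(np.argmin(az))]
    for _ in range(n_views - 1):
        # circular distance from every candidate to its nearest chosen view
        d = np.abs(az[:, None] - az[chosen][None, :])
        d = np.minimum(d, 360.0 - d).min(axis=1)
        chosen.append(int(np.argmax(d)))
    return sorted(chosen)
```

On a 30-image orbit (12° spacing), this yields four views spread roughly 90° apart, matching the stated goal of maximal angular coverage with minimal overlap.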

Method

The authors introduce 3DreamBooth, a framework for 3D customization of video diffusion models. The core strategy involves decoupling spatial identity from temporal dynamics through a 1-frame training paradigm.

Refer to the framework diagram for the overall architecture. The pipeline accepts a set of multi-view images where one image serves as the target and a subset acts as reference conditions. A global prompt with a unique identifier V guides the generation. The text and noisy target latents are processed through the main branch using 3DB LoRA, while reference latents pass through a shared 3Dapter. These features are concatenated for Multi-view Joint Attention. By restricting the input to a single frame (T = 1), the model bypasses temporal attention, focusing updates on spatial representations to learn a 3D prior without temporal overfitting.
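
The concatenation step above can be sketched as follows. Shapes and the `attn` interface are assumptions for illustration; the paper's actual attention implementation is not shown here.

```python
import torch

def multiview_joint_attention(target_tokens, ref_tokens_list, attn):
    """Sketch of multi-view joint attention over concatenated tokens.

    target_tokens: (B, N_t, D) noisy target latents from the main branch
    ref_tokens_list: list of (B, N_r, D) reference features from the shared 3Dapter
    attn: any module mapping (B, N, D) -> (B, N, D)

    Reference tokens are appended along the sequence axis so the target
    positions can attend to view-specific hints; only the target slice
    of the output is returned.
    """
    tokens = torch.cat([target_tokens, *ref_tokens_list], dim=1)
    out = attn(tokens)
    return out[:, : target_tokens.shape[1]]  # keep only the target positions
```
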

As shown in the figure below, the 3Dapter module utilizes a two-stage conditioning mechanism. The first stage involves single-view pre-training on reference-target pairs. The second stage performs multi-view joint optimization where a shared 3Dapter processes multiple conditioning views in parallel. The Multi-view Joint Attention acts as a dynamic selective router, querying relevant view-specific geometric hints to reconstruct the target view. This shared architecture ensures consistent geometric feature extraction across viewpoints without increasing parameter count linearly.
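
Weight sharing across conditioning views, as described above, can be illustrated with a toy module: the same parameters process every view, so the parameter count does not grow with the number of views. This is a minimal stand-in, not the actual 3Dapter architecture.

```python
import torch
import torch.nn as nn

class Shared3DapterSketch(nn.Module):
    """Toy shared conditioning module applied per view in parallel.

    One set of weights handles all V reference views, so parameters stay
    constant as V grows (unlike per-view branches).
    """
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, views):  # views: (B, V, N, D)
        b, v, n, d = views.shape
        feats = self.proj(views.reshape(b * v, n, d))  # same weights for every view
        return feats.reshape(b, v, n, d)
```
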

As shown in the figure below, the training process incorporates a 3D consistency check to ensure geometric fidelity. Generated frames undergo depth estimation and foreground masking to produce a predicted point cloud. This cloud is aligned with a reference point cloud to compute the Chamfer distance, which serves as a constraint during optimization. This approach allows the model to focus on learning geometric transformations while preserving high-frequency details, effectively overcoming the information bottleneck of text-driven customization.
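
The Chamfer-distance constraint mentioned above can be written compactly: the average nearest-neighbor squared distance from each cloud to the other, summed in both directions. A minimal NumPy sketch (brute-force pairwise distances, suitable only for small clouds):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point clouds p (N, 3) and q (M, 3).

    For each point, find the squared distance to its nearest neighbor in
    the other cloud; average each direction and sum the two terms.
    """
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise sq. dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

Identical clouds give a distance of zero, and the value grows as the predicted point cloud drifts from the reference, which is what makes it usable as an alignment constraint during optimization.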

Experiment

  • Comparative experiments against VACE and Phantom validate that the proposed framework achieves superior multi-view subject fidelity, preserving identity, shape, color, and fine-grained details during 360-degree rotations where baselines fail to reconstruct unseen viewpoints.
  • 3D geometric fidelity evaluations demonstrate that the method significantly reduces reconstruction error and improves surface coverage compared to single-view approaches, confirming its ability to recover complete 3D structures from multi-view conditioning.
  • Ablation studies reveal that the synergistic combination of 3Dapter and 3DreamBooth is essential, as 3Dapter alone lacks 3D consistency while 3DreamBooth alone struggles with texture details and slow convergence, whereas their joint optimization ensures both structural accuracy and high-frequency detail preservation.
  • Additional tests confirm the framework's robustness across diverse object categories and dynamic scenarios, as well as its extensibility to other Diffusion Transformer architectures without requiring explicit spatial conditioning modules.
  • Analysis of training dynamics shows that pre-training the 3Dapter module is critical to prevent optimization collapse and enable rapid convergence, while the full framework achieves high-fidelity results in fewer iterations than text-driven baselines.
