
Geometry-Guided Reinforcement Learning for Multi-View Consistent 3D Scene Editing

Abstract

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in the edited results remains a major challenge, and the extreme scarcity of 3D-consistent editing data pairs makes supervised fine-tuning (SFT), otherwise the most effective training strategy for editing tasks, infeasible. In this paper, we observe that while generating multi-view consistent 3D content is highly complex, verifying such 3D consistency is tractable, which naturally positions reinforcement learning (RL) as a viable solution. Motivated by this observation, we propose RL3DEdit, a single-pass framework driven by RL optimization that incorporates novel rewards derived from the 3D foundation model VGGT. Specifically, we exploit the robust priors VGGT has learned from vast amounts of real-world data: we feed the model the edited images and use the resulting confidence maps and pose-estimation errors as reward signals. This approach effectively anchors 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and surpasses state-of-the-art methods in editing quality while remaining highly efficient. To support the development of 3D editing, we will release the corresponding code and models.

One-sentence Summary

Researchers from BJTU, AMap Alibaba Group, NTU, and CQUPT propose RL3DEdit, a single-pass 3D editing framework that leverages reinforcement learning with VGGT-derived rewards to ensure multi-view consistency, overcoming data scarcity and outperforming iterative methods in both quality and efficiency for AR and gaming applications.

Key Contributions

  • Current 3D editing methods struggle with multi-view consistency and cannot utilize supervised fine-tuning due to the extreme scarcity of 3D-consistent paired data.
  • The proposed RL3DEdit framework leverages the 3D foundation model VGGT to generate novel reward signals from confidence maps and pose errors, enabling RL optimization that anchors 2D editing priors onto a 3D-consistent manifold without requiring paired datasets.
  • Extensive experiments demonstrate that this single-pass approach achieves state-of-the-art editing quality and stable multi-view consistency while operating more than twice as fast as previous iterative optimization methods.

Introduction

3D scene editing is critical for AR/VR and gaming applications, yet current methods struggle to maintain geometric coherence while leveraging powerful 2D diffusion models. Prior approaches suffer from inefficiency due to iterative optimization, produce blurry artifacts from inconsistent signals, or fail to handle edits that alter scene geometry because they rely on depth maps or attention propagation. The authors address these challenges by introducing RL3DEdit, a single-pass framework that uses reinforcement learning to optimize 2D editors for 3D consistency without requiring scarce paired training data. They leverage the 3D foundation model VGGT as a robust verifier to generate geometry-aware reward signals, effectively anchoring 2D editing priors onto a 3D-consistent manifold while achieving state-of-the-art quality and over twice the speed of existing methods.

Method

The authors propose RL3DEdit, a framework that leverages Reinforcement Learning to equip a 2D foundation model with 3D-consistency priors. The overall architecture is illustrated in the framework diagram. Given a 3D asset, the system first renders it from $M$ viewpoints to obtain a set of images $\{I_m\}_{m=1}^{M}$. These images are fed simultaneously into a 2D editor, denoted $\pi$, for joint multi-view editing. During inference, the fine-tuned editor produces multi-view consistent images in a single forward pass, which are subsequently processed by 3D Gaussian Splatting (3DGS) reconstruction to yield the final edited 3D scene.
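The inference pipeline above can be sketched as a simple data flow. All three stages are placeholders (the real system uses a DiT-based editor and a 3DGS reconstructor, neither of which is reproduced here); only the single-pass structure is illustrated.

```python
# Stand-in sketch of the RL3DEdit inference pipeline: render M views,
# edit them jointly in one forward pass, then reconstruct with 3DGS.
# Every function body is a placeholder, not the actual implementation.

M = 4  # number of rendered viewpoints

def render_views(asset, num_views):
    """Render the 3D asset from num_views viewpoints -> {I_m}, m = 1..M."""
    return [f"{asset}_view_{m}" for m in range(num_views)]

def joint_multiview_edit(views, instruction):
    """The fine-tuned editor pi processes all views in one forward pass."""
    return [f"edited({v}, {instruction})" for v in views]

def reconstruct_3dgs(edited_views):
    """3D Gaussian Splatting reconstruction of the edited scene."""
    return {"gaussians_from": edited_views}

views = render_views("chair", M)
edited = joint_multiview_edit(views, "make it red")
scene = reconstruct_3dgs(edited)
print(len(edited))  # 4
```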

To address the core challenge of ensuring 3D consistency without paired supervision, the authors employ the Group Relative Policy Optimization (GRPO) algorithm. During training, the system explores a group of $G$ edited results through independent inference passes. A dedicated 3D-aware reward model, implemented via VGGT, is utilized to explicitly enforce both editing faithfulness and multi-view coherence. This model jointly assesses three critical aspects of multi-view consistency, represented as a depth confidence reward $r^D$, a point confidence reward $r^P$, and a relative pose reward $r^T$, alongside an editing quality term $r^a$. These complementary rewards are combined to form the final composite reward $R^i$, which guides the optimization toward consistent and high-quality 3D-aware editing.
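The composite reward can be sketched as a weighted combination of the four terms. The equal weights and the numeric values below are illustrative assumptions; the paper does not specify them here.

```python
# Toy sketch of the composite reward: depth confidence r_D, point
# confidence r_P, relative pose reward r_T, and editing-quality term r_a
# combined into one scalar R_i per edited sample. Weights are assumed.

def composite_reward(r_depth, r_point, r_pose, r_anchor,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four reward terms for one edited sample."""
    w_d, w_p, w_t, w_a = weights
    return w_d * r_depth + w_p * r_point + w_t * r_pose + w_a * r_anchor

# One composite reward per member of the GRPO group of G edited results.
group_rewards = [
    composite_reward(0.9, 0.8, 0.7, 0.95),  # a consistent, faithful edit
    composite_reward(0.6, 0.5, 0.4, 0.90),  # a less consistent edit
]
print(group_rewards)
```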

The choice of the 2D backbone is critical for enabling cross-view interaction. The authors adopt FLUX-Kontext, a DiT-based model that naturally supports multi-image joint editing through global attention mechanisms. This capability allows the model to process all input views as a concatenated sequence, facilitating the necessary cross-view interactions for 3D consistency. The versatility of this approach is demonstrated by the diverse editing capabilities shown in the qualitative results, which include motion, replacement, style transfer, background modification, and object addition.
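The benefit of concatenation can be seen in a toy example: once the token sequences of all views are joined, a single softmax attention lets every token attend to tokens from every other view. The scalar "embeddings" below are purely illustrative; real DiT features are high-dimensional and multi-headed.

```python
import math

# Toy single-head dot-product attention over scalar token "embeddings",
# showing why concatenating per-view sequences enables cross-view mixing.

def attention(tokens, dim=1):
    """Each output token is a softmax-weighted mix of ALL input tokens."""
    scores = [[(q * k) / math.sqrt(dim) for k in tokens] for q in tokens]
    out = []
    for row in scores:
        m = max(row)                      # stabilize the softmax
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append(sum((e / z) * v for e, v in zip(exps, tokens)))
    return out

view_a = [0.1, 0.4]                 # tokens of view 1
view_b = [0.3, 0.2]                 # tokens of view 2
joint = attention(view_a + view_b)  # tokens now mix information across views
print(len(joint))  # 4
```

Because attention runs over the joined sequence, each output is a convex combination of tokens from both views, which is the cross-view interaction the method relies on.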

A key component of the training process is the anchor reward, designed to preserve the original 2D editing fidelity of the foundation model. By comparing the edited anchor view against a pre-computed high-quality 2D edit, the model ensures that semantic correctness and visual details are maintained while learning 3D priors. As illustrated in the comparison of editing capabilities, the RL fine-tuned model successfully preserves the original 2D editing fidelity of the base model, as evidenced by the comparable VIE Score metrics.
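A minimal sketch of such an anchor reward, assuming a simple MSE-based similarity between the edited anchor view and the pre-computed 2D edit; the paper's actual metric (it reports VIE Score for 2D fidelity) is not detailed here, so this form is only illustrative.

```python
# Hypothetical anchor reward: higher when the edited anchor view stays
# close to a pre-computed high-quality 2D edit of the same view.
# Images are flattened pixel lists for simplicity.

def anchor_reward(edited_anchor, reference_edit):
    assert len(edited_anchor) == len(reference_edit)
    mse = sum((a - b) ** 2
              for a, b in zip(edited_anchor, reference_edit)) / len(edited_anchor)
    return 1.0 / (1.0 + mse)  # maps MSE in [0, inf) to a reward in (0, 1]

print(anchor_reward([0.2, 0.5, 0.9], [0.2, 0.5, 0.9]))  # identical -> 1.0
```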

Finally, the relative advantage $A^i$ is computed from the group rewards, and the model is optimized by maximizing the following objective:

$$J(\theta) = J_{\text{clip}}(\theta) - \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$$

where $\pi_\theta$ and $\pi_{\text{ref}}$ denote the fine-tuned and original 2D editors, respectively. This formulation allows the model to learn 3D-consistency priors effectively without requiring curated paired data.
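The objective can be sketched numerically, assuming the standard GRPO ingredients: advantages normalized within the group, a PPO-style clipped surrogate, and a KL penalty toward the frozen reference editor. All numbers below are illustrative.

```python
import math

# Toy GRPO step: group-relative advantages A^i plus the clipped,
# KL-regularized objective J(theta) = J_clip(theta) - beta * KL.

def group_advantages(rewards):
    """Normalize each reward by the group mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]

def grpo_objective(ratios, advantages, kl, eps=0.2, beta=0.01):
    """PPO-style clipped surrogate minus a KL penalty to pi_ref.

    ratios: per-sample probability ratios pi_theta / pi_ref (toy scalars).
    """
    clipped = [
        min(rho * a, max(min(rho, 1.0 + eps), 1.0 - eps) * a)
        for rho, a in zip(ratios, advantages)
    ]
    return sum(clipped) / len(clipped) - beta * kl

adv = group_advantages([3.2, 2.4, 2.8, 3.0])  # one reward per group member
print(grpo_objective([1.1, 0.9, 1.0, 1.3], adv, kl=0.05))
```

By construction the advantages sum to zero within a group, so edits are rewarded only relative to their siblings, which is what removes the need for an external value model.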

Experiment

  • Comparative experiments against state-of-the-art 3D editing methods demonstrate that the proposed approach achieves superior instruction following, visual fidelity, and multi-view consistency while significantly reducing editing time.
  • Qualitative analysis reveals that the method successfully handles complex geometric transformations, motion edits, and style changes where baseline models fail due to artifacts, ghosting, or semantic misinterpretation.
  • Ablation studies confirm that depth and point confidence rewards are essential for preventing ghosting artifacts and maintaining 3D consistency, while text-based rewards ensure accurate viewpoint alignment.
  • Experiments comparing consistency verifiers show that traditional metrics like Structure-from-Motion and photometric reprojection loss lead to textureless or blurred outputs, validating the necessity of using data-driven priors for reward signals.
  • Additional tests verify that the framework generalizes effectively to unseen instructions and scenes in a zero-shot setting and can be enhanced by integrating more powerful 2D editing backbones.
