
Geometry-Guided Reinforcement Learning for Multi-View Consistent 3D Scene Editing

Abstract

3D editing that leverages the priors of 2D diffusion models has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains difficult, and because paired data with 3D-consistent edits is extremely scarce, supervised fine-tuning (SFT), the most effective training strategy for editing tasks, is impractical. In this work, we observe that while generating multi-view consistent 3D content is extremely difficult, verifying 3D consistency is feasible, which naturally positions reinforcement learning (RL) as a viable solution. Building on this insight, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, exploiting the robust priors VGGT has learned from vast real-world data, we feed the edited images into it and use its output confidence maps and pose-estimation errors as reward signals, effectively anchoring 2D editing priors onto a 3D-consistent manifold through reinforcement learning. Extensive experiments show that RL3DEdit achieves stable multi-view consistency, surpasses state-of-the-art methods in editing quality, and is highly efficient. We will release code and models to facilitate progress in 3D editing.

One-sentence Summary

Researchers from BJTU, AMap Alibaba Group, NTU, and CQUPT propose RL3DEdit, a single-pass 3D editing framework that leverages reinforcement learning with VGGT-derived rewards to ensure multi-view consistency, overcoming data scarcity and outperforming iterative methods in both quality and efficiency for AR and gaming applications.

Key Contributions

  • Current 3D editing methods struggle with multi-view consistency and cannot utilize supervised fine-tuning due to the extreme scarcity of 3D-consistent paired data.
  • The proposed RL3DEdit framework leverages the 3D foundation model VGGT to generate novel reward signals from confidence maps and pose errors, enabling RL optimization that anchors 2D editing priors onto a 3D-consistent manifold without requiring paired datasets.
  • Extensive experiments demonstrate that this single-pass approach achieves state-of-the-art editing quality and stable multi-view consistency while operating more than twice as fast as previous iterative optimization methods.

Introduction

3D scene editing is critical for AR/VR and gaming applications, yet current methods struggle to maintain geometric coherence while leveraging powerful 2D diffusion models. Prior approaches suffer from inefficiency due to iterative optimization, produce blurry artifacts from inconsistent signals, or fail to handle edits that alter scene geometry because they rely on depth maps or attention propagation. The authors address these challenges by introducing RL3DEdit, a single-pass framework that uses reinforcement learning to optimize 2D editors for 3D consistency without requiring scarce paired training data. They leverage the 3D foundation model VGGT as a robust verifier to generate geometry-aware reward signals, effectively anchoring 2D editing priors onto a 3D-consistent manifold while achieving state-of-the-art quality and over twice the speed of existing methods.

Method

The authors propose RL3DEdit, a framework that leverages Reinforcement Learning to equip a 2D foundation model with 3D-consistency priors. The overall architecture is illustrated in the framework diagram. Given a 3D asset, the system first renders it from $M$ viewpoints to obtain a set of images $\{I_m\}_{m=1}^{M}$. These images are fed simultaneously into a 2D editor, denoted as $\pi$, for joint multi-view editing. During inference, the fine-tuned editor produces multi-view consistent images in a single forward pass, which are subsequently processed by 3D Gaussian Splatting (3DGS) reconstruction to yield the final edited 3D scene.
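The single-pass pipeline described above can be sketched as follows. Note that the `renderer`, `editor`, and `reconstructor` objects and their method names are illustrative placeholders for the stages named in the text, not the authors' actual API:

```python
def edit_3d_scene(asset, instruction, renderer, editor, reconstructor, M=8):
    """Hypothetical sketch of the RL3DEdit inference pipeline:
    render M views, edit them jointly in one forward pass,
    then reconstruct the edited 3D scene (e.g. via 3DGS fitting)."""
    # Render the asset from M viewpoints: {I_m}, m = 1..M
    views = [renderer.render(asset, viewpoint=m) for m in range(M)]
    # Joint multi-view edit in a single forward pass of the 2D editor pi
    edited_views = editor.edit_jointly(views, instruction)
    # Reconstruct the final edited 3D scene from the edited views
    return reconstructor.fit(edited_views)
```

The key point the sketch captures is that all views are edited together in one pass, rather than edited per-view and reconciled by iterative optimization.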

To address the core challenge of ensuring 3D consistency without paired supervision, the authors employ the Group Relative Policy Optimization (GRPO) algorithm. During training, the system explores a group of $G$ edited results through independent inference passes. A dedicated 3D-aware reward model, implemented via VGGT, is utilized to explicitly enforce both editing faithfulness and multi-view coherence. This model jointly assesses three critical aspects of multi-view consistency, represented as depth confidence $r^D$, point confidence $r^P$, and relative pose reward $r^T$, alongside an editing quality term $r^a$. These complementary rewards are combined to form the final composite reward $R^i$, which guides the optimization toward consistent and high-quality 3D-aware editing.
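As a rough illustration, the composite reward can be formed as a weighted sum of the four signals. The uniform weights and the exponential mapping from pose error to reward below are assumptions made for this sketch, not details taken from the paper:

```python
import numpy as np

def composite_reward(depth_conf, point_conf, pose_err, anchor_score,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine VGGT-derived signals into one scalar reward R^i.

    depth_conf, point_conf : per-pixel confidence maps in [0, 1]
    pose_err               : relative pose error of the edited views
                             (lower is better)
    anchor_score           : 2D editing-fidelity score for the anchor view
    weights                : assumed uniform here for illustration
    """
    r_d = float(np.mean(depth_conf))   # depth-confidence reward r^D
    r_p = float(np.mean(point_conf))   # point-confidence reward r^P
    r_t = float(np.exp(-pose_err))     # pose reward r^T: decays with error
    r_a = float(anchor_score)          # editing-quality / anchor reward r^a
    w_d, w_p, w_t, w_a = weights
    return w_d * r_d + w_p * r_p + w_t * r_t + w_a * r_a
```

Any monotone mapping from pose error to reward would serve the same role; the exponential is just one bounded choice.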

The choice of the 2D backbone is critical for enabling cross-view interaction. The authors adopt FLUX-Kontext, a DiT-based model that naturally supports multi-image joint editing through global attention mechanisms. This capability allows the model to process all input views as a concatenated sequence, facilitating the necessary cross-view interactions for 3D consistency. The versatility of this approach is demonstrated by the diverse editing capabilities shown in the qualitative results, which include motion, replacement, style transfer, background modification, and object addition.
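A minimal sketch of how multiple views can be flattened into one token sequence so that global attention spans all views at once. The array shapes and the bookkeeping helper are illustrative, not FLUX-Kontext's actual interface:

```python
import numpy as np

def concat_views_for_joint_attention(view_tokens):
    """Concatenate per-view token sequences so a DiT's global attention
    can attend across every view in a single pass.

    view_tokens : list of arrays shaped (n_tokens, dim), one per view.
    Returns (tokens, view_ids), where view_ids maps each token back to
    its source view so the edited views can be split apart afterwards.
    """
    tokens = np.concatenate(view_tokens, axis=0)
    view_ids = np.concatenate(
        [np.full(v.shape[0], i) for i, v in enumerate(view_tokens)])
    return tokens, view_ids
```

Because every token can attend to tokens from every other view, the model can enforce cross-view agreement without any explicit depth map or attention-propagation machinery.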

A key component of the training process is the anchor reward, designed to preserve the original 2D editing fidelity of the foundation model. By comparing the edited anchor view against a pre-computed high-quality 2D edit, the model ensures that semantic correctness and visual details are maintained while learning 3D priors. As illustrated in the comparison of editing capabilities, the RL fine-tuned model successfully preserves the original 2D editing fidelity of the base model, as evidenced by the comparable VIE Score metrics.

Finally, the relative advantage $A^i$ is computed from the group rewards, and the model is optimized by maximizing the following objective:

$$J(\theta) = J_{\text{clip}}(\theta) - \beta\, D_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$

where $\pi_\theta$ and $\pi_{\text{ref}}$ denote the fine-tuned and original 2D editors, respectively. This formulation allows the model to learn 3D-consistency priors effectively without requiring curated paired data.
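Following standard GRPO conventions, the group-relative advantage normalizes each sample's reward within its group, and the objective combines a clipped surrogate with the KL penalty above. The clipping threshold and KL weight below are typical defaults, not values from the paper:

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO advantage A^i: z-score each reward within its group of G samples."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_objective(logp_new, logp_old, advantages, kl, eps=0.2, beta=0.04):
    """Clipped surrogate J_clip minus the KL penalty to the reference editor.

    logp_new, logp_old : per-sample log-probabilities under pi_theta / pi_ref
    kl                 : estimate of D_KL(pi_theta || pi_ref)
    eps, beta          : assumed defaults for this sketch
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    j_clip = np.mean(np.minimum(ratio * advantages, clipped * advantages))
    return j_clip - beta * kl
```

Because the advantage is computed relative to the group mean, GRPO needs no learned value function: better-than-average edits in a group are reinforced and worse-than-average ones are suppressed.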

Experiment

  • Comparative experiments against state-of-the-art 3D editing methods demonstrate that the proposed approach achieves superior instruction following, visual fidelity, and multi-view consistency while significantly reducing editing time.
  • Qualitative analysis reveals that the method successfully handles complex geometric transformations, motion edits, and style changes where baseline models fail due to artifacts, ghosting, or semantic misinterpretation.
  • Ablation studies confirm that depth and point confidence rewards are essential for preventing ghosting artifacts and maintaining 3D consistency, while the pose reward ensures accurate viewpoint alignment.
  • Experiments comparing consistency verifiers show that traditional metrics like Structure-from-Motion and photometric reprojection loss lead to textureless or blurred outputs, validating the necessity of using data-driven priors for reward signals.
  • Additional tests verify that the framework generalizes effectively to unseen instructions and scenes in a zero-shot setting and can be enhanced by integrating more powerful 2D editing backbones.
