Command Palette
Search for a command to run...
ワールド・トレーシング:可視領域を超えた生成ピクセルアラインドジオメトリ
ワールド・トレーシング:可視領域を超えた生成ピクセルアラインドジオメトリ
Hao Zhang Mohamed El Banani Jen-Hao Cheng Paul Zhang Yi Hua Ben Mildenhall Christoph Lassner Narendra Ahuja Gengshan Yang
概要
Image-to-3D 手法は、忠実性と完全性の間でトレードオフを起こすことが多い。深度推定器は入力画素に依存するが可視表面で止まり、一方、画像から3Dへのモデルは入力と整合しないことが多い完全形状を生成する。われわれは、World Tracing と呼ばれる生成型画素アライメント幾何表現を導入する。これは、観測された画素と整合する 3D 点を予測しつつ、可視表面を超えた幾何形状も補完する。各入力画素に対して、World Tracing はカメラ空間の 3D 点の順序付けられたスタックを予測し、スタックの第 1 層は可視表面を表し、その後の層は隠蔽面との前後方向の交点を表す。この表現を世界追跡拡散 Transformer (WT-DiT) で具現化し、複数の幾何層を、因数分解された注意機構とグローバル注意機構を通じて結合された個別のノイズ除去トークンとして扱う。WT-DiT は、画素空間におけるフローマッチングと、可視表面の再構築と隠蔽幾何形状の生成のバランスを取る混合ノイズスケジューリングにより訓練される。World Tracing は、オブジェクト、シーン、動的多様なベンチマークにおいて、可視表面の再構築と完全な幾何形状の生成の両面で優れた性能を発揮し、深度予測器および画像から 3D への生成器の両者を上回る。さらに、2D から 3D への対応関係を保証するため、テキスト駆動型 3D シーン編集、幾何条件付き新規ビュー動画合成、テクスチャ付きメッシュ生成器とのトレーニング不要の統合を可能にする。
One-sentence Summary
World Tracing, a generative pixel-aligned geometry representation instantiated with a world-tracing diffusion transformer (WT-DiT) trained with pixel-space flow matching and a mixed noise schedule, predicts ordered stacks of camera-space 3D points per input pixel representing visible and occluded surfaces, outperforming depth predictors and image-to-3D generators on object, scene, and dynamic benchmarks while enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
Key Contributions
- This work introduces World Tracing, a generative pixel-aligned geometry representation designed to resolve the trade-off between faithfulness and completeness by predicting ordered stacks of camera-space 3D points. The method instantiates this representation with a diffusion transformer that couples multiple geometry layers through factorized and global attention and trains using pixel-space flow matching.
- Comprehensive evaluations across objects, scenes, and videos demonstrate that multilayer generation improves complete geometry while maintaining visible-surface accuracy. The approach outperforms both depth predictors and image-to-3D generators on these benchmarks.
- The system preserves 2D-to-3D correspondence to enable downstream tasks like text-driven 3D scene editing and geometry-conditioned novel-view video synthesis. Additional demonstrations show training-free integration with textured-mesh generators, validating the model's utility as a geometry prior.
Introduction
Single-image 3D estimation typically forces a choice between faithful visible surface reconstruction and complete geometry generation. Existing depth estimators align with input pixels but fail to model occluded surfaces, whereas image-to-3D models generate full shapes at the cost of pixel alignment. This limitation hinders downstream applications like 3D editing and novel-view synthesis that require consistent camera-space geometry. To address this, the authors introduce World Tracing, a generative pixel-aligned geometry representation predicting an ordered stack of 3D points per input pixel to capture both visible and hidden surfaces. They instantiate this with a diffusion transformer trained via flow matching to unify high-fidelity surface estimation with plausible occluded geometry completion.
Dataset
Dataset Composition and Sources
- The authors build three corpora for multilayer 3D-asset training containing approximately 300K objects, scene frames, and 16.8K dynamic clips.
- Object data sources include Objaverse-XL, Objaverse, 3D-FUTURE, Toys4k, GSO, and TrueBones, resulting in roughly 17M rendered views.
- Scene data combines the public 3D-FRONT corpus with a held-out internal scene set used as a generalization probe.
- Dynamic assets consist of Objaverse-XL animated subsets and rigged characters from TrueBones sampled as short clips.
- A supplementary 12-dataset RGBD-style corpus supports mix-training with real photographs from ScanNet v2, MegaDepth, and Waymo Open alongside synthetic sets like Hypersim and Taskonomy.
Key Details for Each Subset
- Renderings utilize randomized lighting, viewpoints, and intrinsics to ensure diversity.
- Ground-truth layers derive from depth peeling to record ordered front-to-back surface intersections.
- Forward-filling rules populate rays with fewer than L intersections to create dense L-layer XYZ targets.
- The RGBD-style corpus provides supervision for the visible surface layer L0 only, whereas the multilayer corpus shapes layers L1 through L5.
Model Usage and Training Mixture
- Mini-batches sample from the 3D-asset corpus with 60% probability and the RGBD-style corpus with 40% probability.
- RGBD datasets are weighted by the square root of row counts with a 1.5x boost for real-photograph sets.
- Training occurs at 504x504 resolution with a global batch size of 512 optimized via AdamW.
- Evaluation relies on held-out public benchmarks including 100 unique objects, the 3D-FRONT split, and dynamic benchmarks like Obj-Val and Truebone.
Processing and Augmentation Strategies
- Online augmentations apply random crops, flips, and affine perturbations to images and camera-space targets simultaneously.
- Photometric jitter adjusts brightness, contrast, and saturation while alpha masks undergo random dilation or erosion.
- Dynamic clips maintain geometric and mask-space augmentation consistency across frames to preserve temporal integrity.
- Depth values convert to camera-space XYZ using original intrinsics and normalize via a per-sample log-median map.
Method
The authors present an image-to-3D pipeline designed to produce L layer pointmaps from a single input image. The core of this approach is a pixel-aligned representation where geometry is expressed as a tensor X∈RL×H×W×3, with each layer corresponding to a front-to-back intersection along a pixel ray. To facilitate training, the authors employ a dense supervision strategy where empty intersections in deeper layers are filled from the nearest valid earlier layer, ensuring every valid input ray supervises all layers.
The overall framework is illustrated in the provided diagram. The process begins with two inputs: an RGBA image (H,W,4) and a noisy multilayer XYZ stack (L,H,W,5). The visual information is processed by a frozen MoGe ViT-L encoder, which generates pixel-aligned image tokens. Simultaneously, the noisy geometry is patchified into geometry tokens. These two streams of tokens are then fused together.
The fused tokens are fed into a stack of WT-DiT blocks, which serve as the core diffusion transformer. This module incorporates a specialized three-way attention factorization to handle the multilayer structure efficiently. First, layer-wise attention allows each layer to attend to itself as a 2D image. Second, ray-wise attention enables tokens at the same pixel to attend along the front-to-back layer axis, enforcing depth ordering and coherence. Third, global attention recovers object or scene-level context. For dynamic clips, a temporal attention block is also inserted to handle time-axis dependencies.
Conditioning is applied through embeddings for the timestep and layer identifier. These embeddings are modulated into the network via FiLM layers, which break the layer permutation symmetry without requiring learnable additive position tokens. The network is trained using a flow-matching objective directly on the normalized coordinate signal, predicting the clean endpoint x^0. Finally, the output tokens are unpatchified to produce the final denoised multilayer XYZ prediction, preserving the camera-space coordinates and pixel-to-3D correspondences.
Experiment
The experiments verify that a pixel-aligned multilayer paradigm yields more faithful 3D geometry across objects, scenes, and dynamic clips while benefiting downstream pipelines dependent on pose or disoccluded structure. Qualitative comparisons indicate the model generates complete geometry without sacrificing visible-surface accuracy, outperforming regression-based and canonical-frame baselines by better modeling occluded surfaces and preserving planar structures on out-of-distribution inputs. Downstream demonstrations confirm that this camera-space representation enables robust object insertion, textured mesh generation, and view synthesis, establishing the multilayer approach as a superior prior for real-world geometric tasks.
The the the table displays per-layer statistics regarding valid pixel coverage and view counts for a multilayer geometry model. Results indicate a consistent downward trend in both mean coverage and valid views as the layer depth increases from the surface to deeper occluded regions. Mean valid coverage is highest for the initial layer and diminishes substantially in subsequent layers. The number of valid views decreases steadily as the layer index moves from top to bottom. There is a marked disparity in data density between the visible surface layers and the deeper geometric layers.
The authors evaluate different diffusion timestep schedules to determine their impact on multilayer geometry generation quality. While specific schedules favor either the visible surface or deeper occluded layers, combining them into a mixture yields the most consistent results across the entire geometry stack. The plateaued logit-normal schedule outperforms the standard schedule on the visible Layer 0. The standard logit-normal schedule is more effective for deeper layers compared to the plateaued variant. A mixture of both distributions achieves the best overall performance across all layers.
The authors evaluate scene geometry generation on the 3D-FRONT benchmark and an internal test set. The proposed WT-S method achieves superior performance in both visible-surface depth accuracy and complete multilayer geometry reconstruction compared to existing baselines like LaRI-scene and MoGe-2. WT-S achieves the best visible-surface depth and L0 point-cloud geometry among all compared methods. The method substantially improves all-layer geometry metrics over single-layer baselines that do not produce occluded layers. The full model trained on a broader corpus outperforms the version trained solely on 3D-FRONT on the held-out benchmark.
The the the table compares the proposed WT-O method against several baselines for object geometry generation. Results indicate that WT-O achieves the highest accuracy and completeness, outperforming methods that generate meshes or 3D Gaussians. The data supports the claim that the pixel-aligned multilayer paradigm yields more faithful 3D geometry across objects. WT-O achieves the best performance metrics across all measured categories compared to other methods. The proposed approach generates a point cloud output that surpasses mesh-based and 3D Gaussian baselines in geometry faithfulness. Baselines such as TRELLIS.2 and ReconViaGen show higher error rates and lower fidelity scores in this evaluation.
The authors evaluate dynamic geometry generation using a proposed model against several baselines across object, rigged, and action-driven splits. The proposed method achieves the best overall mean performance and leads on object validation and rigged asset benchmarks. ActionMesh demonstrates superior performance specifically on the action-driven benchmark where ground truth matches its native output format. The proposed model achieves the best mean performance across all dynamic geometry benchmarks. Results show superior accuracy on object validation and rigged asset evaluations compared to diffusion and regression baselines. ActionMesh remains the top performer on action-driven motion tasks due to native format alignment with ground truth.
Experiments evaluate a multilayer geometry model across scene, object, and dynamic generation tasks, revealing significant data density disparities between visible surfaces and deeper occluded regions. Evaluation of diffusion schedules shows that combining distributions yields consistent results across layers, while the proposed methods outperform existing baselines in reconstruction accuracy and completeness. The results validate that the pixel-aligned multilayer paradigm yields more faithful 3D geometry than traditional mesh or Gaussian approaches across the evaluated scenarios.