Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
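Abstract
Transparent objects remain notoriously hard for perception systems. Refraction, reflection, and transmission break the assumptions behind stereo, ToF, and purely discriminative monocular depth estimation, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models can already synthesize convincing transparent phenomena, which suggests that they have internalized the underlying optical rules. Building on this, we construct TransPhy3D, a synthetic video corpus of transparent/reflective scenes. It consists of 11,000 sequences rendered with Blender/Cycles, with scenes assembled from category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. RGB, depth, and normals are rendered with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) through lightweight LoRA adapters. During training, we concatenate RGB and (noisy) depth latents inside the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, which yields temporally consistent predictions for input videos of arbitrary length. The resulting model, DKT, achieves zero-shot state-of-the-art (SOTA) results on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and its normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B-parameter version runs at about 0.17 seconds per frame. When integrated into a grasping system, it substantially raises success rates across translucent, reflective, and diffuse surfaces and outperforms prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and without labels, into robust and temporally coherent perception for challenging real-world manipulation.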
One-sentence Summary
The authors, affiliated with Beijing Academy of Artificial Intelligence, Tsinghua University, and other institutions, propose DKT—a foundation model for video depth and normal estimation of transparent objects—by repurposing a video diffusion model via LoRA fine-tuning on TransPhy3D, a novel synthetic video dataset of 11k transparent/reflective scenes, achieving zero-shot SOTA performance on real and synthetic benchmarks, with applications in robotic grasping across translucent, reflective, and diffuse surfaces.
Key Contributions
- Transparent and reflective objects pose persistent challenges for depth estimation due to physical phenomena like refraction and reflection, which break assumptions in stereo, ToF, and monocular methods, leading to holes and temporal instability in predictions.
- The authors introduce TransPhy3D, a novel synthetic video dataset of 11,000 sequences (1.32M frames) with physically accurate rendering of transparent and reflective scenes, enabling training of video diffusion models on realistic transparent-object dynamics.
- By repurposing a large video diffusion model with lightweight LoRA adapters and co-training on TransPhy3D and existing frame-wise datasets, the proposed DKT model achieves zero-shot state-of-the-art performance on real and synthetic benchmarks, with improved accuracy and temporal consistency for video depth and normal estimation.
Introduction
Accurate depth estimation for transparent and reflective objects remains a critical challenge in 3D perception and robotic manipulation, as traditional methods relying on stereo or time-of-flight sensors fail due to refraction, reflection, and transmission. Prior data-driven approaches have been limited by small, static datasets and poor generalization, while existing video-based models struggle with temporal inconsistency. The authors introduce TransPhy3D, the first large-scale synthetic video dataset of transparent and reflective scenes—11,000 sequences (1.32M frames)—rendered with physically based ray tracing and OptiX denoising. They leverage this data to repurpose a pre-trained video diffusion model (VDM) into DKT, a foundation model for video depth and normal estimation, using lightweight LoRA adapters. By co-training on both video and frame-wise synthetic datasets, DKT achieves zero-shot state-of-the-art performance on real and synthetic benchmarks, delivering temporally coherent, high-accuracy predictions. The model runs efficiently at 0.17 seconds per frame and significantly improves robotic grasping success across diverse surface types, demonstrating that generative video priors can be effectively repurposed for robust, label-free perception of complex optical phenomena.
Dataset
- The dataset, TransPhy3D, is a novel synthetic video dataset designed for transparent and reflective objects, comprising 11,000 unique scenes with 120 frames each, totaling 1.32 million frames.
- It is built from two complementary sources: 574 high-quality static 3D assets collected from BlenderKit, filtered using Qwen2.5-VL-7B to identify transparent or reflective properties, and a procedurally generated set of parametric 3D assets that offer infinite shape variation through parameter tuning.
- A curated material library with diverse transparent (e.g., glass, plastic) and reflective (e.g., metal, glazed ceramic) materials is applied to ensure photorealistic rendering.
- Scenes are created using physics simulation: M assets are randomly selected, initialized with 6-DOF poses and scales in a predefined environment (e.g., container, tabletop), and allowed to fall and collide to achieve natural, physically plausible arrangements.
- Camera trajectories are generated as circular paths around the object’s geometric center, with sinusoidal perturbations to introduce dynamic viewpoint variation; videos are rendered using Blender’s Cycles engine with ray tracing for accurate light transport, including refraction and reflection.
- Final frames are denoised using NVIDIA OptiX-Denoiser to enhance visual quality.
- The authors use TransPhy3D to train DKT, a video depth estimation model, by fine-tuning a pretrained video diffusion model with LoRA. TransPhy3D supplies the video training data and is co-trained with existing frame-wise synthetic datasets; its diverse scene composition supports robust generalization.
- No cropping is applied; the full rendered frames are used. Metadata such as object categories, material types, and camera parameters are implicitly encoded in the scene generation pipeline and used during training.
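The camera path described in the list above (a circle around the objects' geometric center with sinusoidal perturbations) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `orbit_camera_positions` and the parameters `radial_amp`, `height_amp`, and `wobble_cycles` are hypothetical, and their values are chosen only for the example. Only the 120 frames per scene come from the text.

```python
import math

def orbit_camera_positions(center, radius, height, num_frames=120,
                           radial_amp=0.1, height_amp=0.05, wobble_cycles=3):
    """Return one (x, y, z) camera position per frame on a perturbed circle.

    All perturbation parameters are illustrative assumptions.
    """
    cx, cy, cz = center
    positions = []
    for i in range(num_frames):
        phase = 2.0 * math.pi * i / num_frames        # one full orbit per clip
        wobble = math.sin(wobble_cycles * phase)       # sinusoidal perturbation
        r = radius * (1.0 + radial_amp * wobble)       # radius moves in and out
        z = cz + height + height_amp * math.cos(wobble_cycles * phase)
        positions.append((cx + r * math.cos(phase),
                          cy + r * math.sin(phase),
                          z))
    return positions

positions = orbit_camera_positions(center=(0.0, 0.0, 0.0), radius=1.5, height=0.8)
```

Each position would then be paired with a look-at rotation toward `center` before rendering; that step is omitted here.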
Method
The authors leverage the WAN framework as the foundation for their video diffusion model, which consists of three core components: a variational autoencoder (VAE), a diffusion transformer (DiT) composed of multiple DiT blocks, and a text encoder. The VAE is responsible for compressing input videos into a lower-dimensional latent space and decoding predicted latents back into the image domain. The text encoder processes textual prompts into embeddings that guide the generation process. The diffusion transformer serves as the velocity predictor, estimating the velocity of the latent variables given noisy latents and text embeddings.
Refer to the framework diagram to understand the overall architecture. The model operates within a flow matching framework, which unifies the denoising diffusion process. During training, a noise latent $x_0$ is sampled from a standard normal distribution, and a clean latent $x_1$ is obtained from the dataset. An intermediate latent $x_t$ is then generated by linearly interpolating between $x_0$ and $x_1$ at a randomly sampled timestep $t$, as defined by the equation:
$$x_t = t\,x_1 + (1 - t)\,x_0.$$
The ground truth velocity $v_t$, representing the derivative of the interpolation path, is computed as:
$$v_t = \frac{dx_t}{dt} = x_1 - x_0.$$
The training objective is to minimize the mean squared error (MSE) between the predicted velocity from the DiT model, $u$, and the ground truth velocity $v_t$, resulting in the loss function:
$$\mathcal{L} = \mathbb{E}_{x_0, x_1, c_{txt}, t} \left\| u(x_t, c_{txt}, t) - v_t \right\|^2,$$
where $c_{txt}$ denotes the text embedding.
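The flow matching quantities above can be checked on a toy example. The sketch below treats a latent as a flat list of floats and uses only the Python standard library. The helper names `flow_matching_sample` and `mse` are hypothetical, and drawing the timestep uniformly from [0, 1) is an assumption, since the text only says it is randomly sampled.

```python
import random

def flow_matching_sample(x1, rng):
    """Build (x_t, v_t, t) for one clean latent x1 (a flat list of floats)."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # noise latent ~ N(0, I)
    t = rng.random()                                         # assumed uniform timestep
    xt = [t * a + (1.0 - t) * b for a, b in zip(x1, x0)]     # x_t = t*x1 + (1-t)*x0
    vt = [a - b for a, b in zip(x1, x0)]                     # v_t = x1 - x0
    return xt, vt, t

def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0, 0.0]                                   # toy "clean latent"
xt, vt, t = flow_matching_sample(x1, rng)

loss_perfect = mse(vt, vt)                                   # an ideal predictor has zero loss
# Following the true velocity from time t to time 1 lands on the clean latent.
recovered = [a + (1.0 - t) * v for a, v in zip(xt, vt)]
```

The last line illustrates why a constant velocity target is convenient: because the path is a straight line, integrating the velocity from $t$ to 1 recovers $x_1$ exactly.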
The training strategy co-trains on existing frame-wise synthetic image datasets together with the TransPhy3D video data, which improves efficiency and reduces the computational burden of rendering video. As illustrated in the figure below, the process begins by sampling a frame count $F$ for the video in the current batch using the formula:
$$F = 4N + 1, \qquad N \sim \mathcal{U}(0, 5).$$
If $F$ equals 1, the model samples a batch of paired data from both image and video datasets, where the video consists of a single frame. Otherwise, it samples exclusively from video datasets. The pipeline then proceeds by converting the depth video in each pair into disparity. Both the RGB and depth videos are normalized to the range $[-1, 1]$ to align with the VAE's training space. These normalized videos are then encoded by the VAE into their respective latents, $x_1^c$ for RGB and $x_1^d$ for depth. The depth latent $x_1^d$ is transformed into an intermediate latent $x_t^d$ using the same interpolation scheme as the clean latent. The input to the DiT blocks is formed by concatenating $x_t^d$ and $x_1^c$ along the channel dimension. The training loss is computed as the MSE between the DiT's output and the ground truth velocity $v_t^d$, which is derived from the depth latent:
$$\mathcal{L} = \mathbb{E}_{x_0, x_1^d, x_1^c, c_{txt}, t} \left\| u\big(\mathrm{Concat}(x_t^d, x_1^c), c_{txt}, t\big) - v_t^d \right\|^2.$$
All model components, including the VAE and text encoder, are kept frozen during training. Only a small set of low-rank weight adaptations, implemented via LoRA, are trained within the DiT blocks to enable efficient fine-tuning.
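The batch preparation steps described above (frame count sampling, data source selection, and depth-to-disparity normalization) can be sketched as follows. Several details are assumptions rather than facts from the text: $N$ is taken to be an integer drawn uniformly from 0 to 5 inclusive, the rescaling to $[-1, 1]$ uses per-clip min-max normalization, and all function names are hypothetical.

```python
import random

def sample_frame_count(rng):
    """F = 4N + 1, assuming integer N drawn uniformly from {0, ..., 5}."""
    n = rng.randint(0, 5)                 # randint is inclusive on both ends
    return 4 * n + 1

def pick_sources(num_frames):
    """Single-frame batches mix image and video data; longer ones use video only."""
    return ("image", "video") if num_frames == 1 else ("video",)

def depth_to_normalized_disparity(depth, eps=1e-6):
    """Convert positive depths to disparity, then min-max rescale to [-1, 1].

    Min-max rescaling is an assumption; the text only states the target range.
    """
    disparity = [1.0 / max(d, eps) for d in depth]
    lo, hi = min(disparity), max(disparity)
    span = max(hi - lo, eps)
    return [2.0 * (v - lo) / span - 1.0 for v in disparity]

rng = random.Random(0)
frame_count = sample_frame_count(rng)
sources = pick_sources(frame_count)
normalized = depth_to_normalized_disparity([0.5, 1.0, 2.0, 4.0])
```

Under the integer assumption, the possible frame counts are 1, 5, 9, 13, 17, and 21. The nearest depth (0.5) maps to the largest disparity and therefore to +1, while the farthest depth (4.0) maps to -1.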
Experiment
- Trained on synthetic datasets (HISS, DREDS, ClearGrasp, TransPhy3D) and real-world ClearPose, using AdamW with learning rate 1e-5, batch size 8, and 70K iterations on 8 H100 GPUs; inference uses 5 denoising steps with overlapping segment stitching for arbitrary-length video processing.
- On ClearPose and TransPhy3D-Test datasets, DKT achieves new SOTA performance, outperforming second-best methods by 5.69, 9.13, and 3.1 in δ₁.₀₅, δ₁.₁₀, δ₁.₂₅ on ClearPose, and 55.25, 40.53, and 9.97 on TransPhy3D-Test, demonstrating superior accuracy on transparent and reflective objects.
- DKT-1.3B achieves an inference time of 167.48 ms per frame at 832×480 resolution, 110.27 ms faster than DAv2-Large, with only 11.19 GB peak GPU memory, showing high efficiency suitable for robotic platforms.
- Ablation studies confirm LoRA fine-tuning outperforms naive finetuning, and 5 inference steps balance accuracy and efficiency, with no significant gain beyond this point.
- DKT-Normal-14B sets new SOTA in video normal estimation on ClearPose, significantly outperforming NormalCrafter and Marigold-E2E-FT in both accuracy and temporal consistency.
- Real-world grasping experiments on reflective, translucent, and diffusive surfaces show DKT-1.3B consistently outperforms baselines (DAv2-Large, DepthCrafter) across all settings, enabling successful robotic manipulation in complex scenes.
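The list above mentions overlapping segment stitching for arbitrary-length videos but gives no details. The sketch below shows one common way to do it and should be read as an assumption, not as the authors' procedure: windows of a fixed length are placed with a fixed overlap, and per-frame predictions are averaged wherever windows overlap. The function names, the window length of 21, and the overlap of 5 are all hypothetical. Any scale or shift alignment between segments is omitted.

```python
def segment_starts(total_frames, window, overlap):
    """Start indices of overlapping windows that together cover all frames."""
    if total_frames <= window:
        return [0]
    stride = window - overlap                       # requires overlap < window
    starts = list(range(0, total_frames - window, stride))
    starts.append(total_frames - window)            # last window ends on the final frame
    return starts

def stitch(segments, starts, total_frames):
    """Average per-frame predictions wherever windows overlap."""
    sums = [0.0] * total_frames
    counts = [0] * total_frames
    for start, seg in zip(starts, segments):
        for offset, value in enumerate(seg):
            sums[start + offset] += value
            counts[start + offset] += 1
    return [s / c for s, c in zip(sums, counts)]

total = 50
starts = segment_starts(total, window=21, overlap=5)
# Toy per-frame "predictions": each window simply predicts the frame index.
segments = [[float(i) for i in range(s, min(s + 21, total))] for s in starts]
stitched = stitch(segments, starts, total)
```

In this toy case the windows agree on the overlapping frames, so averaging reproduces the frame indices. With real predictions, a weighted blend across the overlap would give smoother transitions than a plain average.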
Results show that increasing the number of inference steps improves performance up to a point, with 5 steps achieving the best balance: the lowest REL and RMSE and the highest δ1.05, δ1.10, and δ1.25 scores. Beyond 5 steps the metrics no longer improve and begin to degrade, so additional steps add cost without benefit and may cost some detail.
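For reference, the metrics named above have standard definitions in the depth estimation literature: REL is the mean absolute relative error, RMSE is the root mean squared error, and δ at a given threshold is the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below that threshold. The sketch below computes them on a toy example. Whether the evaluation here aligns predictions to ground truth (for example in scale and shift) before computing them is not stated in the text, so that step is left out, and the function name `depth_metrics` is hypothetical.

```python
import math

def depth_metrics(pred, gt, thresholds=(1.05, 1.10, 1.25)):
    """REL, RMSE, and delta accuracies over paired positive depth values."""
    n = len(gt)
    rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    deltas = {th: sum(r < th for r in ratios) / n for th in thresholds}
    return rel, rmse, deltas

gt = [1.0, 2.0, 4.0, 5.0]
pred = [1.02, 2.16, 4.0, 6.5]
rel, rmse, deltas = depth_metrics(pred, gt)
```

Lower is better for REL and RMSE, and higher is better for the δ scores. The tight 1.05 threshold is the most demanding of the three, which is why gains on it are a sensitive indicator of fine accuracy.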

The authors use a LoRA fine-tuning strategy to improve model performance, with results showing that the 14B model achieves the best performance across all metrics, including lower REL and RMSE values and higher accuracy for δ1.05, δ1.10, and δ1.25 compared to the 1.3B model. The inclusion of LoRA significantly reduces computational cost while enhancing depth estimation accuracy.
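To make the LoRA idea concrete, the sketch below wraps a frozen weight matrix $W$ with a trainable low-rank update, so the layer computes $Wx + \frac{\alpha}{r} B A x$. It is a generic illustration of low-rank adaptation in plain Python, not the configuration used for DKT: the class name `LoRALinear`, the rank, the scaling `alpha`, and the constant initialization of `A` (random in practice, constant here for reproducibility) are all assumptions.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, a_init=0.01):
        self.weight = weight                                 # frozen, shape (out, in)
        out_dim, in_dim = len(weight), len(weight[0])
        self.A = [[a_init] * in_dim for _ in range(rank)]    # (rank, in), trainable
        self.B = [[0.0] * rank for _ in range(out_dim)]      # (out, rank), zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(W, rank=1)
x = [1.0, 1.0, 1.0]
y_before = layer(x)          # equals W @ x because B starts at zero
layer.B[0][0] = 2.0          # stand-in for a training update to B
y_after = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only `A` and `B` need gradients. For a large square weight the trainable count grows linearly with the layer width instead of quadratically, which is where the savings come from.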

The authors compare the computational efficiency of different depth estimation models, showing that DKT-1.3B achieves the fastest inference time of 167.48 ms per frame at a resolution of 832 × 480, 110.27 ms faster than DAv2-Large. This efficiency is achieved with a peak GPU memory usage of 11.19 GB, making it suitable for real-time robotic applications.

Results show that DKT achieves the best performance on both ClearPose and TransPhy3D-Test datasets, outperforming all baseline methods in most metrics. On ClearPose, DKT achieves the lowest REL and RMSE values and ranks first in all error metrics, while on TransPhy3D-Test, it achieves the best results in REL, RMSE, and all δ metrics, demonstrating superior accuracy and consistency in depth estimation.

Results show that DKT-1.3B achieves the highest performance across all object categories in the Translucent, Reflective, and Diffusive classes, with scores of 0.80, 0.59, and 0.81 respectively, outperforming RAW, DAv2, and DepthCrafter. The model also achieves the best mean score of 0.73, indicating superior overall depth estimation accuracy.
