Diffusion Knows Transparency: Repurposing Video Diffusion for Depth and Normal Estimation of Transparent Objects
Abstract
Transparent objects remain particularly difficult for vision systems to perceive: refraction, reflection, and transmission break the assumptions underlying stereo, ToF, and purely discriminative monocular methods, leading to holes and temporally unstable estimates. Our key observation is that modern video diffusion models can already synthesize convincing transparent phenomena, suggesting that they have internalized the underlying laws of optics. We propose TransPhy3D, a synthetic corpus of transparent and reflective scene videos: 11,000 sequences rendered with Blender/Cycles. The scenes are built from a carefully curated bank of category-rich static objects and shape-rich procedural objects, paired with glass, plastic, or metal materials. We render RGB frames, depth maps, and normal maps with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training, we concatenate the RGB and (noised) depth latents inside the DiT backbone and co-train on TransPhy3D together with existing frame-wise synthetic datasets, which yields temporally consistent predictions for input videos of arbitrary length. The resulting model, DKT, achieves zero-shot state-of-the-art results on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves both accuracy and temporal consistency over strong image and video baselines, and a normal-estimation variant sets the best video normal estimation result on ClearPose. A compact 1.3B-parameter version runs at roughly 0.17 seconds per frame. Integrated into a grasping pipeline, DKT's depth improves success rates on translucent, reflective, and diffuse surfaces, surpassing prior estimators. Collectively, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed efficiently and label-free into robust, temporally coherent perception suited to real-world manipulation challenges.
One-sentence Summary
The authors, affiliated with Beijing Academy of Artificial Intelligence, Tsinghua University, and other institutions, propose DKT—a foundation model for video depth and normal estimation of transparent objects—by repurposing a video diffusion model via LoRA fine-tuning on TransPhy3D, a novel synthetic video dataset of 11k transparent/reflective scenes, achieving zero-shot SOTA performance on real and synthetic benchmarks, with applications in robotic grasping across translucent, reflective, and diffuse surfaces.
Key Contributions
- Transparent and reflective objects pose persistent challenges for depth estimation due to physical phenomena like refraction and reflection, which break assumptions in stereo, ToF, and monocular methods, leading to holes and temporal instability in predictions.
- The authors introduce TransPhy3D, a novel synthetic video dataset of 11,000 sequences (1.32M frames) with physically accurate rendering of transparent and reflective scenes, enabling training of video diffusion models on realistic transparent-object dynamics.
- By repurposing a large video diffusion model with lightweight LoRA adapters and co-training on TransPhy3D and existing frame-wise datasets, the proposed DKT model achieves zero-shot state-of-the-art performance on real and synthetic benchmarks, with improved accuracy and temporal consistency for video depth and normal estimation.
Introduction
Accurate depth estimation for transparent and reflective objects remains a critical challenge in 3D perception and robotic manipulation, as traditional methods relying on stereo or time-of-flight sensors fail due to refraction, reflection, and transmission. Prior data-driven approaches have been limited by small, static datasets and poor generalization, while existing video-based models struggle with temporal inconsistency. The authors introduce TransPhy3D, the first large-scale synthetic video dataset of transparent and reflective scenes—11,000 sequences (1.32M frames)—rendered with physically based ray tracing and OptiX denoising. They leverage this data to repurpose a pre-trained video diffusion model (VDM) into DKT, a foundation model for video depth and normal estimation, using lightweight LoRA adapters. By co-training on both video and frame-wise synthetic datasets, DKT achieves zero-shot state-of-the-art performance on real and synthetic benchmarks, delivering temporally coherent, high-accuracy predictions. The model runs efficiently at 0.17 seconds per frame and significantly improves robotic grasping success across diverse surface types, demonstrating that generative video priors can be effectively repurposed for robust, label-free perception of complex optical phenomena.
Dataset
- The dataset, TransPhy3D, is a novel synthetic video dataset designed for transparent and reflective objects, comprising 11,000 unique scenes with 120 frames each, totaling 1.32 million frames.
- It is built from two complementary sources: 574 high-quality static 3D assets collected from BlenderKit, filtered using Qwen2.5-VL-7B to identify transparent or reflective properties, and a procedurally generated set of parametric 3D assets that offer infinite shape variation through parameter tuning.
- A curated material library with diverse transparent (e.g., glass, plastic) and reflective (e.g., metal, glazed ceramic) materials is applied to ensure photorealistic rendering.
- Scenes are created using physics simulation: a randomly chosen number of assets (denoted M) is selected, initialized with 6-DOF poses and scales in a predefined environment (e.g., container, tabletop), and allowed to fall and collide to reach natural, physically plausible arrangements.
- Camera trajectories are generated as circular paths around the object's geometric center, with sinusoidal perturbations to introduce dynamic viewpoint variation (a minimal trajectory sketch follows this list); videos are rendered using Blender's Cycles engine with ray tracing for accurate light transport, including refraction and reflection.
- Final frames are denoised using NVIDIA OptiX-Denoiser to enhance visual quality.
- The authors use TransPhy3D to train DKT, a video depth estimation model, by fine-tuning a pretrained video diffusion model with LoRA. TransPhy3D serves as the primary video training source and is co-trained with existing frame-wise synthetic datasets; the diverse scene composition supports robust generalization.
- No cropping is applied; the full rendered frames are used. Metadata such as object categories, material types, and camera parameters are implicitly encoded in the scene generation pipeline and used during training.
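As a concrete illustration of the camera sampling described in the list above, the following is a minimal NumPy sketch of a circular trajectory with a sinusoidal perturbation. The radius, height, amplitude, and frequency values are illustrative assumptions, not parameters reported by the authors.

```python
import numpy as np

def circular_camera_path(center, radius=2.0, height=1.2, n_frames=120,
                         perturb_amp=0.15, perturb_freq=3.0):
    """Sample camera positions on a circle around `center`, with a sinusoidal
    perturbation of the radius and height to vary the viewpoint over time.
    All numeric defaults are illustrative; the paper does not specify them."""
    center = np.asarray(center, dtype=float)
    angles = np.linspace(0.0, 2.0 * np.pi, n_frames, endpoint=False)
    wobble = perturb_amp * np.sin(perturb_freq * angles)   # sinusoidal perturbation
    r = radius + wobble
    z = height + wobble
    positions = np.stack([center[0] + r * np.cos(angles),
                          center[1] + r * np.sin(angles),
                          center[2] + z], axis=-1)
    # Each camera looks at the geometric center of the scene.
    look_dirs = center[None, :] - positions
    look_dirs /= np.linalg.norm(look_dirs, axis=-1, keepdims=True)
    return positions, look_dirs
```

In practice, the returned positions and look directions would be converted into Blender camera poses, one per rendered frame (e.g., 120 frames per sequence as in TransPhy3D).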
Method
The authors leverage the WAN framework as the foundation for their video diffusion model, which consists of three core components: a variational autoencoder (VAE), a diffusion transformer (DiT) composed of multiple DiT blocks, and a text encoder. The VAE is responsible for compressing input videos into a lower-dimensional latent space and decoding predicted latents back into the image domain. The text encoder processes textual prompts into embeddings that guide the generation process. The diffusion transformer serves as the velocity predictor, estimating the velocity of the latent variables given noisy latents and text embeddings.
Refer to the framework diagram to understand the overall architecture. The model operates within a flow matching framework, which unifies the denoising diffusion process. During training, a noise latent $x_0$ is sampled from a standard normal distribution, and a clean latent $x_1$ is obtained from the dataset. An intermediate latent $x_t$ is generated by linearly interpolating between $x_0$ and $x_1$ at a randomly sampled timestep $t$:

$$x_t = (1 - t)\,x_0 + t\,x_1.$$

The ground-truth velocity $v_t$, the derivative of the interpolation path, is

$$v_t = \frac{dx_t}{dt} = x_1 - x_0.$$

The training objective minimizes the mean squared error (MSE) between the velocity predicted by the DiT model, $u$, and the ground-truth velocity $v_t$:

$$\mathcal{L} = \mathbb{E}_{x_0, x_1, c_{\text{txt}}, t}\,\big\| u(x_t, c_{\text{txt}}, t) - v_t \big\|^2,$$

where $c_{\text{txt}}$ denotes the text embedding.
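A minimal PyTorch sketch of this flow matching objective follows, assuming a generic velocity predictor `dit(xt, ctxt, t)` and 5-D video latents of shape (B, C, T, H, W); the call signature and tensor layout are assumptions for illustration, not WAN's actual API.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, x1, ctxt):
    """One flow-matching training step for a generic velocity predictor.

    x1:   clean latents from the VAE, shape (B, C, T, H, W)
    ctxt: text embeddings conditioning the DiT
    """
    x0 = torch.randn_like(x1)                      # noise sample x0 ~ N(0, I)
    t = torch.rand(x1.shape[0], device=x1.device)  # timestep t ~ U(0, 1)
    t_ = t.view(-1, 1, 1, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * x1                 # linear interpolation x_t
    vt = x1 - x0                                   # ground-truth velocity dx_t/dt
    u = dit(xt, ctxt, t)                           # predicted velocity
    return F.mse_loss(u, vt)
```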
The training strategy involves co-training on both existing frame-wise synthetic image datasets and the TransPhy3D video data, which improves data efficiency and reduces the computational burden of video rendering. As illustrated in the training pipeline figure, the process begins by sampling a frame count $F$ for the videos in the current batch:

$$F = 4N + 1, \qquad N \sim \mathcal{U}(0, 5).$$

If $F$ equals 1, the model samples a batch of paired data from both image and video datasets, treating each image as a single-frame video; otherwise, it samples exclusively from video datasets. The pipeline then converts the depth video in each pair into disparity, and both the RGB and depth videos are normalized to the range $[-1, 1]$ to align with the VAE's training space. These normalized videos are encoded by the VAE into their respective latents, $x_1^c$ for RGB and $x_1^d$ for depth. The clean depth latent $x_1^d$ is transformed into an intermediate latent $x_t^d$ using the same noise interpolation scheme as above, and the input to the DiT blocks is formed by concatenating $x_t^d$ and $x_1^c$ along the channel dimension. The training loss is the MSE between the DiT's output and the ground-truth velocity $v_t^d$ derived from the depth latent:

$$\mathcal{L} = \mathbb{E}_{x_0, x_1^d, x_1^c, c_{\text{txt}}, t}\,\big\| u(\mathrm{Concat}(x_t^d, x_1^c), c_{\text{txt}}, t) - v_t^d \big\|^2.$$

All model components, including the VAE and text encoder, are kept frozen during training; only a small set of low-rank weight adaptations (LoRA) inside the DiT blocks is trained, enabling efficient fine-tuning.
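The sketch below puts the pieces of this training step together in PyTorch: frame-count sampling, depth-to-disparity conversion, normalization, VAE encoding, noising of the depth latent, channel-wise concatenation with the clean RGB latent, and the velocity MSE. Function names, tensor shapes, and the normalization details are assumptions for illustration, not the authors' implementation.

```python
import random
import torch
import torch.nn.functional as F

def sample_frame_count():
    """F = 4N + 1 with N ~ U{0, ..., 5}, i.e. F in {1, 5, 9, 13, 17, 21}."""
    return 4 * random.randint(0, 5) + 1

def depth_training_step(vae, dit, rgb, depth, ctxt):
    """One conditional training step (hypothetical interfaces).

    rgb:   (B, 3, F, H, W) video frames
    depth: (B, 1, F, H, W) metric depth
    The noised depth latent is concatenated with the clean RGB latent along
    the channel dimension before the DiT; only LoRA weights inside the DiT
    receive gradients (the VAE and text encoder stay frozen).
    """
    disparity = 1.0 / depth.clamp(min=1e-6)              # depth -> disparity

    def to_unit_range(x):                                # map to [-1, 1]
        lo, hi = x.amin(), x.amax()                      # per-batch min-max
        return 2.0 * (x - lo) / (hi - lo + 1e-6) - 1.0   # is an assumption

    with torch.no_grad():                                # frozen VAE
        x1_c = vae.encode(to_unit_range(rgb))                              # RGB latent
        x1_d = vae.encode(to_unit_range(disparity).repeat(1, 3, 1, 1, 1))  # depth latent

    x0 = torch.randn_like(x1_d)
    t = torch.rand(x1_d.shape[0], device=x1_d.device)
    t_ = t.view(-1, 1, 1, 1, 1)
    xt_d = (1.0 - t_) * x0 + t_ * x1_d                   # noised depth latent
    vt_d = x1_d - x0                                     # target velocity

    u = dit(torch.cat([xt_d, x1_c], dim=1), ctxt, t)     # channel-wise concat
    return F.mse_loss(u, vt_d)
```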
Experiment
- Trained on synthetic datasets (HISS, DREDS, ClearGrasp, TransPhy3D) and real-world ClearPose, using AdamW with a learning rate of 1e-5, batch size 8, and 70K iterations on 8 H100 GPUs; inference uses 5 denoising steps with overlapping segment stitching for arbitrary-length video processing (see the sketch after this list).
- On ClearPose and TransPhy3D-Test datasets, DKT achieves new SOTA performance, outperforming second-best methods by 5.69, 9.13, and 3.1 in δ₁.₀₅, δ₁.₁₀, δ₁.₂₅ on ClearPose, and 55.25, 40.53, and 9.97 on TransPhy3D-Test, demonstrating superior accuracy on transparent and reflective objects.
- DKT-1.3B achieves 167.48ms per frame inference time at 832×480 resolution, surpassing DAv2-Large by 110.27ms, with only 11.19 GB peak GPU memory, showing high efficiency suitable for robotic platforms.
- Ablation studies confirm LoRA fine-tuning outperforms naive finetuning, and 5 inference steps balance accuracy and efficiency, with no significant gain beyond this point.
- DKT-Normal-14B sets new SOTA in video normal estimation on ClearPose, significantly outperforming NormalCrafter and Marigold-E2E-FT in both accuracy and temporal consistency.
- Real-world grasping experiments on reflective, translucent, and diffusive surfaces show DKT-1.3B consistently outperforms baselines (DAv2-Large, DepthCrafter) across all settings, enabling successful robotic manipulation in complex scenes.
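Referring back to the overlapping-segment inference mentioned in the first bullet, the following is a hedged sketch of how a fixed-length video predictor could be slid over an arbitrary-length clip, with linear blending in the overlaps. The segment length, overlap size, and weighting scheme are assumptions; the paper only states that overlapping segments are stitched.

```python
import torch

def predict_long_video(predict_segment, video, seg_len=21, overlap=4):
    """Apply a fixed-length video depth predictor to a video of arbitrary
    length by sliding overlapping windows and blending the overlap regions.

    predict_segment: callable mapping (L, C, H, W) frames -> (L, H, W) depth
                     (placeholder for the DKT segment call).
    """
    T, _, H, W = video.shape
    depth = torch.zeros(T, H, W)
    weight = torch.zeros(T, 1, 1)
    stride = seg_len - overlap
    starts = list(range(0, max(T - seg_len, 0) + 1, stride))
    if not starts or starts[-1] + seg_len < T:
        starts.append(max(T - seg_len, 0))         # make sure the tail is covered
    for s in starts:
        seg = video[s:s + seg_len]
        pred = predict_segment(seg)                # (L, H, W) depth for this window
        w = torch.ones(len(seg), 1, 1)
        if s > 0:                                  # ramp up over the overlap region
            w[:overlap, 0, 0] = torch.linspace(0.0, 1.0, overlap)
        depth[s:s + len(seg)] += w * pred
        weight[s:s + len(seg)] += w
    return depth / weight.clamp(min=1e-6)          # weighted average in overlaps
```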
Results show that increasing inference steps improves performance up to a point, with 5 steps achieving the best balance—yielding the lowest REL and RMSE and the highest δ1.05, δ1.10, and δ1.25 scores. Beyond 5 steps, performance degrades, indicating diminishing returns and potential loss of detail.

With the LoRA fine-tuning strategy, the 14B model achieves the best performance across all metrics, with lower REL and RMSE and higher δ1.05, δ1.10, and δ1.25 accuracy than the 1.3B model. LoRA itself substantially reduces computational cost while improving depth estimation accuracy over naive fine-tuning.

The authors compare the computational efficiency of different depth estimation models, showing that DKT-1.3B achieves the fastest inference time of 167.48ms per frame at a resolution of 832 × 480, outperforming DAv2-Large by 110.27ms. This efficiency is achieved with a peak GPU memory usage of 11.19 GB, making it suitable for real-time robotic applications.

Results show that DKT achieves the best performance on both ClearPose and TransPhy3D-Test datasets, outperforming all baseline methods in most metrics. On ClearPose, DKT achieves the lowest REL and RMSE values and ranks first in all error metrics, while on TransPhy3D-Test, it achieves the best results in REL, RMSE, and all δ metrics, demonstrating superior accuracy and consistency in depth estimation.

Results show that DKT-1.3B achieves the highest performance across all object categories in the Translucent, Reflective, and Diffusive classes, with scores of 0.80, 0.59, and 0.81 respectively, outperforming RAW, DAv2, and DepthCrafter. The model also achieves the best mean score of 0.73, indicating superior overall depth estimation accuracy.
