HyperAIHyperAI

Command Palette

Search for a command to run...

生成 prior を用いた決定論的ビデオ深度推定

DVD:生成事前情報に基づく決定論的视频深さ推定

ノートブックへ移動

概要

既存の動画深度推定には、根本的なトレードオフが存在する。生成モデルは確率的な幾何学的ハルシネーションやスケールのドリフトを引き起こす一方で、判別モデルは意味的な曖昧さを解消するために膨大なラベル付きデータを必要とする。この停滞を打破するため、本研究ではDVDを提案する。DVDは、事前学習済み動画Diffusionモデルを単一パスの深度回帰器に決定論的に適応させる初のフレームワークである。具体的には、DVDは以下の3つの核心的な設計を特徴とする:(i) 拡散タイムステップを構造的アンカーとして再利用し、グローバルな安定性と高周波詳細のバランスを取ること;(ii) ラテント多様体是正(Latent Manifold Rectification: LMR)により、回帰由来の過度な平滑化を緩和し、微分制約を課することで鋭い境界と一貫性のある動きを復元すること;(iii) グローバルアフィンの一貫性であり、これはウィンドウ間の発散を制限する内在的特性であり、複雑な時間的アライメントを必要とせずにシームレスな長時間動画推論を可能にする。広範な実験により、DVDがベンチマークにおいて最先端(SOTA)のゼロショットパフォーマンスを達成することが示された。さらに、DVDは、主要なベースラインと比較して163倍少ないタスク固有データを用いて、動画フーンドモデルに内蔵された深い幾何学的事前知識を解き放つことに成功した。特筆すべきは、我々はパイプライン全体を完全にオープンソース化し、オープンソースコミュニティに貢献するため、SOTA動画深度推定のための完全なトレーニングスイートを公開することである。

One-sentence Summary

The authors propose DVD, a framework that deterministically adapts pre-trained video diffusion models into single-pass depth regressors to address stochastic geometric hallucinations and massive labeled dataset requirements through latent manifold rectification and global affine coherence, achieving state-of-the-art zero-shot performance with 163× less task-specific data than leading baselines while unlocking profound geometric priors.

Key Contributions

  • DVD is the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. This approach bypasses stochastic sampling to resolve the ambiguity-hallucination dilemma while uniting semantic richness with structural stability.
  • Three core designs include repurposing the diffusion timestep as a structural anchor and employing latent manifold rectification to enforce differential constraints on motion. Global affine coherence enables seamless long-video inference without requiring complex temporal alignment.
  • Extensive experiments demonstrate state-of-the-art zero-shot performance across benchmarks using 163× less task-specific data than leading baselines. The complete training pipeline is fully released to benefit the open-source community.

Introduction

Video depth estimation serves as a foundational component for 3D scene understanding in fields ranging from autonomous driving to robotic manipulation. Current approaches face a fundamental trade-off where generative models suffer from stochastic geometric hallucinations and discriminative models demand massive labeled datasets to resolve semantic ambiguities. The authors introduce DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. This approach leverages a timestep structural anchor and latent manifold rectification to balance global stability with high-frequency details, enabling state-of-the-art zero-shot performance with 163 times less task-specific data than leading baselines.

Method

The authors propose a novel framework for video depth estimation that bridges the gap between generative priors and deterministic stability. The method operates within a compressed latent manifold to leverage the rich semantic knowledge of large-scale pre-trained models while ensuring geometric consistency.

The overall pipeline begins by formalizing video depth estimation as a mapping from an input RGB sequence xRF×3×H×Wx \in \mathbb{R}^{F \times 3 \times H \times W}xRF×3×H×W to a depth sequence dRF×H×Wd \in \mathbb{R}^{F \times H \times W}dRF×H×W. To exploit pre-trained capabilities, a frozen variational autoencoder (VAE) encoder E\mathcal{E}E projects both RGB and depth data into a unified latent space, defined as zx=E(x)z_x = \mathcal{E}(x)zx=E(x) and zd=E(d)z_d = \mathcal{E}(d)zd=E(d). The core objective is to learn a deterministic mapping Φ:zxzd\Phi: z_x \mapsto z_dΦ:zxzd that recovers geometric structure directly in this latent space, with the final depth reconstructed via a frozen VAE decoder.

Refer to the framework diagram for a visualization of the complete architecture, which encompasses both the training adaptation and the long-video inference stages.

In the deterministic adaptation phase, the authors repurpose a pre-trained Video Diffusion Transformer (Video DiT) as a one-step regressor rather than performing iterative stochastic denoising. Instead of solving an ordinary differential equation over a noise trajectory, the network Fθ\mathcal{F}_\thetaFθ performs a direct functional mapping z^d=Fθ(zx,τ)\hat{z}_d = \mathcal{F}_\theta(z_x, \tau)z^d=Fθ(zx,τ) in a single forward pass. A critical design choice involves the conditioning timestep τ\tauτ. Unlike standard diffusion models where ttt parameterizes noise levels during generation, the authors leverage τ\tauτ as a structural anchor. By fixing the timestep to an optimal state τ0\tau_0τ0 via a sinusoidal embedding, the model is calibrated to a specific geometric operating regime. This frequency-parameterized conditioning balances low-frequency global stability with high-frequency local detail recovery, preventing the over-smoothing often observed when adapting diffusion backbones to deterministic tasks.

To address the issue of mean collapse inherent in regression-based training, the framework introduces Latent Manifold Rectification (LMR). Standard point-wise losses tend to drive predictions toward the conditional expectation, washing out high-frequency structural details and causing temporal flickering. LMR counteracts this by enforcing differential consistency in the latent space without requiring auxiliary modules. The supervision strategy aligns the spatial gradients and temporal flows of the predicted latents with the ground truth. The spatial rectification loss Lsp\mathcal{L}_{sp}Lsp penalizes low-frequency collapse to preserve sharp boundaries, while the temporal rectification loss Ltemp\mathcal{L}_{temp}Ltemp synchronizes inter-frame dynamics to ensure coherent motion. These terms are combined with a global L2L_2L2 loss to form the total video objective, effectively preserving latent high-frequency structures against the smoothing effects of deterministic regression.

For long-duration video inference, memory constraints necessitate a sliding-window approach. While the deterministic backbone eliminates stochastic scale drift, the VAE decoder's context-dependent normalization can still induce fluctuations in depth values across windows. To resolve this, the authors exploit an inherent property of global affine coherence, observing that inter-window discrepancies can be approximated by linear scale-shift transformations. During inference, the system employs a Least-Squares Solver to estimate a global scale sss and shift ttt based on the overlap between adjacent windows. This affine calibration is broadcast to the entire current window, allowing for seamless blending of overlapping frames. This strategy enables robust, flicker-free depth estimation for long videos without requiring complex feature matching or recurrent temporal modules. Finally, the model is optimized via an image-video joint training strategy, where static images act as high-frequency spatial anchors while video sequences enforce temporal coherence, ensuring both spatial sharpness and temporal stability.

Experiment

The evaluation utilizes standard video and image depth benchmarks to compare the proposed DVD method against state-of-the-art generative and discriminative baselines across tasks involving temporal consistency, boundary accuracy, and single-image generalization. Results indicate that DVD achieves superior geometric fidelity and long-term temporal coherence while maintaining competitive inference speeds and requiring significantly less training data than comparable generative models. Further analysis confirms that deterministic adaptation and joint image-video training are critical for preventing geometric hallucinations and ensuring robust scalability across diverse open-world scenes.

The authors compare their DVD method against ChronoDepth, DepthCrafter, and VDA across KITTI, DIODE, and NYUv2 datasets. Results indicate that DVD achieves the best performance on KITTI and DIODE, outperforming both discriminative and generative baselines. On the NYUv2 dataset, the method demonstrates competitive accuracy, closely trailing the top-performing VDA while significantly surpassing other approaches. DVD achieves the highest accuracy on KITTI and DIODE benchmarks compared to all listed baselines. The proposed method significantly outperforms generative models like DepthCrafter across all evaluated datasets. DVD demonstrates robust single-image generalization, performing competitively against the strong VDA baseline on NYUv2.

The authors analyze the impact of overlap size on model performance and efficiency. Results indicate that increasing the overlap size consistently improves geometric accuracy and threshold metrics. However, this enhancement in quality is accompanied by a significant increase in relative inference time. Absolute relative error decreases as overlap size increases. Threshold accuracy improves with larger overlap configurations. Relative inference time rises steadily with larger overlap settings.

The authors evaluate DVD against state-of-the-art video depth estimation methods across standard benchmarks including Bonn, ScanNet, and KITTI. Results show that DVD consistently achieves superior geometric fidelity and temporal coherence compared to both generative and discriminative baselines. The method demonstrates significant efficiency gains by utilizing deterministic adaptation instead of iterative sampling. DVD achieves top-tier accuracy across all tested datasets, outperforming methods like DepthCrafter and VDA. The deterministic adaptation approach bypasses the computational bottlenecks associated with iterative generative sampling. Joint image-video training ensures the model retains high spatial precision while maintaining temporal consistency.

The the the table presents an ablation study evaluating the impact of LoRA rank on model performance. Results show that increasing the rank from the lowest setting significantly reduces error and improves accuracy metrics. However, moving from the medium to the highest rank yields diminishing returns, suggesting a moderate rank is sufficient for optimal performance. Increasing the LoRA rank leads to consistent improvements in error reduction and accuracy. Performance stabilizes at the medium rank, with the highest rank offering negligible additional benefits. The lowest rank setting demonstrates the weakest performance across all measured metrics.

The authors evaluate their DVD method against state-of-the-art video and image depth estimation baselines across multiple standard benchmarks. The results demonstrate that DVD achieves superior geometric fidelity and temporal coherence, consistently outperforming both generative and discriminative approaches on most metrics. Notably, the model attains these leading results while utilizing a significantly smaller training dataset compared to massive-scale competitors. DVD achieves the lowest absolute relative error on KITTI, ScanNet, and Bonn datasets. The method achieves the highest threshold accuracy across all four evaluated datasets. DVD demonstrates high data efficiency, achieving top results with a training set size much smaller than that of the VDA baseline.

The DVD method is evaluated against state-of-the-art discriminative and generative baselines across multiple standard datasets to assess geometric fidelity and temporal coherence. Results indicate that DVD consistently outperforms competitors in accuracy and efficiency, achieving top-tier performance even with a significantly smaller training dataset. Additional experiments validate that increasing overlap size enhances accuracy at the cost of inference time, while a moderate LoRA rank provides optimal performance without diminishing returns.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています