
DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Jaewon Min Jaeeun Lee Yeji Choi Paul Hyunbin Cho Jin Hyeon Kim Tae-Young Lee Jongsik Ahn Hwayeong Lee Seonghyun Park Seungryong Kim

Abstract

Optical flow models trained on high-quality data tend to degrade sharply when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, this work defines a new task, Degradation-Aware Optical Flow, which aims to estimate accurate dense correspondences from videos with real-world degradations. Our core insight is that the intermediate representations of Diffusion models for image restoration are inherently degradation-sensitive, yet lack temporal perception. To address this, we equip the model with attention across adjacent frames, enabling full spatio-temporal attention. Experiments demonstrate that the resulting features possess zero-shot correspondence capabilities. Building on this finding, we propose DA-Flow, a hybrid architecture that fuses these Diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods across multiple benchmarks, even under severe degradation.

One-sentence Summary

Researchers from KAIST AI and Hanwha Systems introduce DA-Flow, a hybrid optical flow model that lifts pretrained image restoration Diffusion features with full spatio-temporal attention to achieve robust dense correspondence estimation under severe real-world corruptions where existing methods fail.

Key Contributions

  • The paper formulates Degradation-Aware Optical Flow as a new task designed to estimate accurate dense correspondences from severely corrupted videos rather than focusing solely on robustness.
  • A pretrained image restoration Diffusion model is lifted to handle multiple frames by injecting inter-frame attention, creating features that encode geometric correspondence even under severe corruption.
  • DA-Flow is introduced as a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework, demonstrating superior performance on degraded benchmarks where existing methods fail.

Introduction

Optical flow estimation is critical for video analysis, yet existing models trained on clean data fail significantly when faced with real-world corruptions like blur, noise, and compression artifacts. Prior attempts to address this often rely on synthetic data augmentation or video diffusion backbones that entangle temporal information too early, which destroys the independent spatial structure required for precise pixel-level matching. The authors leverage intermediate features from pretrained image restoration Diffusion models, which naturally encode degradation patterns and geometric structure, and lift them to handle video by injecting cross-frame attention. They introduce DA-Flow, a hybrid architecture that fuses these degradation-aware diffusion features with standard convolutional features to achieve robust optical flow estimation under severe corruption where previous methods fail.

Method

The proposed method addresses Degradation-Aware Optical Flow by leveraging a pretrained DiT-based image restoration model. The authors first lift this image-level model to the video domain to enable temporal reasoning. In the original MM-DiT architecture, the temporal dimension is folded into the batch axis, causing the model to process each frame independently. To overcome this limitation, the authors reshape the modality streams to concatenate spatial tokens across all frames, transforming $\mathbf{F}_{m} \in \mathbb{R}^{(BF) \times T \times C}$ into $\tilde{\mathbf{F}}_{m} \in \mathbb{R}^{B \times (FT) \times C}$. This modification enables full spatio-temporal attention, where tokens can attend to all spatial locations across the entire video sequence.
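The reshaping step can be illustrated with a minimal NumPy sketch; the dimensions B, F, T, C below are arbitrary toy values, not those of the actual model.

```python
import numpy as np

# Toy dimensions: B videos, F frames, T spatial tokens per frame, C channels.
B, F, T, C = 2, 4, 16, 8

# Original MM-DiT layout: the frame axis is folded into the batch axis,
# so attention is computed over the T tokens of each frame independently.
frame_wise = np.random.randn(B * F, T, C)

# Lifted layout: concatenate the spatial tokens of all F frames along the
# sequence axis so attention spans the whole clip (full spatio-temporal).
spatio_temporal = frame_wise.reshape(B, F * T, C)

# Frame f of video b now occupies rows [f*T, (f+1)*T) of sequence b.
assert np.array_equal(spatio_temporal[0, T:2 * T], frame_wise[1])
```

Because the reshape is purely a re-indexing of the token axis, the lifted model can reuse the pretrained attention weights unchanged; only the sequence over which attention is computed grows from T to FT tokens.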

(Figure: lifting MM-DiT from frame-wise to full spatio-temporal attention; image not reproduced here.)

Building upon this lifted architecture, the authors introduce DA-Flow, a degradation-aware optical flow model. The pipeline retains the correlation and iterative update operators from RAFT but replaces the standard feature encoder with a hybrid system, formulated as $\mathcal{M}_{\theta} = \mathcal{U} \circ \mathcal{C} \circ (\mathrm{Up}(\mathcal{D}_{\phi}), \mathcal{E})$. This system combines features from the lifted diffusion model with a conventional CNN encoder. Since the diffusion features operate on a coarse grid, DPT-based heads upsample them to a resolution compatible with the CNN features. Specifically, separate heads generate query, key, and context features from the diffusion model. These upsampled features are concatenated with the CNN features to form hybrid representations. The correlation operator then constructs a cost volume from the query and key features, while the context features condition the iterative update operator to refine the flow estimate. The model is trained with a multi-scale flow loss on pseudo ground-truth labels derived from high-quality frame pairs: $\mathcal{L}_{\mathrm{flow}} = \sum_{i=1}^{M} \gamma^{M-i} \left\| \mathbf{f}_{k \to k+1}^{(i)} - \mathbf{f}_{k \to k+1}^{*} \right\|_{1}$.
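The training objective can be sketched in NumPy as follows. The function name, the list-of-arrays interface, and the per-pixel averaging are illustrative assumptions (in the spirit of RAFT's sequence loss), not the authors' implementation.

```python
import numpy as np

def multiscale_flow_loss(flow_preds, flow_gt, gamma=0.8):
    """Exponentially weighted L1 loss over M intermediate flow predictions.

    flow_preds : list of M arrays, each (H, W, 2) - estimates from successive
                 refinement iterations, earliest first
    flow_gt    : (H, W, 2) pseudo ground-truth flow from clean frame pairs
    gamma      : decay factor; the weight gamma**(M - i) grows toward 1 for
                 later iterations, so final refinements dominate the loss
    """
    M = len(flow_preds)
    loss = 0.0
    for i, pred in enumerate(flow_preds, start=1):
        # L1 norm of the flow error per pixel, averaged over the image.
        per_pixel_l1 = np.abs(pred - flow_gt).sum(axis=-1)
        loss += gamma ** (M - i) * per_pixel_l1.mean()
    return loss

# Toy check: two identical unit-offset predictions against a zero-flow target.
preds = [np.ones((4, 4, 2)), np.ones((4, 4, 2))]
gt = np.zeros((4, 4, 2))
loss = multiscale_flow_loss(preds, gt, gamma=0.8)  # 0.8 * 2.0 + 1.0 * 2.0 = 3.6
```

Supervising every intermediate prediction, rather than only the final one, gives the iterative update operator a gradient signal at each refinement step while still prioritizing the final estimate.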

Experiment

  • Diffusion feature analysis validates that query and key features from full spatio-temporal attention layers in a finetuned lifted model exhibit superior zero-shot geometric correspondence compared to untrained baselines, with stable performance across denoising timesteps.
  • Quantitative evaluations on Sintel, Spring, and TartanAir benchmarks demonstrate that DA-Flow outperforms existing methods in handling degraded inputs, achieving lower endpoint errors and significantly reduced outlier rates.
  • Qualitative results confirm that the proposed method recovers sharp and coherent flow fields under severe corruption, whereas baseline approaches produce noisy artifacts around motion boundaries and fine structures.
  • Ablation studies verify that the performance gains stem from the lifted diffusion features rather than simple fine-tuning of conventional networks, and that combining diffusion features with a CNN encoder and DPT-based upsampling is essential for optimal accuracy.
  • Application tests in video restoration show that the accurate flow estimates enable effective temporal alignment, reducing flickering and improving structural stability across consecutive frames.
