DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models
Jaewon Min Jaeeun Lee Yeji Choi Paul Hyunbin Cho Jin Hyeon Kim Tae-Young Lee Jongsik Ahn Hwayeong Lee Seonghyun Park Seungryong Kim
Abstract
Optical flow models trained on high-quality data tend to degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we define Degradation-Aware Optical Flow, a new task that targets accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration Diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we introduce full spatio-temporal attention so that the model can attend across adjacent frames, and we empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Building on this finding, we propose DA-Flow, a hybrid architecture that fuses Diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.
One-sentence Summary
Researchers from KAIST AI and Hanwha Systems introduce DA-Flow, a hybrid optical flow model that lifts pretrained image restoration Diffusion features with full spatio-temporal attention to achieve robust dense correspondence estimation under severe real-world corruptions where existing methods fail.
Key Contributions
- The paper formulates Degradation-Aware Optical Flow as a new task designed to estimate accurate dense correspondences from severely corrupted videos rather than focusing solely on robustness.
- A pretrained image restoration Diffusion model is lifted to handle multiple frames by injecting inter-frame attention, creating features that encode geometric correspondence even under severe corruption.
- DA-Flow is introduced as a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework, demonstrating superior performance on degraded benchmarks where existing methods fail.
Introduction
Optical flow estimation is critical for video analysis, yet existing models trained on clean data fail significantly when faced with real-world corruptions like blur, noise, and compression artifacts. Prior attempts to address this often rely on synthetic data augmentation or video diffusion backbones that entangle temporal information too early, which destroys the independent spatial structure required for precise pixel-level matching. The authors leverage intermediate features from pretrained image restoration Diffusion models, which naturally encode degradation patterns and geometric structure, and lift them to handle video by injecting cross-frame attention. They introduce DA-Flow, a hybrid architecture that fuses these degradation-aware diffusion features with standard convolutional features to achieve robust optical flow estimation under severe corruption where previous methods fail.
Method
The proposed method addresses Degradation-Aware Optical Flow by leveraging a pretrained DiT-based image restoration model. The authors first lift this image-level model to the video domain to enable temporal reasoning. In the original MM-DiT architecture, the temporal dimension is folded into the batch axis, so the model processes each frame independently. To overcome this limitation, the authors reshape the modality streams to concatenate spatial tokens across all frames, transforming $F_m \in \mathbb{R}^{(BF) \times T \times C}$ into $\tilde{F}_m \in \mathbb{R}^{B \times (FT) \times C}$, where B is the batch size, F the number of frames, T the number of spatial tokens per frame, and C the channel dimension. This modification enables full spatio-temporal attention, in which every token can attend to all spatial locations across the entire video sequence.
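The token regrouping described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function name `lift_tokens`, the nested-list representation, and the assumption that frame f of video b sits at index b * F + f of the folded axis are all illustrative choices.

```python
def lift_tokens(tokens, batch, frames):
    """Regroup per-frame tokens [(B*F)][T][C] into per-video tokens [B][F*T][C]."""
    if len(tokens) != batch * frames:
        raise ValueError("leading axis must equal batch * frames")
    lifted = []
    for b in range(batch):
        video_tokens = []
        for f in range(frames):
            # Assumed layout: the frame axis was folded into the batch axis
            # in batch-major order, so frame f of video b is at b * frames + f.
            video_tokens.extend(tokens[b * frames + f])
        lifted.append(video_tokens)
    return lifted


# Toy example: B=2 videos, F=3 frames, T=4 tokens per frame, C=1 channel.
B, F, T, C = 2, 3, 4, 1
folded = [[[float(i * T + t)] * C for t in range(T)] for i in range(B * F)]
lifted = lift_tokens(folded, batch=B, frames=F)
print(len(lifted), len(lifted[0]), len(lifted[0][0]))  # 2 12 1
```

After this regrouping, an attention layer applied along the second axis sees the F*T tokens of a whole clip at once rather than the T tokens of a single frame. With an array library the same regrouping would typically be a single reshape, under the same ordering assumption.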
As shown in the figure below:

Building upon this lifted architecture, the authors introduce DA-Flow, a degradation-aware optical flow model. The pipeline retains the correlation and iterative update operators from RAFT but replaces the standard feature encoder with a hybrid system. The overall pipeline can be formulated as $M_\theta = U \circ C \circ (\mathrm{Up}(D_\phi), E)$. This system combines features from the lifted diffusion model with a conventional CNN encoder. Since the diffusion features operate on a coarse grid, DPT-based heads are employed to upsample them to a resolution compatible with the CNN features. Specifically, separate heads generate query, key, and context features from the diffusion model. These upsampled features are concatenated with the CNN features to form hybrid representations. The correlation operator then constructs a cost volume from the query and key features, while the context features condition the iterative update operator to refine the flow estimate. The model is trained using a multi-scale flow loss with pseudo ground-truth labels derived from high-quality frame pairs, defined as $L_{\text{flow}} = \sum_{i=1}^{M} \gamma^{M-i} \lVert f^{(i)}_{k \to k+1} - f^{*}_{k \to k+1} \rVert_1$.
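To make the weighting in the flow loss concrete, here is a small sketch in plain Python. It rests on several assumptions: flow fields are represented as lists of (u, v) pairs, the L1 norm is taken as a plain sum over pixels (practical implementations often average instead), and the names `l1_flow_distance` and `flow_loss` as well as the value of gamma are illustrative and not taken from the paper.

```python
def l1_flow_distance(pred, target):
    """Sum of absolute differences between two flow fields.

    Each flow field is a list of (u, v) displacement pairs, one per pixel.
    """
    return sum(abs(pu - tu) + abs(pv - tv)
               for (pu, pv), (tu, tv) in zip(pred, target))


def flow_loss(predictions, target, gamma):
    """Weighted sum over M predictions; prediction i gets weight gamma**(M - i)."""
    m = len(predictions)
    return sum(gamma ** (m - i) * l1_flow_distance(pred, target)
               for i, pred in enumerate(predictions, start=1))


# Toy example with two pixels and M=3 predictions that approach the target.
target = [(1.0, 0.0), (0.0, 2.0)]
predictions = [
    [(0.0, 0.0), (0.0, 0.0)],  # L1 distance 3, weight 0.5**2
    [(1.0, 0.0), (0.0, 1.0)],  # L1 distance 1, weight 0.5**1
    [(1.0, 0.0), (0.0, 2.0)],  # L1 distance 0, weight 0.5**0
]
loss = flow_loss(predictions, target, gamma=0.5)
print(loss)  # 1.25
```

Because gamma is below 1, the exponent M - i gives the last prediction the largest weight, so errors in the final estimate dominate the loss while earlier predictions still receive a training signal.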
Experiment
- Diffusion feature analysis validates that query and key features from full spatio-temporal attention layers in a finetuned lifted model exhibit superior zero-shot geometric correspondence compared to untrained baselines, with stable performance across denoising timesteps.
- Quantitative evaluations on Sintel, Spring, and TartanAir benchmarks demonstrate that DA-Flow outperforms existing methods in handling degraded inputs, achieving lower endpoint errors and significantly reduced outlier rates.
- Qualitative results confirm that the proposed method recovers sharp and coherent flow fields under severe corruption, whereas baseline approaches produce noisy artifacts around motion boundaries and fine structures.
- Ablation studies verify that the performance gains stem from the lifted diffusion features rather than simple fine-tuning of conventional networks, and that combining diffusion features with a CNN encoder and DPT-based upsampling is essential for optimal accuracy.
- Application tests in video restoration show that the accurate flow estimates enable effective temporal alignment, reducing flickering and improving structural stability across consecutive frames.
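The endpoint error and outlier rate cited in the benchmark results above can be sketched as follows. This is a generic illustration, not the evaluation code used in the paper: the flow representation as (u, v) pairs, the function names, and the 1-pixel outlier threshold are assumptions, and each benchmark defines its own outlier criterion.

```python
import math


def endpoint_errors(pred, target):
    """Per-pixel Euclidean distance between predicted and reference flow vectors."""
    return [math.hypot(pu - tu, pv - tv)
            for (pu, pv), (tu, tv) in zip(pred, target)]


def average_epe(pred, target):
    """Mean endpoint error over all pixels."""
    errors = endpoint_errors(pred, target)
    return sum(errors) / len(errors)


def outlier_rate(pred, target, threshold=1.0):
    """Fraction of pixels whose endpoint error exceeds `threshold` (assumed value)."""
    errors = endpoint_errors(pred, target)
    return sum(e > threshold for e in errors) / len(errors)


# Toy example with four pixels; only the first prediction is far off.
target = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (0.0, -3.0)]
pred = [(3.0, 4.0), (1.0, 1.0), (2.0, 0.5), (0.0, -3.0)]
print(average_epe(pred, target))   # 1.375
print(outlier_rate(pred, target))  # 0.25
```

The two metrics answer different questions: the average endpoint error is pulled up by a few large mistakes, while the outlier rate reports how many pixels are badly wrong regardless of by how much.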