High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion

Despite recent progress, existing frame interpolation methods still struggle with processing extremely high-resolution input and handling challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce a patch-based cascaded pixel diffusion model for high-resolution frame interpolation, HiFI, that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, can help significantly with large or complex motion that requires both global context for a coarse solution and detailed context for high-resolution output. However, contrary to prior work on cascaded diffusion models, which perform diffusion at increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. At inference time, this drastically reduces memory usage; it also allows a single model to solve both frame interpolation (the base model's task) and spatial upsampling, which saves training cost as well. HiFI excels at high-resolution images and complex repeated textures that require global context, achieving comparable or state-of-the-art performance on various benchmarks (Vimeo, Xiph, X-Test, and SEPE-8K). We further introduce a new dataset, LaMoR, that focuses on particularly challenging cases, on which HiFI significantly outperforms other baselines. Please visit our project page for video results: https://hifi-diffusion.github.io
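The patch-based cascade described above can be illustrated with a minimal sketch. It is not the HiFI implementation: `placeholder_model` is a hypothetical stand-in for the diffusion model (it only averages its inputs), and the 2x upsampling factor, the non-overlapping tiling, the patch size, and all function names are assumptions made for illustration. The sketch shows only the control flow: one model is called at a fixed patch resolution at every cascade level, conditioned on the two input frames and the upsampled prior solution.

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling of an (H, W, C) array."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def downsample(img, factor):
    """Box-filter downsampling by an integer factor (H and W must be divisible)."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def placeholder_model(patch0, patch1, prior):
    """Hypothetical stand-in for the diffusion model (assumption, not HiFI).

    It only blends its inputs; a real model would run a denoising loop here.
    """
    blend = 0.5 * (patch0 + patch1)
    return blend if prior is None else 0.5 * prior + 0.5 * blend

def cascaded_interpolate(frame0, frame1, model, patch=64, levels=3):
    """Run one fixed-resolution model over a coarse-to-fine cascade of patches."""
    prior = None
    for level in range(levels):
        # Coarsest level first; the last level is the full input resolution.
        factor = 2 ** (levels - 1 - level)
        f0, f1 = downsample(frame0, factor), downsample(frame1, factor)
        if prior is not None:
            prior = upsample2x(prior)  # previous solution, brought to this level
        h, w, _ = f0.shape
        out = np.empty_like(f0)
        # Non-overlapping tiling is a simplification; the model always sees
        # patches of the same size, whatever the level's resolution.
        for y in range(0, h, patch):
            for x in range(0, w, patch):
                tile = (slice(y, y + patch), slice(x, x + patch))
                p = None if prior is None else prior[tile]
                out[tile] = model(f0[tile], f1[tile], p)
        prior = out
    return prior

frame0 = np.zeros((256, 256, 3))
frame1 = np.ones((256, 256, 3))
middle = cascaded_interpolate(frame0, frame1, placeholder_model)
print(middle.shape)
```

With 256x256 inputs, three levels, and 64x64 patches, the model is called on 1, 4, and 16 patches at the successive levels, so the per-call memory footprint stays constant while the output resolution grows.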