FLAVR: Flow-Agnostic Video Representations for Fast Frame Interpolation

A majority of methods for video frame interpolation compute bidirectional optical flow between adjacent frames of a video, followed by a suitable warping algorithm to generate the output frames. However, approaches relying on optical flow often fail to model occlusions and complex non-linear motions directly from the video and introduce additional bottlenecks unsuitable for widespread deployment. We address these limitations with FLAVR, a flexible and efficient architecture that uses 3D space-time convolutions to enable end-to-end learning and inference for video frame interpolation. Our method efficiently learns to reason about non-linear motions, complex occlusions, and temporal abstractions, resulting in improved performance on video interpolation, while requiring no additional inputs in the form of optical flow or depth maps. Due to its simplicity, FLAVR delivers 3x faster inference than the current most accurate method on multi-frame interpolation without losing interpolation accuracy. In addition, we evaluate FLAVR on a wide range of challenging settings and consistently demonstrate superior qualitative and quantitative results compared with prior methods on popular benchmarks including Vimeo-90K, UCF101, DAVIS, Adobe, and GoPro. Finally, we demonstrate that FLAVR for video frame interpolation can serve as a useful self-supervised pretext task for action recognition, optical flow estimation, and motion magnification.
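
To make the flow-free idea concrete, the sketch below is a deliberately simplified illustration (our own assumption, not the FLAVR architecture itself) of how a stack of 3D space-time convolutions can map several context frames directly to an intermediate frame, with no optical flow or warping step; layer widths, depths, and the temporal pooling are illustrative choices only.

import torch
import torch.nn as nn

class Conv3DInterpolator(nn.Module):
    """Toy flow-free interpolator: 3D convolutions over a frame stack."""

    def __init__(self, channels: int = 3, hidden: int = 64):
        super().__init__()
        # Encode joint spatio-temporal context from the input frames.
        self.encoder = nn.Sequential(
            nn.Conv3d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Project features back to RGB before collapsing the time axis.
        self.decoder = nn.Conv3d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, channels, time, height, width), e.g. 4 context frames.
        features = self.encoder(frames)
        # Average over the temporal dimension to predict one in-between frame.
        return self.decoder(features).mean(dim=2)

if __name__ == "__main__":
    model = Conv3DInterpolator()
    clip = torch.randn(1, 3, 4, 64, 64)  # four input frames
    mid_frame = model(clip)
    print(mid_frame.shape)  # torch.Size([1, 3, 64, 64])

Because the network consumes the raw frame stack end to end, occlusions and non-linear motion are handled implicitly by the learned 3D filters rather than by an explicit flow-plus-warping pipeline.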