Recurrent Video Restoration Transformer with Guided Deformable Attention

Video restoration aims at restoring multiple high-quality frames from multiple low-quality frames. Existing video restoration methods generally fall into two extreme cases, i.e., they either restore all frames in parallel or restore the video frame by frame in a recurrent way, each of which has its own merits and drawbacks. Typically, the former has the advantage of temporal information fusion, but it suffers from large model size and intensive memory consumption; the latter has a relatively small model size as it shares parameters across frames, but it lacks long-range dependency modeling ability and parallelizability. In this paper, we attempt to integrate the advantages of the two cases by proposing a recurrent video restoration transformer, namely RVRT. RVRT processes local neighboring frames in parallel within a globally recurrent framework, achieving a good trade-off between model size, effectiveness, and efficiency. Specifically, RVRT divides the video into multiple clips and uses the previously inferred clip feature to estimate the subsequent clip feature. Within each clip, different frame features are jointly updated with implicit feature aggregation. Across different clips, guided deformable attention is designed for clip-to-clip alignment, which predicts multiple relevant locations from the whole inferred clip and aggregates their features by the attention mechanism. Extensive experiments on video super-resolution, deblurring, and denoising show that the proposed RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory, and runtime.
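The clip-to-clip alignment described above can be illustrated with a minimal PyTorch-style sketch of guided deformable attention. All names here (GuidedDeformableAttention, warp, num_points) are hypothetical and the code omits the multi-frame, multi-clip bookkeeping of the actual RVRT model; it only shows the core idea of predicting several flow-guided sampling locations in the previously inferred clip and fusing the sampled features with attention weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp(feat, flow):
    """Bilinearly sample feat (N, C, H, W) at locations displaced by flow (N, 2, H, W)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat.dtype, device=feat.device),
        torch.arange(w, dtype=feat.dtype, device=feat.device),
        indexing="ij",
    )
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    grid_x = (xs + flow[:, 0]) / max(w - 1, 1) * 2 - 1
    grid_y = (ys + flow[:, 1]) / max(h - 1, 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)  # (N, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)


class GuidedDeformableAttention(nn.Module):
    """Hypothetical sketch: predict several sampling offsets around a guidance
    flow and fuse the sampled key features with softmax attention weights."""

    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Conv2d(dim * 2 + 2, 2 * num_points, 3, padding=1)
        self.attn_head = nn.Conv2d(dim * 2 + 2, num_points, 3, padding=1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, query_feat, key_feat, flow):
        # Offsets and attention weights are predicted from the query feature,
        # the flow-warped key feature, and the guidance flow itself.
        warped_key = warp(key_feat, flow)
        guide = torch.cat([query_feat, warped_key, flow], dim=1)
        offsets = self.offset_head(guide)               # (N, 2K, H, W)
        weights = self.attn_head(guide).softmax(dim=1)  # (N, K, H, W)
        out = 0
        for k in range(self.num_points):
            # Each sampling location is the guidance flow plus a learned offset.
            loc = flow + offsets[:, 2 * k:2 * k + 2]
            out = out + weights[:, k:k + 1] * warp(key_feat, loc)
        return self.proj(out)


# Toy usage: align one frame feature of the previous clip to the current frame.
gda = GuidedDeformableAttention(dim=64, num_points=4)
query = torch.randn(1, 64, 32, 32)
key = torch.randn(1, 64, 32, 32)
flow = torch.zeros(1, 2, 32, 32)
aligned = gda(query, key, flow)  # (1, 64, 32, 32)
```

In the clip-recurrent framework, such a module would be applied between the current clip and the whole previously inferred clip, so that each frame in the current clip attends to multiple relevant locations across all frames of the preceding clip rather than to a single flow-warped position.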