Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research that considers feature fusion at only one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
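
To make the idea of fusing features by a convex combination concrete, the following is a minimal sketch and not the paper's implementation. The abstract states only that the fusion is attentional and convex, so everything else here is an assumption made for illustration: the per-feature linear projection into a common space, the tanh non-linearity, the single scoring vector `attn_vector`, and the names `convex_fusion` and `random_projection`.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def convex_fusion(features, projections, attn_vector):
    """Fuse k features of different sizes into one d-dimensional vector.

    Assumed design (hypothetical, for illustration only):
      1. project each feature into a common d-dimensional space,
      2. score each projected feature with one shared vector,
      3. softmax the scores, giving non-negative weights that sum to 1,
      4. return the weighted sum, i.e. a convex combination.
    """
    projected = [[math.tanh(v) for v in linear(f, w, b)]
                 for f, (w, b) in zip(features, projections)]
    scores = [sum(a * v for a, v in zip(attn_vector, p)) for p in projected]
    weights = softmax(scores)
    dim = len(attn_vector)
    fused = [sum(wt * p[j] for wt, p in zip(weights, projected))
             for j in range(dim)]
    return fused, weights


def random_projection(rng, in_dim, out_dim):
    """Random (untrained) projection parameters, used only for the demo."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


rng = random.Random(0)
dims = [6, 4, 8]   # three hypothetical features of different sizes
common_dim = 5     # size of the shared space (an assumption)

features = [[rng.gauss(0.0, 1.0) for _ in range(n)] for n in dims]
projections = [random_projection(rng, n, common_dim) for n in dims]
attn_vector = [rng.uniform(-0.5, 0.5) for _ in range(common_dim)]

fused, weights = convex_fusion(features, projections, attn_vector)
```

In this sketch there is one scalar weight per input feature, so the cost grows linearly with the number of features and no pairwise correlations are computed, in contrast to self-attention. The same weights can be read directly as an indication of how much each feature contributes, which is the sense in which such a fusion could support feature selection.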