Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval
Fan Hu 1,2*, Aozhu Chen 1,2*, Ziyue Wang 1,2*, Fangming Zhou 1,2, Jianfeng Dong 3, Xirong Li 1,2†
Abstract
In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Different from previous research that considers feature fusion only at one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
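To make the core idea concrete, below is a minimal, hedged sketch of a convex-combination fusion layer in the spirit described by the abstract: each off-the-shelf feature is projected to a common space, a lightweight linear scorer produces per-feature attention weights via softmax, and the fused vector is the resulting convex combination. The class name, dimensions, and the tanh activation are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvexAttentionFusion(nn.Module):
    """Sketch of LAFF-style fusion (assumptions noted in the lead-in):
    project each feature to a shared space, score it with a single linear
    layer, and fuse via a softmax-weighted (convex) combination."""

    def __init__(self, input_dims, common_dim=512):
        super().__init__()
        # One linear projection per off-the-shelf feature.
        self.projections = nn.ModuleList(
            [nn.Linear(d, common_dim) for d in input_dims]
        )
        # A single scoring layer stands in for multi-head self-attention.
        self.scorer = nn.Linear(common_dim, 1, bias=False)

    def forward(self, features):
        # features: list of tensors, each of shape (batch, input_dims[i])
        projected = torch.stack(
            [torch.tanh(proj(f)) for proj, f in zip(self.projections, features)],
            dim=1,
        )  # (batch, num_features, common_dim)
        # Softmax over features yields non-negative weights summing to one.
        weights = F.softmax(self.scorer(projected).squeeze(-1), dim=1)
        # Convex combination of the projected features.
        fused = (weights.unsqueeze(-1) * projected).sum(dim=1)
        return fused, weights  # weights are interpretable for feature selection


if __name__ == "__main__":
    # Fuse three hypothetical video features of different dimensionalities.
    feats = [torch.randn(4, 2048), torch.randn(4, 1024), torch.randn(4, 512)]
    fusion = ConvexAttentionFusion(input_dims=[2048, 1024, 512])
    fused, weights = fusion(feats)
    print(fused.shape, weights.shape)  # (4, 512) and (4, 3)
```

Because the weights are an explicit softmax over features, they can be inspected directly, which matches the abstract's note that LAFF's interpretability supports feature selection.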