Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

Fan Hu¹,²*, Aozhu Chen¹,²*, Ziyue Wang¹,²*, Fangming Zhou¹,², Jianfeng Dong³, Xirong Li¹,²†

Abstract

In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Different from previous research that considers feature fusion only at one end, let it be video or text, we aim for feature fusion for both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferred to modeling their correlations by computationally heavy multi-head self attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
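
The convex-combination idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the architecture, so the tanh projection into a shared space, the single scoring vector, and every name below (`convex_fusion`, `projections`, `attn_vector`) are assumptions made for illustration only. The sketch projects several features of different sizes to a common dimension, scores each one, turns the scores into non-negative weights that sum to one, and returns the weighted sum.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def convex_fusion(features, projections, attn_vector):
    """Fuse k features by a convex combination (hypothetical sketch).

    features:    list of k 1-D arrays, possibly of different lengths
    projections: list of k matrices, the i-th of shape (d, len(features[i]))
    attn_vector: 1-D array of length d used to score each projected feature
    """
    # Assumption: map every feature into a shared d-dimensional space.
    projected = np.stack(
        [np.tanh(W @ f) for W, f in zip(projections, features)]
    )  # shape (k, d)

    # One scalar score per feature, then weights that are >= 0 and sum to 1.
    scores = projected @ attn_vector   # shape (k,)
    weights = softmax(scores)          # shape (k,)

    # Convex combination of the projected features.
    fused = weights @ projected        # shape (d,)
    return fused, weights


# Toy usage with random stand-ins for off-the-shelf features.
rng = np.random.default_rng(0)
dims = [6, 4, 8]   # hypothetical feature sizes
d = 5              # hypothetical shared dimension
features = [rng.normal(size=n) for n in dims]
projections = [rng.normal(size=(d, n)) for n in dims]
attn_vector = rng.normal(size=d)

fused, weights = convex_fusion(features, projections, attn_vector)
```

Because the weights form a probability distribution over the input features, they can be inspected directly: a feature that consistently receives a weight near zero contributes little to the fused vector, which is one plausible reading of how such weights could support the feature selection mentioned in the abstract.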

