Dual Encoding for Video Retrieval by Text

This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos with ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities first need to be encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from conventional common space learning algorithms, which are either concept based or latent space based, we introduce hybrid space learning, which combines the high performance of the latent space with the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective, and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method.
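To make the dual-encoding idea concrete, the sketch below shows a minimal latent-space-only variant in PyTorch: each modality has its own encoder, and both outputs are projected into a shared space where retrieval reduces to cosine similarity. The single-GRU encoders, feature dimensions, and layer names are illustrative assumptions for this sketch, not the paper's actual model, which uses multi-level (pooling, biGRU, and convolutional) encoding on each side and adds a concept-space branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderSketch(nn.Module):
    """Minimal sketch: one encoder per modality, both projected into a
    common latent space and compared by cosine similarity. Dimensions
    (2048-d frame features, 300-d word embeddings) are assumptions."""

    def __init__(self, frame_dim=2048, word_dim=300, hidden_dim=512, latent_dim=1024):
        super().__init__()
        # Video-side encoder: a GRU over precomputed per-frame CNN features.
        self.video_rnn = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        # Text-side encoder: a GRU over word embeddings of the query sentence.
        self.text_rnn = nn.GRU(word_dim, hidden_dim, batch_first=True)
        # Separate linear projections into the shared latent space.
        self.video_proj = nn.Linear(hidden_dim, latent_dim)
        self.text_proj = nn.Linear(hidden_dim, latent_dim)

    def encode_video(self, frame_feats):       # (batch, n_frames, frame_dim)
        _, h = self.video_rnn(frame_feats)     # final hidden state summarizes the clip
        return F.normalize(self.video_proj(h[-1]), dim=-1)

    def encode_text(self, word_embs):          # (batch, n_words, word_dim)
        _, h = self.text_rnn(word_embs)        # final hidden state summarizes the query
        return F.normalize(self.text_proj(h[-1]), dim=-1)

    def forward(self, frame_feats, word_embs):
        v = self.encode_video(frame_feats)
        t = self.encode_text(word_embs)
        return v @ t.T                         # cosine-similarity matrix for ranking
```

In this kind of common-space learning, the projections are typically trained with a triplet ranking loss over matched and mismatched video-sentence pairs; the hybrid space described in the abstract would additionally learn a concept-prediction branch so that retrieval scores remain interpretable.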