
Abstract

Video Moment Retrieval and Highlight Detection aim to find the content in a video that corresponds to a text query. Existing models usually first use contrastive learning to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode that information. However, existing methods face several issues: (1) overlapping semantic information between different samples in the dataset hinders the model's multimodal alignment; (2) existing models cannot efficiently extract local features of the video; (3) the Transformer Decoder used by existing models cannot adequately decode multimodal features. To address these issues, we propose the LD-DETR model for Video Moment Retrieval and Highlight Detection. Specifically, we first distill the similarity matrix into the identity matrix to mitigate the impact of overlapping semantic information. Then, we design a method that enables convolutional layers to extract multimodal local features more efficiently. Finally, we feed the output of the Transformer Decoder back into itself to adequately decode multimodal information. We evaluate LD-DETR on four public benchmarks and conduct extensive experiments to demonstrate the superiority and effectiveness of our approach. Our model outperforms state-of-the-art models on the QVHighlight, Charades-STA and TACoS datasets. Our code is available at https://github.com/qingchen239/ld-detr.
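
The last step, feeding the decoder output back into the decoder, can be illustrated with a minimal sketch. It is not the LD-DETR implementation: the names `loop_decode` and `toy_decoder`, the `num_loops` parameter, and the averaging update are hypothetical stand-ins chosen so the example runs without any deep learning library. The only idea taken from the abstract is that the same decoder is applied repeatedly, with each output becoming the next set of queries.

```python
from typing import Callable, List

Vector = List[float]
Decoder = Callable[[List[Vector], List[Vector]], List[Vector]]


def loop_decode(decoder: Decoder, queries: List[Vector],
                memory: List[Vector], num_loops: int = 3) -> List[Vector]:
    """Apply the same decoder num_loops times.

    The output of each pass is fed back in as the queries of the next
    pass, while the multimodal memory stays fixed.
    """
    if num_loops < 1:
        raise ValueError("num_loops must be at least 1")
    out = queries
    for _ in range(num_loops):
        out = decoder(out, memory)
    return out


def toy_decoder(queries: List[Vector], memory: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a Transformer decoder (an assumption).

    Moves every query halfway toward the mean of the memory features.
    """
    dim = len(memory[0])
    mean = [sum(m[d] for m in memory) / len(memory) for d in range(dim)]
    return [[0.5 * q[d] + 0.5 * mean[d] for d in range(dim)] for q in queries]


if __name__ == "__main__":
    memory = [[1.0, 3.0], [3.0, 5.0]]   # toy fused video-text features
    queries = [[0.0, 0.0]]              # toy moment query
    print(loop_decode(toy_decoder, queries, memory, num_loops=1))
    print(loop_decode(toy_decoder, queries, memory, num_loops=3))
```

With the toy decoder, each extra pass moves the query closer to the memory mean, which is the intuition behind reusing the decoder: later passes start from an already refined estimate. In a real model the stand-in would presumably be replaced by a Transformer decoder whose parameters are shared across passes.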
