Abstract

Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To remove this dependency, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively excavates the alignments between multi-sentence descriptions and corresponding event segments. Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments, i.e., text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost to mitigate the sub-optimal matching results caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2 and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge. Our code is publicly available at https://github.com/zjr2000/GVL.
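
The abstract does not give the exact form of the semantic-aware matching cost, so the following is only a minimal sketch of the general idea, not the implementation in the linked repository. It assumes the cost is a weighted sum of a boundary term (1 minus temporal IoU) and a semantic term (1 minus cosine similarity in the joint space), and it finds the minimum-cost one-to-one assignment by exhaustive search, which is only practical for small sets. The names `temporal_iou`, `cosine`, `match_events_to_texts`, and the weight `alpha` are hypothetical.

```python
from itertools import permutations
from math import sqrt


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def match_events_to_texts(proposals, prop_emb, gt_segments, text_emb, alpha=0.5):
    """Assign each sentence to a distinct event proposal at minimum total cost.

    Assumed cost (not taken from the paper):
        alpha * (1 - IoU) + (1 - alpha) * (1 - cosine similarity)
    Returns (assignment, total_cost), where assignment[j] is the index of the
    proposal matched to sentence j.
    """
    n_prop, n_text = len(proposals), len(gt_segments)
    if n_prop < n_text:
        raise ValueError("need at least as many proposals as sentences")
    cost = [
        [
            alpha * (1.0 - temporal_iou(proposals[i], gt_segments[j]))
            + (1.0 - alpha) * (1.0 - cosine(prop_emb[i], text_emb[j]))
            for j in range(n_text)
        ]
        for i in range(n_prop)
    ]
    best, best_cost = None, float("inf")
    # Exhaustive search over one-to-one assignments (fine for tiny sets only).
    for perm in permutations(range(n_prop), n_text):
        total = sum(cost[perm[j]][j] for j in range(n_text))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best), best_cost


# Toy data: the annotated segment of sentence 0 overlaps proposals 0 and 1
# equally (IoU = 1/3 each), so the boundary term alone cannot separate them.
proposals = [(0.0, 10.0), (10.0, 20.0), (30.0, 40.0)]
prop_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gt_segments = [(5.0, 15.0), (30.0, 40.0)]
text_emb = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

assignment, total = match_events_to_texts(proposals, prop_emb, gt_segments, text_emb)
print(assignment)  # [1, 2]
```

In this toy case the semantic term breaks the tie created by the ambiguous boundary, so sentence 0 is matched to proposal 1. For realistic set sizes, a polynomial-time solver such as the Hungarian algorithm would replace the exhaustive search.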
