Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with transformer-based architectures. However, we observe that the emphases of MR and HD differ: one necessitates the perception of local relationships, while the other prioritizes the understanding of global contexts. Consequently, the lack of task-specific design inevitably limits the association of the intrinsic specialties of the two tasks. To tackle this issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra- and inter-modality interactions across multiple granularities, UVCOM achieves a comprehensive understanding in processing a video. Moreover, we present multi-aspect contrastive learning to consolidate local relation modeling and global knowledge accumulation via a well-aligned multi-modal space. Extensive experiments on the QVHighlights, Charades-STA, TACoS, YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM, which outperforms state-of-the-art methods by a remarkable margin.