8 months ago

Abstract

Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint prediction transformer models often overlook their cross-task dynamics and video-text alignment and refinement. Moreover, most models typically use limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal performance in capturing the interdependence between video and text modalities. Although large-language and vision-language models (LLM/LVLMs) have gained prominence across various domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware clip representations, and (iii) Uni-directional joint-task feedback mechanism enhancing both tasks through correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) leverage LVLMs like BLIP-2 for enhanced multimodal feature integration and intelligent pretraining using synthetic data generated from LVLMs. Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Codes and models are available at https://github.com/dpaul06/VideoLights .
