FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding

Text-guided Video Temporal Grounding (VTG) aims to localize relevant segments in untrimmed videos based on textual descriptions, encompassing two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). Although previous methods have achieved commendable results, retrieving short video moments remains challenging, primarily due to their reliance on sparse and limited decoder queries, which significantly constrains prediction accuracy. Furthermore, suboptimal outcomes often arise because previous methods rank each prediction in isolation, neglecting the broader video context. To tackle these issues, we introduce FlashVTG, a framework featuring a Temporal Feature Layering (TFL) module and an Adaptive Score Refinement (ASR) module. The TFL module replaces the traditional decoder structure to capture nuanced video content variations across multiple temporal scales, while the ASR module improves prediction ranking by integrating context from adjacent moments and multi-temporal-scale features. Extensive experiments demonstrate that FlashVTG achieves state-of-the-art performance on four widely adopted datasets in both MR and HD. Specifically, on the QVHighlights dataset, it boosts mAP by 5.8% for MR and 3.3% for HD. For short-moment retrieval, FlashVTG increases mAP to 125% of previous SOTA performance. All these improvements are made without adding training burdens, underscoring its effectiveness. Our code is available at https://github.com/Zhuo-Cao/FlashVTG.
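To make the two ideas in the abstract concrete, the sketch below illustrates them in simplified form: a multi-scale temporal feature pyramid (one plausible reading of "feature layering") built by repeated stride-2 average pooling over clip features, and a score refinement that blends each proposal's confidence with the mean of its temporal neighbors instead of ranking it in isolation. All function names, shapes, and the pooling/blending choices here are illustrative assumptions, not FlashVTG's actual TFL/ASR implementation.

```python
import numpy as np

def temporal_feature_pyramid(features: np.ndarray, num_scales: int = 3) -> list:
    """Build a multi-scale temporal pyramid from (T, D) clip features
    by repeated stride-2 average pooling along the time axis.
    Illustrative only; not the paper's TFL module."""
    pyramid = [features]
    cur = features
    for _ in range(num_scales - 1):
        if cur.shape[0] < 2:
            break
        if cur.shape[0] % 2:  # pad odd-length sequences by repeating the last clip
            cur = np.concatenate([cur, cur[-1:]], axis=0)
        cur = cur.reshape(-1, 2, cur.shape[1]).mean(axis=1)  # stride-2 avg pool
        pyramid.append(cur)
    return pyramid

def refine_with_context(scores: np.ndarray, window: int = 1,
                        alpha: float = 0.5) -> np.ndarray:
    """Blend each proposal score with the mean score of its neighbors
    within `window`, so ranking reflects local video context.
    Illustrative only; not the paper's ASR module."""
    refined = np.empty(len(scores), dtype=float)
    for i in range(len(scores)):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        ctx = np.concatenate([scores[lo:i], scores[i + 1:hi]])
        ctx_mean = ctx.mean() if ctx.size else scores[i]
        refined[i] = (1 - alpha) * scores[i] + alpha * ctx_mean
    return refined
```

For example, an 8-clip video yields pyramid levels of length 8, 4, and 2, and an isolated high score surrounded by low-scoring neighbors is pulled down toward the local context.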