8 months ago

Multimodal Representation

Video Processing

Computer Vision

Gordeev Aleksandr ; Dokholyan Vladimir ; Tolstykh Irina ; Kuprashevich Maksim

Abstract

Existing approaches for video moment retrieval and highlight detection are not able to align text and video features efficiently, resulting in unsatisfying performance and limited production usage. To address this, we propose a novel architecture that utilizes recent foundational video models designed for such alignment. Combined with the introduced Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach significantly enhances performance in both moment retrieval and highlight detection tasks. For even better improvement, we developed InterVid-MR, a large-scale and high-quality dataset for pretraining. Using it, our architecture achieves state-of-the-art results on the QVHighlights, Charades-STA and TACoS benchmarks. The proposed approach provides an efficient and scalable solution for both zero-shot and fine-tuning scenarios in video-language tasks.
