HyperAI


VideoMamba: State Space Model for Efficient Video Understanding

Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao

Abstract

Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolution neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for high-resolution long video understanding. Extensive evaluations reveal VideoMamba's four core abilities: (1) Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) Sensitivity for recognizing short-term actions even with fine-grained motion differences; (3) Superiority in long-term video understanding, showcasing significant advancements over traditional feature-based models; and (4) Compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark for video understanding, offering a scalable and efficient solution for comprehensive video understanding. All the code and models are available at https://github.com/OpenGVLab/VideoMamba.
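The "linear-complexity operator" the abstract refers to is the state space model (SSM) recurrence at the heart of Mamba: the hidden state is updated once per token, so processing a sequence of length L costs O(L) rather than the O(L²) of transformer self-attention. The sketch below is a minimal illustrative recurrence in NumPy, not VideoMamba's actual implementation (Mamba additionally makes the SSM parameters input-dependent and uses a hardware-aware parallel scan); the function name `ssm_scan` and the fixed matrices A, B, C are assumptions for illustration.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Sequential state-space scan:
        h_t = A h_{t-1} + B u_t    (state update)
        y_t = C . h_t              (readout)
    One pass over the sequence, so runtime is linear in length L.
    u: (L,) input sequence; A: (d, d); B, C: (d,) with state size d.
    """
    L, d = u.shape[0], A.shape[0]
    h = np.zeros(d)          # hidden state, constant size regardless of L
    y = np.empty(L)
    for t in range(L):
        h = A @ h + B * u[t]  # fold the new input into the state
        y[t] = C @ h          # project the state to a scalar output
    return y
```

Because the state `h` has a fixed size, memory stays constant in L as well, which is what makes this style of model attractive for long, high-resolution videos.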

