VideoMamba: State Space Model for Efficient Video Understanding

Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work adapts Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolutional neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for high-resolution, long-video understanding. Extensive evaluations reveal VideoMamba's four core abilities: (1) scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) sensitivity for recognizing short-term actions, even with fine-grained motion differences; (3) superiority in long-term video understanding, showcasing significant advancements over traditional feature-based models; and (4) compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark, offering a scalable and efficient solution for comprehensive video understanding. All code and models are available at https://github.com/OpenGVLab/VideoMamba.
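
To make the architectural idea concrete, the sketch below illustrates how a VideoMamba-style model might tokenize a video clip with a 3D patch embedding and process the flattened spatiotemporal token sequence with a linear-complexity mixer. This is a minimal illustration under assumptions: `ToySSMBlock` and `ToyVideoMamba`, along with all hyperparameters, are illustrative stand-ins rather than the released implementation (the gated depthwise-conv mixer merely plays the role of the Mamba/SSM block); see the repository above for the actual model.

```python
# Minimal sketch (not the official implementation) of a VideoMamba-style pipeline:
# 3D patch embedding -> flattened spatiotemporal tokens -> linear-complexity blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySSMBlock(nn.Module):
    """Stand-in for a Mamba/SSM block: a gated depthwise-conv mixer whose cost
    grows linearly with sequence length (unlike quadratic self-attention)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, L, C)
        residual = x
        x, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        # Causal depthwise convolution over the token sequence, truncated to length L.
        x = self.conv(x.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        return residual + self.out_proj(F.silu(x) * torch.sigmoid(gate))

class ToyVideoMamba(nn.Module):
    def __init__(self, dim=192, depth=4, num_classes=400,
                 frames=8, img_size=224, patch=16, tube=1):
        super().__init__()
        # 3D patch embedding: (B, 3, T, H, W) -> spatiotemporal tokens.
        self.embed = nn.Conv3d(3, dim, kernel_size=(tube, patch, patch),
                               stride=(tube, patch, patch))
        n_tokens = (frames // tube) * (img_size // patch) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, dim))
        self.blocks = nn.ModuleList(ToySSMBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                          # video: (B, 3, T, H, W)
        x = self.embed(video).flatten(2).transpose(1, 2)            # (B, L, C)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        for blk in self.blocks:
            x = blk(x)
        return self.head(x[:, 0])                      # classify from the [CLS] token

video = torch.randn(2, 3, 8, 224, 224)                 # 2 clips, 8 frames each
print(ToyVideoMamba()(video).shape)                    # torch.Size([2, 400])
```

Because the per-block cost scales with the number of tokens rather than its square, the same structure can in principle be extended to longer clips or higher resolutions without the quadratic blow-up of self-attention, which is the efficiency argument the abstract makes.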