VMoBA: Mixture-of-Block Attention for Video Diffusion Models

Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong
Abstract

The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-resolution video generation.
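
To make the global and threshold-based block selection described in the abstract more concrete, below is a minimal Python (PyTorch) sketch, not the authors' code: each query scores mean-pooled key blocks, the scores are ranked globally across the head, and blocks are kept until their cumulative normalized similarity reaches a threshold tau. The tensor shapes, the mean-pooling choice, and the helper name select_blocks are illustrative assumptions, not details taken from the paper.

import torch

def select_blocks(q, k, block_size, tau=0.9):
    """q, k: (seq_len, dim). Returns a boolean mask of shape (num_queries, num_blocks)
    marking which key blocks each query attends to (illustrative assumption)."""
    seq_len, dim = k.shape
    num_blocks = seq_len // block_size
    # Representative key for each block: mean pooling over the block.
    k_blocks = k[: num_blocks * block_size].reshape(num_blocks, block_size, dim).mean(dim=1)
    # Query-to-block similarity, softmax-normalized so scores are comparable across queries.
    scores = torch.softmax(q @ k_blocks.T / dim ** 0.5, dim=-1)  # (num_queries, num_blocks)
    # Global selection: rank all (query, block) pairs across the whole attention head.
    flat = scores.flatten()
    order = flat.argsort(descending=True)
    # Threshold selection: keep top pairs until their cumulative score reaches tau of the total.
    cum = flat[order].cumsum(0) / flat.sum()
    keep = order[: int((cum < tau).sum()) + 1]
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[keep] = True
    return mask.reshape(scores.shape)

# Example usage: 1024 tokens, block size 64 -> 16 key blocks per query.
q = torch.randn(1024, 64)
k = torch.randn(1024, 64)
mask = select_blocks(q, k, block_size=64, tau=0.9)
print(mask.shape, mask.float().mean())  # fraction of (query, block) pairs actually attended

The layer-wise recurrent 1D-2D-3D partition mentioned in point (1) would change only how key blocks are formed (temporal, spatial, or spatio-temporal grouping, cycled across layers); the selection step above would remain the same.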