New Method Accelerates Video Generation, Boosting A100/H100 Performance by Up to Two Times
Scientists have proposed a method, Draft Attention, for accelerating video diffusion models that achieves up to a 2x end-to-end speedup on A100 and H100 GPUs. The core innovation lies in reorganizing the hidden states of tokens as "videos" in a [t, h, w] format, where t is time and h and w are height and width. This reshaping allows queries and keys to be downsampled efficiently with average pooling, producing a low-resolution attention map. This draft attention map captures the key regions of the video while exposing redundant ones.

The team then uses the draft attention map to derive sparse masks that guide the full-resolution attention computation, retaining only the essential connections. In this way they substantially reduce the computational overhead of the attention module without compromising the quality of the generated content. Experiments showed that Draft Attention runs up to 2 times faster on A100 GPUs and up to 1.75 times faster on H100 GPUs while maintaining video generation quality.

This work offers an acceleration framework that requires no additional training and provides a new perspective for optimizing high-quality video generation. Its broad applicability makes it particularly valuable in scenarios where generation speed matters and computational resources are costly. Within the next two years, for instance, the team anticipates its use in video generation platforms, lowering the barrier for creators who use AI to produce high-quality videos. It can also be applied to large-scale video generation models, making them more responsive and user-friendly.

The team's theoretical analysis further shows that the error between this "draft" attention map and the original full-resolution attention map is controllable, and that the introduced sparsity has minimal impact on overall quality.

The work grew out of a limitation the team observed in earlier trials: the sparse attention patterns used by the existing method SVG are predefined and support only two sparsity strategies. This design can degrade video generation quality at higher sparsity levels, limiting its adaptability and effectiveness. To probe the issue, the team experimented with integrating max pooling into the open-source video generation model HunyuanVideo. Generating video with only 20% of the attention computation still produced visually appealing results, confirming their hypothesis that current attention computation contains substantial redundancy, even more than initially anticipated.

To push computational efficiency further, the researchers rethought how the sparsity-induced attention patterns are organized. They found that when pool_h × pool_w equals the attention kernel's block_size, the sparse pattern aligns naturally with existing high-efficiency attention frameworks. They therefore designed a reordering strategy that clusters the sparse blocks into contiguous memory layouts, allowing the attention computation to run more efficiently on GPUs. The entire process is illustrated below: the draft attention map (Draft Map) corresponds to a sparse, low-resolution version of the full attention map (Full Map); only after appropriate reordering (Reorder) do the sparse blocks coalesce into a reordered full map (Reordered Full Map), enabling efficient execution of the attention kernel.
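To make the pooling step concrete, here is a minimal PyTorch sketch of the idea described above, not the authors' released code: per-head queries and keys are reshaped into a [t, h, w] video layout and average-pooled so that every pool_h × pool_w patch collapses into one token, yielding a low-resolution draft attention map. The function name draft_attention_map and the default pool sizes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def draft_attention_map(q, k, t, h, w, pool_h=8, pool_w=8):
    """Minimal sketch (not the authors' code): build a low-resolution
    "draft" attention map by average-pooling queries and keys over
    spatial windows before computing attention scores."""
    d = q.shape[-1]
    # Reorganize token hidden states as a video: [t*h*w, d] -> [t, d, h, w]
    q_vid = q.reshape(t, h, w, d).permute(0, 3, 1, 2)
    k_vid = k.reshape(t, h, w, d).permute(0, 3, 1, 2)
    # Average-pool each frame so every (pool_h x pool_w) patch becomes one token
    q_low = F.avg_pool2d(q_vid, (pool_h, pool_w))  # [t, d, h/pool_h, w/pool_w]
    k_low = F.avg_pool2d(k_vid, (pool_h, pool_w))
    # Flatten back to token sequences of length t * (h/pool_h) * (w/pool_w)
    q_low = q_low.permute(0, 2, 3, 1).reshape(-1, d)
    k_low = k_low.permute(0, 2, 3, 1).reshape(-1, d)
    # Low-resolution attention map: each entry scores one block of the full map
    draft = torch.softmax(q_low @ k_low.T / d ** 0.5, dim=-1)
    return draft  # [n_blocks, n_blocks], n_blocks = t*h*w / (pool_h*pool_w)
```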
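In the same spirit, the following hedged sketch covers the two downstream steps: thresholding the draft map into a block-sparse mask, and reordering tokens so that the pool_h × pool_w tokens of each pooled patch sit contiguously in memory, which is what allows the mask to align with block-sparse attention kernels when pool_h × pool_w equals block_size. The helper names block_mask_from_draft and reorder_tokens and the keep_ratio parameter are hypothetical, chosen only to illustrate the mechanism.

```python
import torch

def block_mask_from_draft(draft, keep_ratio=0.2):
    """Sketch: keep only the strongest draft-map entries; each kept entry
    marks one block of the full-resolution attention map to compute."""
    n = draft.shape[-1]
    k = max(1, int(keep_ratio * n))
    topk = draft.topk(k, dim=-1).indices
    mask = torch.zeros_like(draft, dtype=torch.bool)
    mask.scatter_(-1, topk, True)
    return mask  # [n_blocks, n_blocks] boolean block mask

def reorder_tokens(x, t, h, w, pool_h=8, pool_w=8):
    """Sketch: permute tokens of shape [t*h*w, d] so that the pool_h*pool_w
    tokens of each pooled patch become contiguous; with pool_h*pool_w equal
    to the kernel's block_size, every mask entry then maps onto one
    contiguous block for a block-sparse attention kernel."""
    d = x.shape[-1]
    x = x.reshape(t, h // pool_h, pool_h, w // pool_w, pool_w, d)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, d)  # patches first, then intra-patch tokens
    return x
```

The contiguity produced by the reordering is the point: once each draft-map entry corresponds to one contiguous token block, existing block-sparse attention implementations can skip the masked-out blocks wholesale instead of scattering reads across memory.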
Overall, this research provides a valuable framework for improving the efficiency of large-scale, high-quality video generation models, paving the way for applications in virtual reality, gaming, and numerous other domains. It addresses the critical challenge of computational complexity, making high-quality video content more accessible and practical for a wide range of users and scenarios.