AdaCache Accelerates Video Generation

AdaCache is a technique proposed by Meta in 2024 to accelerate AI video generation, built around an adaptive caching mechanism. It was introduced in the paper "Adaptive Caching for Faster Video Generation with Diffusion Transformers". AdaCache optimizes the allocation of computing resources by dynamically adjusting the amount of computation to the complexity of the video content, cutting unnecessary overhead. It also introduces a motion regularization strategy that uses the motion information in the video to further refine caching decisions. Experiments show that AdaCache significantly improves generation speed while maintaining video quality, with notable gains in multi-GPU environments, giving it strong application value and development prospects in the field of video generation.

Specifically, AdaCache requires no training and can be integrated into a baseline video diffusion transformer as a plug-and-play component at inference time. The core idea is to cache the residual computations inside the transformer blocks (such as attention or multi-layer perceptron outputs) at a particular diffusion step and reuse those cached results across several subsequent steps, depending on the video being generated. The research team achieves this by formulating a caching schedule: whenever a residual is recomputed, the method decides when it should next be recomputed. This decision is guided by a distance metric that measures the rate of change between the previously cached representation and the current one. If the distance is large, the result is not cached for long (i.e., reused for only a few steps), so that incompatible representations are not reused.
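To make the schedule concrete, here is a minimal PyTorch sketch of a per-block caching decision, assuming an L1 distance over residual features and a hand-picked distance-to-lifespan mapping. The names (`CachedResidual`, `distance_to_lifespan`) and the thresholds are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def feature_distance(prev: torch.Tensor, curr: torch.Tensor) -> float:
    """L1 rate of change between the cached and the current residual features."""
    return (curr - prev).abs().mean().item()

def distance_to_lifespan(dist: float, thresholds=(0.05, 0.10, 0.20)) -> int:
    """Map the measured distance to a cache lifespan in diffusion steps:
    small change -> reuse the cached residual for many steps,
    large change -> recompute again soon."""
    if dist < thresholds[0]:
        return 6
    if dist < thresholds[1]:
        return 4
    if dist < thresholds[2]:
        return 2
    return 1  # rapidly changing representation: recompute at the next step

class CachedResidual:
    """Wraps one residual computation (e.g. an attention or MLP branch)."""

    def __init__(self):
        self.value = None        # cached residual output
        self.next_recompute = 0  # diffusion step at which to recompute

    def __call__(self, step: int, x: torch.Tensor, compute_fn) -> torch.Tensor:
        if self.value is None or step >= self.next_recompute:
            residual = compute_fn(x)  # full computation at this step
            if self.value is not None:
                # Decide, from the rate of change, when to recompute next.
                dist = feature_distance(self.value, residual)
                self.next_recompute = step + distance_to_lifespan(dist)
            else:
                self.next_recompute = step + 1  # measure change next step
            self.value = residual
        # On skipped steps, the cached residual is reused as-is.
        return x + self.value
```

In this sketch the decision is purely local to each block and each step, which is what makes the mechanism training-free: no learned component is needed, only the distance threshold schedule.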

The researchers further introduce Motion Regularization (MoReg) to distribute computation according to the motion content of the video being generated, inspired by the observation that high-motion sequences require more diffusion steps to reach reasonable quality.
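As a rough sketch of this idea, one could estimate motion from frame-to-frame differences in the latent video and shorten cache lifespans when motion is high. The scoring rule and the `motion_scale` parameter below are illustrative assumptions, not the paper's exact regularizer:

```python
import torch

def motion_score(latents: torch.Tensor) -> float:
    """latents: (frames, channels, height, width). Mean absolute
    frame-to-frame difference as a crude motion estimate."""
    return (latents[1:] - latents[:-1]).abs().mean().item()

def regularized_lifespan(base_lifespan: int, motion: float,
                         motion_scale: float = 4.0) -> int:
    """High motion -> recompute more often (shorter cache lifespan)."""
    factor = 1.0 / (1.0 + motion_scale * motion)
    return max(1, int(round(base_lifespan * factor)))

# Example: a static clip keeps the full lifespan; a dynamic clip recomputes sooner.
static_clip = torch.zeros(8, 4, 16, 16)
dynamic_clip = torch.randn(8, 4, 16, 16)
print(regularized_lifespan(6, motion_score(static_clip)))   # -> 6
print(regularized_lifespan(6, motion_score(dynamic_clip)))  # -> smaller, e.g. 1
```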

Overall, applied across multiple video diffusion transformer baselines, the pipeline delivers faster inference without sacrificing generation quality.