HyperAIHyperAI

Command Palette

Search for a command to run...

VideoMLA : Cache KV latent de bas rang pour la diffusion vidéo autoregressive à l'échelle de la minute

Hidir Yesiltepe Jiazhen Hu Tuna Han Salih Meral Adil Kaan Akan Kaan Oktay Hoda Eldardiry Pinar Yanardag

Résumé

La diffusion vidéo causale long-rollout a convergé vers un cache KV à fenêtre glissante de taille fixe, les avancées récentes innovant au sein de cette disposition en modifiant les tokens qui occupent la fenêtre ou la manière dont leurs positions sont encodées. La disposition KV par tête elle-même, principal contributeur à la mémoire de streaming et à la latence, a pour l'essentiel été laissée inchangée. Dans cet article, nous présentons la première étude de l'Attention Latente Multi-Têtes (MLA) appliquée à la diffusion vidéo. VideoMLA remplace les clés et valeurs par tête par un latent de contenu à faible rang partagé et une clé positionnelle 3D-RoPE découplée et partagée, réduisant la mémoire KV par token de 92,7 % à chaque couche mise en cache. Nous investiguons par ailleurs pourquoi MLA réussit dans le contexte de la diffusion vidéo, bien que l'hypothèse spectrale souvent invoquée pour la justifier dans les modèles de langage ne s'y vérifie pas : l'attention vidéo pré-entraînée n'est pas à faible rang, son rang effectif à 99 % d'énergie se situant bien au-delà de toute dimension latente praticable. VideoMLA préserve la qualité à des taux de compression pour lesquels une approximation spectrale directe prédirait une erreur de reconstruction importante. Nous démontrons que le goulot d'étranglement de MLA, plutôt que le spectre pré-entraîné, détermine le rang effectif : tant l'initialisation spectrale que l'initialisation aléatoire occupent quasi-intégralement le budget de rang dès l'initialisation, et l'entraînement préserve ce budget tout en s'y adaptant. Sur VBench, VideoMLA égale les méthodes de référence de diffusion vidéo en streaming à court horizon, obtient le meilleur score global à long horizon parmi les méthodes évaluées, et améliore le débit de 1,23x sur un seul B200.

One-sentence Summary

VideoMLA integrates Multi-Head Latent Attention into autoregressive video diffusion by replacing per-head key-value caches with a shared low-rank content latent and a decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer while preserving generation quality despite the pretrained attention spectrum exceeding practical latent dimensions.

Key Contributions

  • The paper introduces VideoMLA, the first adaptation of Multi-Head Latent Attention to long-rollout causal video diffusion, which replaces dense per-head keys and values with a shared low-rank content latent and a decoupled 3D-RoPE positional key.
  • Theoretical analysis demonstrates that the MLA bottleneck, rather than the pretrained attention spectrum, determines the effective rank and enables the model to preserve generation quality at high compression ratios despite the pretrained video attention exceeding practical latent dimensions.
  • Empirical evaluations show that VideoMLA reduces per-token key-value memory by 92.7% across every cached layer while maintaining generation fidelity at compression levels where direct spectral approximation would predict significant reconstruction error.

Introduction

Autoregressive video diffusion models convert bidirectional teachers into streaming students that generate frames sequentially using a rolling key-value cache, enabling efficient long-horizon video synthesis. Prior methods improve generation stability or reduce compute by optimizing cache windows or replacing attention mechanisms, yet they either preserve dense per-head key-value layouts or skip token-level compression across cached layers. The authors leverage Multi-Head Latent Attention to bridge this gap, adapting the architecture to video diffusion where memory profiles and attention spectra differ significantly from language models. By replacing traditional caches with a shared low-rank latent representation, they substantially reduce memory overhead without sacrificing streaming generation quality.

Dataset

  • Dataset composition and sources: The authors do not provide dataset composition or source information in the submitted excerpt.
  • Key details for each subset: No subset sizes, origins, or filtering criteria are detailed in the text.
  • How the paper uses the data: The authors do not specify training splits, mixture ratios, or how the data integrates into the model.
  • Processing details: The excerpt omits any cropping strategies, metadata construction, or additional preprocessing steps.

Method

The authors present VideoMLA, a novel approach to reducing the per-token key-value (KV) cache memory in causal video diffusion models by rethinking the per-head KV layout. The framework replaces dense, per-head keys and values with a shared low-rank content latent and a head-shared decoupled 3D-RoPE positional key, significantly reducing memory footprint. The core of the design is a two-part decomposition of the attention mechanism. Each video latent token xtx_txt is first compressed into a shared content latent ctKVRdcc_t^{KV} \in \mathbb{R}^{d_c}ctKVRdc via a joint down-projection WKVW_{\downarrow}^{KV}WKV, which is written to the compressed KV cache. This latent represents the content of the token. Positional information is decoupled and stored separately. For each head hhh, the per-head key kt,hnopek_{t,h}^{\text{nope}}kt,hnope and value vt,hv_{t,h}vt,h are reconstructed from ctKVc_t^{KV}ctKV using head-specific up-projections W,hKW_{\uparrow,h}^KW,hK and W,hVW_{\uparrow,h}^VW,hV. This shared latent structure means a single cache read produces all per-head keys and values, eliminating the need to store dense per-head states for each token. The positional information is handled by a separate RoPE branch. A single, head-shared positional key ktRk_t^RktR is computed from xtx_txt via WRKW_R^KWRK and stored in the cache. At attention time, this unrotated key is rotated by 3D-RoPE to produce ktropek_t^{\text{rope}}ktrope, which is used in the attention score calculation. The query path follows a similar structure, with a query latent ctQc_t^QctQ derived from xtx_txt, and a head-specific positional query qt,hRq_{t,h}^Rqt,hR computed from ctQc_t^QctQ and rotated. The attention score for head hhh combines the content-based score from the inner product of qt,hnopeq_{t,h}^{\text{nope}}qt,hnope and kt,hnopek_{t,h}^{\text{nope}}kt,hnope with the positional score from qt,hropeq_{t,h}^{\text{rope}}qt,hrope and ktropek_t^{\text{rope}}ktrope. This design reduces the per-token cache size from 2nhdh2n_h d_h2nhdh scalars to dc+dhroped_c + d_h^{\text{rope}}dc+dhrope. As shown in the figure below, this results in a significant reduction in memory and a more efficient inference process, as the cache stores a compressed content latent and a shared positional key, with the per-head components reconstructed only when needed for attention computation.

Experiment

The evaluation combines fixed-memory batch scaling tests with a human perceptual study to assess prompt adherence, temporal coherence, and motion consistency in generated videos. Ablation experiments validate that VideoMLA effectively converts cache compression into substantial serving headroom while establishing a clear quality-efficiency trade-off in latent dimension selection. The findings demonstrate that moderate compression preserves essential visual details and that balancing cached content with positional encoding channels optimally supports streaming video generation. These results confirm the method's practical deployment potential, with future work targeting extended horizons and higher resolutions.

The authors analyze the memory and computational complexity of different attention mechanisms, showing that MLA Local reduces memory and computation compared to causal full and causal linear variants by leveraging a compressed latent representation and a split attention design. Results indicate that the proposed approach achieves significant reductions in both metrics while maintaining efficiency and performance. MLA Local reduces memory and computational complexity compared to causal full and causal linear attention mechanisms. The approach achieves lower memory and computation by using a compressed latent representation and a split attention design. The design allows for efficient scaling while maintaining performance through optimized resource allocation.

The authors conduct ablation studies to analyze the impact of key hyperparameters on VideoMLA's performance, focusing on memory efficiency and quality trade-offs. Results show that reducing the KV cache dimension significantly improves memory scalability while maintaining acceptable quality, with optimal performance achieved through a balanced allocation of channels between cached content and positional encoding. The model is trained in three stages using a combination of teacher forcing, consistency distillation, and distribution matching distillation. Reducing the KV cache dimension leads to substantial memory savings and enables larger batch sizes without exceeding memory limits. A balanced allocation of channels between cached content and positional encoding yields the best performance, with a dedicated RoPE subspace improving temporal and spatial coherence. The training pipeline involves multiple stages, including teacher forcing, consistency distillation, and distribution matching distillation, to optimize the model for efficient inference.

The authors compare VideoMLA against other autoregressive models in terms of parameters, resolution, throughput, latency, and performance on CLIP and HPSv3 metrics. Results show that VideoMLA achieves higher throughput and lower latency compared to frame-wise models, while maintaining competitive or superior performance on quality metrics. VideoMLA achieves higher throughput and lower latency than frame-wise autoregressive models. VideoMLA outperforms other chunk-wise autoregressive models in CLIP-T and HPSv3 scores. VideoMLA demonstrates a balance between efficiency and quality, achieving competitive performance with reduced latency.

The authors conduct ablation studies to evaluate the impact of latent dimension and positional encoding configurations on model performance. Results show that increasing the latent dimension improves quality but reduces memory efficiency, while the balance between positional and content channels significantly affects both semantic fidelity and temporal coherence. The optimal configuration achieves a trade-off between memory savings and video quality. Increasing the latent dimension improves quality but reduces memory efficiency, with diminishing returns at higher values. The split between positional and content channels affects semantic fidelity and temporal coherence, with an optimal balance favoring content channels. The best-performing configuration achieves a balance between memory savings and video quality by optimizing both latent dimension and channel split.

The authors compare their VideoMLA model against several baselines across multiple metrics, including quality and consistency scores. Results show that VideoMLA achieves competitive or superior performance on most evaluation criteria, particularly in long-horizon generation and perceptual quality, while maintaining efficient memory usage through latent compression. VideoMLA outperforms or matches existing methods on key quality and consistency metrics across different generation lengths. The model achieves high perceptual quality in user studies, with strong ratings in prompt adherence and motion consistency. VideoMLA maintains strong performance while significantly reducing memory requirements through efficient latent compression.

The experimental evaluation assesses the memory efficiency, computational complexity, and generation quality of VideoMLA through ablation studies and comparative benchmarks against existing autoregressive models. Results validate that compressed latent representations and split attention designs substantially reduce resource consumption while preserving core model capabilities. Hyperparameter analyses further demonstrate that balancing cached content with positional encoding and optimizing latent dimensions achieves an ideal trade-off between memory savings and video fidelity. Ultimately, the approach delivers competitive generation quality and temporal consistency alongside significantly reduced latency compared to conventional frame-wise baselines.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp