Quant VideoGen: Auto-regressive Long Video Generation via 2-Bit KV-Cache Quantization

Abstract

Despite rapid progress in autoregressive video diffusion models, an emerging system-algorithm bottleneck, KV-cache memory, continues to limit both deployability and generation capability: in autoregressive video generation models, the KV cache grows with the generation history and quickly dominates GPU memory, often exceeding 30 GB, which makes deployment on widely available hardware infeasible. More critically, limited KV-cache budgets constrain the effective working memory and thus directly degrade long-horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training-free approach to KV-cache quantization for autoregressive video diffusion models. QVG exploits the spatiotemporal redundancy of video through Semantic-Aware Smoothing to produce low-magnitude, quantization-friendly residuals. It further introduces Progressive Residual Quantization, a multi-stage coarse-to-fine strategy that reduces quantization error while enabling a smooth quality-memory trade-off. On the LongCat-Video, HY-WorldPlay, and Self-Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency: KV-cache memory is reduced by up to 7.0× with less than 4% end-to-end latency overhead, while QVG consistently outperforms existing baselines in generation quality.

One-sentence Summary

Researchers from MIT, UC Berkeley, and Tsinghua propose Quant VideoGen (QVG), a training-free KV-cache quantization method that leverages spatiotemporal redundancy and progressive residual quantization to cut memory use by 7× while preserving video consistency and quality across long-horizon generation tasks.

Key Contributions

  • Auto-regressive video diffusion models face a critical KV-cache memory bottleneck that limits deployment on consumer hardware and degrades long-horizon consistency in identity, layout, and motion due to forced memory budgeting.
  • Quant VideoGen (QVG) introduces a training-free quantization framework leveraging Semantic-Aware Smoothing and Progressive Residual Quantization to exploit spatiotemporal redundancy, producing low-magnitude, quantization-friendly residuals with coarse-to-fine error reduction.
  • Evaluated on LongCat-Video, HY-WorldPlay, and Self-Forcing, QVG reduces KV memory up to 7.0× with <4% latency overhead, enables HY-WorldPlay-8B to run on a single RTX 4090, and achieves higher PSNR than baselines under constrained memory.

Introduction

The authors leverage auto-regressive video diffusion models to enable long-horizon video generation, which is critical for applications like live streaming, interactive content, and world modeling. However, these models face a severe memory bottleneck: the KV-cache grows linearly with video length and quickly exceeds GPU capacity, forcing short context windows that degrade consistency in identity, motion, and layout. Prior KV-cache quantization methods from LLMs fail on video due to its heterogeneous activation statistics and lack of spatiotemporal awareness. Their main contribution, Quant VideoGen (QVG), is a training-free framework that exploits video’s spatiotemporal redundancy via Semantic-Aware Smoothing—grouping similar tokens and subtracting centroids to create low-magnitude residuals—and Progressive Residual Quantization, a multi-stage compression scheme that refines quantization error. QVG reduces KV-cache memory by up to 7x with under 4% latency overhead, enabling high-quality, minute-long generation on consumer GPUs and setting a new quality-memory Pareto frontier.

Method

The authors leverage a two-stage quantization framework—Semantic-Aware Smoothing followed by Progressive Residual Quantization—to address the challenges of quantizing video KV-cache, which exhibits both high dynamic range and spatiotemporal redundancy. The overall pipeline is designed to progressively reduce quantization error by exploiting semantic similarity and temporal structure inherent in video tokens.

The process begins with Semantic-Aware Smoothing, which operates on chunks of tokens (e.g., $N = H W T_c$ tokens per chunk) extracted from the KV-cache tensor $\mathbf{X} \in \mathbb{R}^{N \times d}$. The authors apply $k$-means clustering to partition tokens into $C$ disjoint groups $\mathcal{G} = \{\mathcal{G}_1, \ldots, \mathcal{G}_C\}$ based on their hidden representations. Each group's centroid $\mathbf{C}_i \in \mathbb{R}^d$ is computed as the mean of its members. The residual for each group is then derived via centroid subtraction:

$$\mathbf{R}_i = \mathbf{X}_{\mathcal{G}_i} - \mathbf{C}_i, \quad \mathbf{R}_i \in \mathbb{R}^{|\mathcal{G}_i| \times d}$$

This step effectively reduces the dynamic range within each group, as large outlier values are captured in the centroids and subtracted out. The result is a residual tensor $\mathbf{R}$ with significantly lower maximum magnitude, which directly reduces quantization error since $\mathbb{E}[\|x - \hat{x}\|] \propto S_X$, and $S_X$ is proportional to the maximum absolute value in the group.
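
The sketch below illustrates this smoothing step on a single chunk. The function name, the use of scikit-learn's KMeans, and the toy shapes are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of Semantic-Aware Smoothing on one KV-cache chunk.
# The helper name and the use of scikit-learn's KMeans are assumptions
# for illustration; the paper's implementation may differ.
import numpy as np
from sklearn.cluster import KMeans

def semantic_aware_smoothing(X: np.ndarray, num_groups: int = 8):
    """Cluster tokens, subtract group centroids, return low-magnitude residuals.

    X: (N, d) chunk of keys or values, with N = H * W * T_c tokens.
    Returns residuals R (N, d), centroids (num_groups, d), assignments (N,).
    """
    km = KMeans(n_clusters=num_groups, n_init=4, random_state=0).fit(X)
    centroids = km.cluster_centers_           # C_i: mean of each group
    assignments = km.labels_                  # pi: group index per token
    residuals = X - centroids[assignments]    # R_i = X_{G_i} - C_i
    return residuals, centroids, assignments

# Toy usage: tokens drawn around a few "semantic" centers; after centroid
# subtraction the residual dynamic range shrinks relative to the raw chunk.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(8, 64))
X = (centers[rng.integers(0, 8, size=1024)]
     + rng.normal(scale=0.5, size=(1024, 64))).astype(np.float32)
R, C, pi = semantic_aware_smoothing(X)
print(np.abs(X).max(), np.abs(R).max())  # residual max magnitude is much smaller
```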

Refer to the framework diagram, which illustrates how the original KV-cache (a) is transformed through semantic grouping and centroid subtraction (b) into a smoother residual distribution, enabling more accurate low-bit quantization.

Building on this, Progressive Residual Quantization iteratively refines the residual tensor across $T$ stages. Starting with $\mathbf{R}^{(0)} = \mathbf{X}$, each stage applies Semantic-Aware Smoothing to the current residual to produce a new residual $\mathbf{R}^{(t)}$, centroids $\mathbf{C}^{(t)}$, and assignment vector $\boldsymbol{\pi}^{(t)}$. After $T$ stages, the final residual $\mathbf{R}^{(T)}$ is quantized using symmetric per-group integer quantization:

$$\mathbf{X}_{\mathrm{INT}},\ S_X = Q\left(\mathbf{R}^{(T)}\right)$$

The centroids and assignment vectors from all stages are stored in global memory, while intermediate residuals are discarded. During dequantization, the process is reversed: the quantized residual is dequantized and then iteratively reconstructed by adding back the assigned centroids from stage $T$ down to stage 1, yielding the final reconstructed tensor $\hat{\mathbf{X}}^{(0)}$.
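
The following sketch shows how the multi-stage smoothing, symmetric per-group quantization, and centroid-based reconstruction could fit together. The function names, the 2-bit and group-size settings, and the int8 storage of codes (a real kernel would pack 2-bit values) are assumptions for illustration, not the authors' exact configuration.

```python
# Compressed sketch of Progressive Residual Quantization (PRQ):
# T smoothing stages followed by symmetric per-group INT quantization.
import numpy as np
from sklearn.cluster import KMeans

def smooth(X, num_groups=8):
    km = KMeans(n_clusters=num_groups, n_init=2, random_state=0).fit(X)
    C, pi = km.cluster_centers_, km.labels_
    return X - C[pi], C, pi

def quantize_symmetric(R, bits=2, group_size=16):
    """Per-group symmetric quantization: scale = max|r| / qmax in each group."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 1 for INT2
    G = R.reshape(-1, group_size)                    # flatten into groups
    scale = np.abs(G).max(axis=1, keepdims=True) / qmax + 1e-8
    q = np.clip(np.round(G / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                                  # int8 here; packed 2-bit in practice

def dequantize(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

def prq_compress(X, T=2):
    R, stages = X, []
    for _ in range(T):                               # coarse-to-fine smoothing stages
        R, C, pi = smooth(R)
        stages.append((C, pi))
    q, scale = quantize_symmetric(R)                 # quantize the final residual R^(T)
    return q, scale, stages

def prq_decompress(q, scale, stages, shape):
    X_hat = dequantize(q, scale, shape)
    for C, pi in reversed(stages):                   # add centroids back, stage T .. 1
        X_hat = X_hat + C[pi]
    return X_hat

# Toy round trip: reconstruction error is dominated by the quantized residual.
X = np.random.randn(1024, 64).astype(np.float32)
q, s, st = prq_compress(X)
X_hat = prq_decompress(q, s, st, X.shape)
print(np.mean((X - X_hat) ** 2))
```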

This multi-stage approach allows the model to capture coarse semantic structure in early stages and fine-grained variations in later stages, leading to diminishing but cumulative reductions in quantization error. As shown in the figure, the quantization error drops from roughly $10^{2}$ for the original cache to roughly $10^{-1}$ in the final compressed representation, demonstrating the efficacy of the progressive refinement.

To support efficient deployment, the authors introduce algorithm-system co-design optimizations. They accelerate $k$-means by caching centroids from prior chunks, reducing clustering overhead by 3×. Additionally, they implement a fused dequantization kernel that reconstructs the full tensor by adding back centroids across all stages while keeping intermediate results in registers to minimize global memory access.
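
One way to picture the centroid-caching idea is a warm-started k-means, as sketched below: centroids fitted on the previous chunk seed clustering of the current chunk, so only a few refinement iterations are needed. The scikit-learn `init=` warm start and the small iteration budget are illustrative stand-ins for the authors' optimized implementation, not a description of it.

```python
# Sketch: reuse the previous chunk's centroids to initialize k-means on the
# next chunk, reducing clustering overhead across an auto-regressive rollout.
import numpy as np
from sklearn.cluster import KMeans

def smooth_with_cache(X, cached_centroids=None, num_groups=8):
    if cached_centroids is None:
        km = KMeans(n_clusters=num_groups, n_init=4, random_state=0).fit(X)
    else:
        # Warm start: a few Lloyd iterations from the cached centroids.
        km = KMeans(n_clusters=num_groups, init=cached_centroids,
                    n_init=1, max_iter=3).fit(X)
    C, pi = km.cluster_centers_, km.labels_
    return X - C[pi], C, pi

# Process consecutive chunks, carrying the centroid cache forward.
cache = None
for chunk in np.split(np.random.randn(4096, 64).astype(np.float32), 4):
    residual, cache, assignments = smooth_with_cache(chunk, cache)
```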

Experiment

  • QVG and QVG-Pro significantly reduce KV-cache memory usage (up to 7x compression) while preserving video fidelity and perceptual quality across LongCat-Video-13B, HY-WorldPlay-8B, and Self-Forcing-Wan models.
  • Both variants maintain near-lossless performance on VBench metrics (Background, Subject, Image, and Aesthetic Quality), outperforming baselines like RTN, KIVI, and QuaRot, especially under INT2 quantization.
  • QVG effectively mitigates long-horizon drift, sustaining stable image quality beyond 700 frames in Self-Forcing, whereas baselines degrade sharply after ~100 frames.
  • End-to-end latency overhead is minimal (1.5%–4.3% across models), confirming QVG does not impede generation speed.
  • Progressive Residual Quantization’s first stage delivers the largest MSE reduction; subsequent stages offer diminishing returns.
  • Larger quantization block sizes (e.g., 64) improve compression but reduce quality, while smaller blocks (e.g., 16) preserve quality at the cost of lower compression.

The authors use QVG and QVG-Pro to compress the KV cache in video generation models, achieving high compression ratios while preserving perceptual quality across multiple metrics. Results show that QVG-Pro delivers the highest fidelity scores, while QVG offers the largest memory savings with only minor quality trade-offs, outperforming all baselines. Both methods maintain near-lossless performance over long video sequences, effectively mitigating drift without introducing significant latency.

