HyperAIHyperAI

Command Palette

Search for a command to run...

LongLive-2.0 : Une infrastructure parallèle NVFP4 pour la génération de vidéos longues

Résumé

Nous présentons LongLive-2.0, une infrastructure parallèle basée sur NVFP4 couvrant l’ensemble du processus d’entraînement et d’inférence de la génération de vidéos longues, afin de résoudre les goulots d’étranglement liés à la vitesse et à la mémoire. Pour l’entraînement, nous introduisons l’entraînement autoregressif (AR) en parallèle de séquence, instancié sous la forme de Balanced SP, qui co-conçoit la disposition efficace du forçage enseignant (teacher-forcing) avec l’exécution SP en associant des blocs temporels d’historique propre et de cible bruitée sur chaque rang, permettant ainsi un masque de forçage enseignant naturel avec un encodage VAE par blocs conscient de SP. Combinée à la précision NVFP4, cette approche réduit le coût mémoire des GPU et accélère le calcul GEMM pendant l’entraînement, dont la proportion augmente avec la longueur de la vidéo. De plus, nous montrons qu’une infrastructure et un jeu de données de haute qualité permettent un pipeline d’entraînement remarquablement épuré. Contrairement aux méthodes existantes de la série Self-Forcing qui reposent sur une initialisation par équations différentielles ordinaires (ODE) et une distillation par appariement de distributions ultérieure (DMD), LongLive-2.0 affine directement un modèle de diffusion en un modèle de diffusion autoregressif (AR) long, multi-plans et interactif. Il peut être converti en génération en temps réel (de 4 à 2 étapes de débruitage) à l’aide de poids LoRA autonomes. Pour l’inférence sur les GPU Blackwell, nous activons l’inférence NVFP4 W4A4, quantisons le cache KV en NVFP4 pour économiser de la mémoire, et augmentons le débit de bout en bout grâce au décodage VAE en streaming asynchrone. Sur les architectures GPU non Blackwell, nous déployons l’inférence SP pour atteindre une vitesse comparable à celle des GPU Blackwell, tandis que le cache KV quantifié peut réduire les communications inter-GPU de SP. Les expériences montrent une accélération allant jusqu’à 2,15 fois pour l’entraînement et 1,84 fois pour l’inférence. LongLive-2.0-5B atteint 45,7 FPS en inférence tout en obtenant des performances solides sur les benchmarks. À notre connaissance, LongLive-2.0 est le premier système d’entraînement et d’inférence NVFP4 pour la génération de vidéos longues.

One-sentence Summary

LongLive-2.0 presents an NVFP4-based parallel infrastructure that accelerates long video generation by combining sequence-parallel autoregressive training and W4A4 NVFP4 inference to directly convert diffusion models into interactive autoregressive systems without ODE initialization or distillation, achieving up to 2.15× training and 1.84× inference speedups while enabling the 5B variant to reach 45.7 FPS.

Key Contributions

  • The paper introduces Balanced SP, a sequence-parallel autoregressive training framework that co-designs teacher-forcing layouts with parallel execution by pairing clean-history and noisy-target temporal chunks per rank. This architecture enables SP-aware chunked VAE encoding and directly fine-tunes a diffusion model into a multi-shot interactive autoregressive system without relying on ODE initialization or distribution matching distillation.
  • The system establishes an end-to-end W4A4 NVFP4 inference pipeline that compresses the KV cache into NVFP4 and integrates asynchronous streaming VAE decoding to maximize throughput on Blackwell GPUs. It extends sequence-parallel inference to non-Blackwell architectures to maintain generation speed while reducing inter-GPU communication overhead.
  • Experimental evaluations demonstrate up to 2.15x training acceleration and 1.84x inference speedup, with the LongLive-2.0-5B model achieving 45.7 FPS and strong performance across standard benchmarks. The framework further enables real-time generation by converting the trained model to two to four denoising steps using standalone LoRA adapters.

Introduction

Causal autoregressive synthesis has become the standard for streaming long video generation, offering scalable frame-by-frame creation with real-time potential. Despite advances in mitigating exposure bias and temporal drift, prior methods face persistent bottlenecks in memory management, cache overhead, and a critical mismatch between training precision and deployment efficiency. While low-bit formats like FP4 have successfully compressed large language models, they remain largely untested for video diffusion, where extended spatio-temporal sequences, repeated denoising cycles, and growing key-value caches demand strict precision alignment. The authors leverage a unified NVFP4 quantization framework to resolve these bottlenecks, jointly stabilizing training, enabling weight-and-activation 4-bit inference, compressing key-value cache storage, and streamlining long-video deployment.

Dataset

  • Dataset Composition and Sources

    • The authors curate a large-scale long-video dataset from raw footage to train LongLive-2.0.
    • The final collection contains 120,000 videos, each segmented into independent shots.
  • Subset Details and Distribution

    • The dataset is evenly divided into three duration groups: 16 to 32 seconds, 32 to 64 seconds, and over 64 seconds, with each category representing one-third of the total volume.
  • Filtering and Quality Control

    • The authors remove samples exhibiting excessively short shots, logos, watermarks, prominent text, severe camera shake, abnormal playback speeds, exposure issues, blur, and low-motion clips with frozen frames or minimal zoom.
    • Visual quality is assessed using the MANIQA metric, where the average score across sampled frames determines the overall rating. Only the highest-ranked videos are retained.
  • Metadata Construction and Processing

    • Each shot receives structured captions covering visual elements, scene context, characters, actions, and cinematography.
    • The authors merge captions from all shots within a single video and refine the combined text to ensure temporal coherence and logical consistency across consecutive frames and scenes.
  • Model Usage

    • The curated dataset serves as the primary training resource for LongLive-2.0, with the authors leveraging the shot-level annotations and temporally aligned long-form descriptions to optimize model performance.

Method

The LongLive-2.0 framework presents a co-designed infrastructure for efficient long video generation, integrating a novel training methodology with a parallel inference system. The core of the training process is sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the data layout with the sequence-parallel execution to address memory and computational bottlenecks. This approach ensures that each GPU rank is responsible for both clean and noisy latent tokens from the same temporal chunk, balancing the loss-bearing workload across devices and enabling a natural teacher-forcing attention mask. This paired layout is applied consistently across VAE encoding, latent construction, and loss computation, eliminating the need for replicated VAE preparation and ensuring that the sequence sharding is aligned with the DiT's attention mechanism. The training process is further accelerated by NVFP4 precision, which reduces memory footprint and speeds up GEMM operations, particularly as video length increases. The authors leverage this infrastructure to directly fine-tune a bidirectional diffusion model into a long, interactive, multi-shot AR model, bypassing the complex multi-stage processes of prior methods. The resulting model can be converted to real-time generation with few-step denoising using standalone LoRA weights, which are derived through a simplified DMD distillation process that optimizes only the LoRA adapters.

For inference, LongLive-2.0 employs a multi-faceted strategy to achieve high throughput and low latency. On Blackwell GPUs, the system enables W4A4 NVFP4 inference, quantizing both the model weights and the key-value (KV) cache to NVFP4, which significantly reduces memory usage and accelerates computation. The KV cache is quantized at the frame-chunk level, with each chunk containing eight frames, and a customized parallel CUDA dequantization kernel is used to reconstruct the cache for efficient in-window attention. To further improve throughput, the framework implements asynchronous streaming VAE decoding. This heterogeneous pipeline dedicates one GPU to VAE decoding, which runs concurrently with the DiT inference cluster, effectively hiding the decoding latency behind the dominant DiT denoising steps. This design reduces end-to-end latency and enables memory-efficient streaming generation. For non-Blackwell GPU architectures, LongLive-2.0 deploys sequence-parallel inference to match the speed of Blackwell GPUs, with the quantized KV cache reducing inter-GPU communication overhead. The system also introduces a multi-shot attention sink mechanism to maintain coherence during multi-shot generation. This mechanism uses two cooperating anchor sets: a global sink to preserve the identity of the entire video and a shot-level sink to maintain local coherence within each shot. This design integrates seamlessly with the chunk-wise interactive prompting interface, allowing for minute-scale interactive generation without redundant recomputation. The overall architecture is designed to maximize end-to-end generation speed, a more practical metric than diffusion-model FPS alone, by minimizing the overhead of low-bit KV computation and overlapping VAE decoding with model denoising.

Experiment

The evaluation setup assesses training and inference efficiency, generation quality across short and long videos, and the impact of key architectural and precision design choices. Experiments validate that combining balanced sequence parallelism with NVFP4 quantization significantly reduces memory usage and accelerates training while preserving visual fidelity. Qualitative ablations further confirm that the multi-shot attention mechanism prevents temporal drift and maintains subject consistency across extended sequences, whereas pre-training the quantized precision avoids the detail degradation associated with post-training conversion. Collectively, these findings demonstrate that the optimized framework enables high-throughput, real-time video generation with robust long-range stability and minimal quality loss.

The authors evaluate inference efficiency across different settings, comparing performance metrics such as frames per second, end-to-end generation time, and memory usage across various video lengths. Results show that reducing denoising steps and applying NVFP4 with KV cache and asynchronous decoding improves throughput and reduces latency while maintaining consistent memory footprint. Reducing denoising steps significantly increases inference speed, with the 2-step setting achieving the highest frames per second. NVFP4 with KV cache and asynchronous decoding maintains low memory usage while substantially reducing end-to-end generation time. The 2-step configuration achieves the fastest inference speed and lowest end-to-end latency across all video lengths.

The authors analyze training efficiency by comparing different parallelism strategies and precision settings, showing that sequence parallelism enables longer video training and that combining it with NVFP4 quantization significantly reduces iteration time and memory usage. Results demonstrate that the proposed methods improve scalability and efficiency across various video lengths, with the most significant gains observed at longer sequences. Sequence parallelism enables training on longer videos by reducing memory usage and iteration time compared to baseline methods. Combining NVFP4 quantization with balanced sequence parallelism achieves the fastest training iteration times and lowest memory costs. The proposed approach shows the most significant improvements in efficiency at the longest video lengths, with substantial reductions in both time and memory requirements.

The authors evaluate different precision configurations for video generation models, focusing on memory usage and efficiency. The results show that combining NVFP4 with LoRA reduces peak memory significantly compared to BF16, with the most substantial improvement observed in the final configuration. The ratio of memory reduction indicates a clear efficiency gain when using NVFP4+LoRA over the baseline. NVFP4+LoRA reduces peak memory usage compared to BF16 and NVFP4 configurations. The combination of NVFP4 and LoRA achieves the lowest memory footprint among the tested setups. The memory reduction ratio shows a significant improvement over the baseline configuration.

The authors present a series of ablation studies on training and inference efficiency, focusing on sequence parallelism, quantization, and KV-cache compression. Results show that combining these techniques significantly reduces end-to-end generation time and memory usage, particularly for longer sequences, while maintaining or improving performance. The proposed methods enable efficient high-resolution video generation with reduced latency and memory footprint. Combining sequence parallelism with quantization and KV-cache compression reduces end-to-end generation time and memory usage across different sequence lengths. The integration of NVFP4 quantization and KV-cache compression leads to substantial improvements in inference efficiency with minimal latency cost. The proposed methods achieve strong performance in long-video generation, demonstrating superior consistency and quality compared to baselines.

The authors compare different precision configurations for video generation models, focusing on training and inference efficiency. Results show that using NVFP4 with pre-trained quantization maintains high quality while reducing memory usage and improving speed, particularly when combined with fewer denoising steps. NVFP4 with pre-trained quantization achieves high quality and efficiency, matching BF16 performance while reducing memory usage. Fewer denoising steps improve inference speed significantly, enabling real-time video generation. Pre-trained NVFP4 outperforms post-training quantization in preserving visual details and maintaining quality.

The authors evaluate inference and training efficiency across varying video lengths by systematically testing precision configurations, sequence parallelism, and decoding optimizations. These experiments validate that integrating NVFP4 quantization with sequence parallelism, KV-cache compression, and reduced denoising steps substantially accelerates model execution while significantly lowering memory consumption. The combined approaches enable scalable, high-resolution video generation with minimal latency and preserved visual quality, demonstrating consistent advantages particularly for long-sequence tasks.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp