LongLive-2.0: NVFP4 Parallel Infrastructure for Long Video Generation

Abstract

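We present LongLive-2.0, an NVFP4-based parallel infrastructure that spans the full training and inference workflow of long video generation and addresses its speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP. Balanced SP co-designs the teacher-forcing layout with SP execution by pairing a clean history chunk with a noisy target chunk on each rank, which enables a natural teacher-forcing mask that is compatible with SP-aware chunked VAE encoding. Combined with NVFP4 precision, this reduces GPU memory cost during training and accelerates GEMM operations, whose share of the workload grows as video length increases. We further show that high-quality infrastructure and data enable a remarkably clean training pipeline. Unlike Self-Forcing-style methods, which rely on ODE initialization followed by distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive AR diffusion model. The model can then be converted to real-time generation (from 4 to 2 denoising steps) using standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize the KV cache to NVFP4 to save memory, and improve end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed of Blackwell GPUs, and the quantized KV cache reduces the inter-GPU communication of SP. Experiments show speedups of up to 2.15× in training and 1.84× in inference. LongLive-2.0-5B reaches an inference speed of 45.7 FPS while achieving strong benchmark performance. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.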

One-sentence Summary

LongLive-2.0 is an NVFP4-based parallel infrastructure that accelerates long video generation by combining sequence-parallel autoregressive training with W4A4 NVFP4 inference, directly converting diffusion models into interactive autoregressive systems without ODE initialization or distribution matching distillation, and achieving up to 2.15× training and 1.84× inference speedups while the 5B variant reaches 45.7 FPS.

Key Contributions

  • The paper introduces Balanced SP, a sequence-parallel autoregressive training framework that co-designs teacher-forcing layouts with parallel execution by pairing clean-history and noisy-target temporal chunks per rank. This architecture enables SP-aware chunked VAE encoding and directly fine-tunes a diffusion model into a multi-shot interactive autoregressive system without relying on ODE initialization or distribution matching distillation.
  • The system establishes an end-to-end W4A4 NVFP4 inference pipeline that compresses the KV cache into NVFP4 and integrates asynchronous streaming VAE decoding to maximize throughput on Blackwell GPUs. It extends sequence-parallel inference to non-Blackwell architectures to maintain generation speed while reducing inter-GPU communication overhead.
  • Experimental evaluations demonstrate up to 2.15× training acceleration and 1.84× inference speedup, with the LongLive-2.0-5B model achieving 45.7 FPS and strong performance across standard benchmarks. The framework further enables real-time generation by converting the trained model to run with two to four denoising steps using standalone LoRA adapters.

Introduction

Causal autoregressive synthesis has become the standard for streaming long video generation, offering scalable frame-by-frame creation with real-time potential. Despite advances in mitigating exposure bias and temporal drift, prior methods face persistent bottlenecks in memory management, cache overhead, and a critical mismatch between training precision and deployment efficiency. While low-bit formats like FP4 have successfully compressed large language models, they remain largely untested for video diffusion, where extended spatio-temporal sequences, repeated denoising cycles, and growing key-value caches demand strict precision alignment. The authors leverage a unified NVFP4 quantization framework to resolve these bottlenecks, jointly stabilizing training, enabling weight-and-activation 4-bit inference, compressing key-value cache storage, and streamlining long-video deployment.

Dataset

  • Dataset Composition and Sources

    • The authors curate a large-scale long-video dataset from raw footage to train LongLive-2.0.
    • The final collection contains 120,000 videos, each segmented into independent shots.
  • Subset Details and Distribution

    • The dataset is evenly divided into three duration groups: 16 to 32 seconds, 32 to 64 seconds, and over 64 seconds, with each category representing one-third of the total volume.
  • Filtering and Quality Control

    • The authors remove samples exhibiting excessively short shots, logos, watermarks, prominent text, severe camera shake, abnormal playback speeds, exposure issues, blur, and low-motion clips with frozen frames or minimal zoom.
    • Visual quality is assessed using the MANIQA metric, where the average score across sampled frames determines the overall rating. Only the highest-ranked videos are retained.
  • Metadata Construction and Processing

    • Each shot receives structured captions covering visual elements, scene context, characters, actions, and cinematography.
    • The authors merge captions from all shots within a single video and refine the combined text to ensure temporal coherence and logical consistency across consecutive frames and scenes.
  • Model Usage

    • The curated dataset serves as the primary training resource for LongLive-2.0, with the authors leveraging the shot-level annotations and temporally aligned long-form descriptions to optimize model performance.
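
The duration grouping and quality ranking described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record fields (`id`, `duration`, `frame_scores`), the handling of the 32 s and 64 s boundaries, and the `keep_per_bucket` cutoff are all assumptions, and `frame_scores` stands in for per-frame MANIQA scores computed elsewhere.

```python
from statistics import mean


def duration_bucket(seconds):
    """Map a video duration to one of the three groups named in the text.

    Boundary handling (32 s goes to the middle group, 64 s to the longest)
    is an assumption; the source does not specify it.
    """
    if 16 <= seconds < 32:
        return "16-32s"
    if 32 <= seconds < 64:
        return "32-64s"
    if seconds >= 64:
        return "64s+"
    return None  # shorter than 16 s: outside all three groups


def select_top_videos(videos, keep_per_bucket):
    """Rank videos by mean per-frame quality score and keep the best per group.

    Keeping the same number per group yields the even three-way split
    described in the text (assuming each group has enough candidates).
    """
    buckets = {"16-32s": [], "32-64s": [], "64s+": []}
    for video in videos:
        bucket = duration_bucket(video["duration"])
        if bucket is None:
            continue
        # The video-level rating is the average over its sampled frames.
        score = mean(video["frame_scores"])
        buckets[bucket].append((score, video["id"]))

    selected = {}
    for name, scored in buckets.items():
        scored.sort(reverse=True)  # highest mean score first
        selected[name] = [video_id for _, video_id in scored[:keep_per_bucket]]
    return selected
```

Ranking within each duration group, rather than globally, is one simple way to keep the three groups balanced after filtering.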

Method

The LongLive-2.0 framework presents a co-designed infrastructure for efficient long video generation, integrating a novel training methodology with a parallel inference system. The core of the training process is sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the data layout with the sequence-parallel execution to address memory and computational bottlenecks. This approach ensures that each GPU rank is responsible for both clean and noisy latent tokens from the same temporal chunk, balancing the loss-bearing workload across devices and enabling a natural teacher-forcing attention mask. This paired layout is applied consistently across VAE encoding, latent construction, and loss computation, eliminating the need for replicated VAE preparation and ensuring that the sequence sharding is aligned with the DiT's attention mechanism. The training process is further accelerated by NVFP4 precision, which reduces memory footprint and speeds up GEMM operations, particularly as video length increases. The authors leverage this infrastructure to directly fine-tune a bidirectional diffusion model into a long, interactive, multi-shot AR model, bypassing the complex multi-stage processes of prior methods. The resulting model can be converted to real-time generation with few-step denoising using standalone LoRA weights, which are derived through a simplified DMD distillation process that optimizes only the LoRA adapters.
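
To make the paired layout concrete, the sketch below assigns temporal chunks to ranks so that every rank holds both the clean and the noisy copy of its chunks, then builds a chunk-level teacher-forcing mask. The mask rule used here (clean chunks attend causally to clean chunks; a noisy chunk attends to strictly earlier clean chunks and to itself) is a common teacher-forcing convention and is an assumption, as are the contiguous chunk assignment and all function names; the source does not give these details.

```python
def balanced_sp_layout(num_chunks, world_size):
    """Assign chunks to ranks; each rank holds clean and noisy copies of its chunks."""
    if num_chunks % world_size != 0:
        raise ValueError("this sketch assumes num_chunks is divisible by world_size")
    per_rank = num_chunks // world_size
    layout = []
    for rank in range(world_size):
        chunks = range(rank * per_rank, (rank + 1) * per_rank)
        layout.append({
            "rank": rank,
            "clean": [("clean", c) for c in chunks],  # history tokens, no loss
            "noisy": [("noisy", c) for c in chunks],  # target tokens, carry the loss
        })
    return layout


def can_attend(query, key):
    """Chunk-level teacher-forcing rule (assumed, see lead-in)."""
    q_kind, q_idx = query
    k_kind, k_idx = key
    if q_kind == "clean":
        # Clean history is causal over clean chunks only.
        return k_kind == "clean" and k_idx <= q_idx
    if k_kind == "clean":
        # A noisy target sees only strictly earlier clean history.
        return k_idx < q_idx
    # A noisy target also sees its own noisy chunk, never other noisy chunks.
    return k_idx == q_idx


def teacher_forcing_mask(layout):
    """Return the token order across ranks and the boolean attention mask."""
    tokens = [t for rank in layout for t in rank["clean"] + rank["noisy"]]
    mask = [[can_attend(q, k) for k in tokens] for q in tokens]
    return tokens, mask
```

Because every rank owns the same number of noisy chunks, the loss-bearing work is spread evenly, which is the balance property the paragraph above describes.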

For inference, LongLive-2.0 employs a multi-faceted strategy to achieve high throughput and low latency. On Blackwell GPUs, the system enables W4A4 NVFP4 inference, in which both weights and activations use 4-bit NVFP4, and additionally quantizes the key-value (KV) cache to NVFP4; together these significantly reduce memory usage and accelerate computation. The KV cache is quantized at the frame-chunk level, with each chunk containing eight frames, and a customized parallel CUDA dequantization kernel reconstructs the cache for efficient in-window attention. To further improve throughput, the framework implements asynchronous streaming VAE decoding. This heterogeneous pipeline dedicates one GPU to VAE decoding, which runs concurrently with the DiT inference cluster and so hides the decoding latency behind the dominant DiT denoising steps. This design reduces end-to-end latency and enables memory-efficient streaming generation. For non-Blackwell GPU architectures, LongLive-2.0 deploys sequence-parallel inference to match the speed of Blackwell GPUs, with the quantized KV cache reducing inter-GPU communication overhead. The system also introduces a multi-shot attention sink mechanism to maintain coherence during multi-shot generation. This mechanism uses two cooperating anchor sets: a global sink that preserves the identity of the entire video and a shot-level sink that maintains local coherence within each shot. The design integrates with the chunk-wise interactive prompting interface, allowing minute-scale interactive generation without redundant recomputation. The overall architecture targets end-to-end generation speed, a more practical metric than diffusion-model FPS alone, by minimizing the overhead of low-bit KV computation and overlapping VAE decoding with model denoising.
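
As a rough illustration of what quantizing a KV chunk to NVFP4 involves, the sketch below simulates block-scaled 4-bit quantization in plain Python. It assumes the commonly described NVFP4 layout: E2M1 values (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with one scale per block of 16 elements. It simplifies in several ways that are not from the source: the scale is kept as an ordinary float rather than rounded to FP8, the per-tensor scale is omitted, ties round toward the smaller magnitude, and the input is a flat list standing in for a flattened eight-frame KV chunk.

```python
# Magnitudes representable by a 4-bit E2M1 value (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # elements sharing one scale (assumed NVFP4 block size)


def quantize_block(values):
    """Quantize one block; return (scale, codes) with codes on the E2M1 grid."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def quantize(values):
    """Split a flat list into blocks of BLOCK_SIZE and quantize each block."""
    return [
        quantize_block(values[start:start + BLOCK_SIZE])
        for start in range(0, len(values), BLOCK_SIZE)
    ]


def dequantize(blocks):
    """Reconstruct approximate values from (scale, codes) blocks."""
    return [scale * code for scale, codes in blocks for code in codes]
```

In a real system the dequantization step would run as a GPU kernel over the chunks inside the attention window; here it is a list comprehension purely to show the arithmetic.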

Experiment

The evaluation assesses training and inference efficiency, generation quality on short and long videos, and the impact of key architectural and precision design choices. Experiments validate that combining balanced sequence parallelism with NVFP4 quantization significantly reduces memory usage and accelerates training while preserving visual fidelity. Qualitative ablations further confirm that the multi-shot attention sink mechanism prevents temporal drift and maintains subject consistency across extended sequences, and that using the quantized precision during training avoids the detail degradation associated with post-training conversion. Collectively, these findings demonstrate that the optimized framework enables high-throughput, real-time video generation with robust long-range stability and minimal quality loss.
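
The multi-shot attention sink from the Method section can be pictured as a rule for choosing which earlier chunks stay visible to the chunk being generated. The sketch below is one hypothetical reading: the global sink is the first chunk(s) of the video, the shot-level sink is the first chunk(s) of the current shot, and a sliding window covers the most recent chunks. The sink sizes, the window size, the behavior at shot boundaries, and the function name are assumptions, not details from the source.

```python
def kv_chunks_to_keep(current_chunk, shot_starts, local_window,
                      global_sink=1, shot_sink=1):
    """Return sorted indices of earlier chunks whose KV entries stay visible.

    shot_starts lists the first chunk index of every shot, starting with 0.
    """
    if current_chunk < 0 or not shot_starts or shot_starts[0] != 0:
        raise ValueError("need current_chunk >= 0 and shot_starts beginning at 0")

    # First chunk of the shot that contains the current chunk.
    shot_start = max(s for s in shot_starts if s <= current_chunk)

    keep = set()
    # Global sink: anchors the identity of the whole video.
    keep.update(range(0, min(global_sink, current_chunk)))
    # Shot-level sink: anchors coherence within the current shot.
    keep.update(range(shot_start, min(shot_start + shot_sink, current_chunk)))
    # Local window: the most recent chunks before the current one.
    keep.update(range(max(0, current_chunk - local_window), current_chunk))
    return sorted(keep)
```

For example, with shots starting at chunks 0, 10, and 20, generating chunk 25 with a three-chunk window would keep chunks 0, 20, 22, 23, and 24 under these assumptions.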

The authors evaluate inference efficiency across settings, comparing frames per second, end-to-end generation time, and memory usage at several video lengths. Reducing the number of denoising steps significantly increases inference speed: the 2-step setting achieves the highest frames per second and the lowest end-to-end latency at every video length. Applying NVFP4 together with KV-cache compression and asynchronous decoding substantially reduces end-to-end generation time while keeping memory usage low and consistent.
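
The latency benefit of asynchronous decoding comes from overlapping VAE decoding of one chunk with denoising of the next. The sketch below shows that producer and consumer structure with a thread and a bounded queue. It is only an analogy: in the system described, decoding runs on a dedicated GPU, whereas here `denoise_chunk` and `decode_chunk` are hypothetical stand-ins that manipulate strings, and `max_pending` is an assumed buffer size.

```python
import queue
import threading


def denoise_chunk(index):
    """Stand-in for DiT denoising of one latent chunk (hypothetical)."""
    return f"latent-{index}"


def decode_chunk(latent):
    """Stand-in for VAE decoding of one latent chunk (hypothetical)."""
    return latent.replace("latent", "frames")


def generate_streaming(num_chunks, max_pending=2):
    """Denoise chunks in the main thread while a worker decodes finished ones."""
    pending = queue.Queue(maxsize=max_pending)
    decoded = []

    def decoder_worker():
        while True:
            latent = pending.get()
            if latent is None:  # sentinel: no more chunks
                break
            decoded.append(decode_chunk(latent))

    worker = threading.Thread(target=decoder_worker)
    worker.start()
    try:
        for index in range(num_chunks):
            # put() blocks only when the decoder has fallen behind.
            pending.put(denoise_chunk(index))
    finally:
        pending.put(None)
        worker.join()
    return decoded
```

With a single consumer and a FIFO queue, frames come out in generation order, so the decoded stream can be displayed as soon as each chunk is ready.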

The authors analyze training efficiency by comparing parallelism strategies and precision settings. Sequence parallelism makes training on longer videos feasible by reducing memory usage and iteration time relative to the baseline methods. Combining balanced sequence parallelism with NVFP4 quantization achieves the fastest iteration times and the lowest memory cost, and the reductions in both time and memory are largest at the longest video lengths.

The authors evaluate precision configurations for video generation models in terms of peak memory. NVFP4+LoRA achieves the lowest peak memory among the tested setups, below both the BF16 and the NVFP4 configurations, and its memory-reduction ratio indicates a clear efficiency gain over the baseline.

The authors present a series of ablation studies on training and inference efficiency, focusing on sequence parallelism, quantization, and KV-cache compression. Combining these techniques reduces end-to-end generation time and memory usage across sequence lengths, particularly for longer sequences, while maintaining or improving performance. NVFP4 quantization and KV-cache compression together yield substantial gains in inference efficiency at minimal latency cost. The resulting system also achieves strong long-video generation performance, with better consistency and quality than the baselines.

The authors compare precision configurations for video generation models, focusing on training and inference efficiency. NVFP4 with pre-trained quantization matches BF16 quality while reducing memory usage and improving speed, and it preserves visual detail better than post-training quantization. Fewer denoising steps further improve inference speed significantly, enabling real-time video generation.

The authors evaluate inference and training efficiency across varying video lengths by systematically testing precision configurations, sequence parallelism, and decoding optimizations. These experiments validate that integrating NVFP4 quantization with sequence parallelism, KV-cache compression, and reduced denoising steps substantially accelerates model execution while significantly lowering memory consumption. The combined approaches enable scalable, high-resolution video generation with minimal latency and preserved visual quality, demonstrating consistent advantages particularly for long-sequence tasks.

