
Abstract
We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal-attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, prompt-adherent switches; streaming long tuning to enable long-video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shortened to frame sink, which preserves long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench for both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU, and further supports INT8-quantized inference with only marginal quality loss.
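To make the short-window-plus-frame-sink idea concrete, below is a minimal PyTorch sketch of a frame-level attention mask in which every query frame attends causally to its recent window of frames plus a fixed set of initial sink frames. The function name frame_sink_mask and the parameters sink_frames and window_frames are illustrative assumptions, not identifiers from the paper or its code.

    # Minimal sketch of short window attention with a frame-level
    # attention sink ("frame sink"), under the assumptions stated above.
    import torch

    def frame_sink_mask(num_frames: int, sink_frames: int, window_frames: int) -> torch.Tensor:
        """Boolean mask [num_frames, num_frames]: True where query frame q
        may attend to key frame k."""
        q = torch.arange(num_frames).unsqueeze(1)  # query frame indices, column
        k = torch.arange(num_frames).unsqueeze(0)  # key frame indices, row
        causal = k <= q                        # no attention to future frames
        in_window = (q - k) < window_frames    # short local window of recent frames
        is_sink = k < sink_frames              # first frames always remain visible
        return causal & (in_window | is_sink)

    # Example: 8 frames, 1 sink frame, window of 3 recent frames.
    mask = frame_sink_mask(num_frames=8, sink_frames=1, window_frames=3)
    print(mask.int())

Keeping the sink frames attendable bounds the per-step attention cost by the window size while anchoring generation to the opening frames, which is one plausible way such a design can preserve long-range consistency at high frame rates.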