Technical Report on LongCat-Video-Avatar 1.5
Meituan LongCat Team
LongCat-Video-Avatar 1.5 Digital Human Model
Abstract
Despite advances in audio-driven video generation, achieving stable, commercial-grade quality remains a challenge. We present LongCat-Video-Avatar 1.5, an enhanced open-source framework that prioritizes systematic engineering and production readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and carefully scaling our training recipes, v1.5 achieves accurate lip synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF training, the model generalizes readily to stylized domains such as anime and animals, and natively handles complex real-world conditions such as multi-person interactions and object manipulation. Furthermore, to address the practical demands of industrial deployment, we employ advanced step distillation to accelerate inference to an optimal 8 NFE, striking a favorable trade-off between serving efficiency and visual fidelity. The superiority of our approach is validated by extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. The results show that v1.5 achieves competitive or superior performance against leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) in human-likeness scores and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 bridges the gap between academic research prototypes and commercially deployable systems.
One-sentence Summary
Meituan LongCat Team presents LongCat-Video-Avatar 1.5, an open-source framework prioritizing production readiness for audio-driven video generation that integrates Whisper Large audio encoders, RLHF training, and step distillation to accelerate inference to 8 NFE while ensuring accurate lip-synchronization, full-body temporal stability, and strict identity consistency across stylized and real-world domains, validated through rigorous human evaluation on over 500 diverse test cases where it demonstrates competitive or superior performance against leading closed-source systems such as HeyGen, OmniHuman 1.5, and Kling Avatar 2.0, ultimately narrowing the gap between academic research prototypes and commercial-grade deployment.
Key Contributions
- The paper presents LongCat-Video-Avatar 1.5, an open-source framework designed for commercial-grade stability in audio-driven video generation. Upgrading the audio encoder to Whisper Large and scaling training recipes enables accurate lip-synchronization and strict identity consistency in long videos.
- Advanced step distillation is employed to accelerate inference to an optimal 8 NFE, balancing serving efficiency with visual fidelity. Group-Relative Policy Optimization is further integrated to enhance generation quality and generalization to stylized domains.
- Rigorous validation is provided through extensive quantitative metrics and human evaluation on a comprehensive benchmark of over 500 diverse test cases. The model demonstrates competitive or superior performance against leading closed-source systems across human-likeness and expert-level quality assessments.
Introduction
Audio-driven human animation is essential for digital humans and virtual communication, yet achieving commercial-grade stability remains a significant hurdle. Existing models often struggle with long-horizon identity consistency and robustness in complex scenarios like multi-person interactions or object handling. The authors present LongCat-Video-Avatar 1.5, an open-source framework designed to bridge the gap between academic prototypes and production-ready systems. They leverage an upgraded Whisper Large audio encoder to improve lip synchronization and employ Group-Relative Policy Optimization to align generation quality with human preferences. Additionally, the team implements advanced step distillation to accelerate inference to just 8 NFEs while maintaining visual fidelity. This systematic engineering approach yields competitive or superior naturalness and stability relative to leading closed-source systems on the authors' benchmark.
Dataset
- The authors construct a multi-stage general data pipeline to support stable and controllable single-person avatar generation.
- Raw videos are organized by functional contribution across six categories including close-up face, interview, acted performance, interaction, music, and animation.
- A unified annotation schema maps heterogeneous videos into a reusable representation space covering human structure, audio quality, and visual artifacts.
- Offline annotation precomputes stable attributes such as face location and lip synchronization confidence to enable content-based selection.
- Online validation performs clip-level quality control to filter duration, brightness, and motion consistency before training inputs are finalized.
- Three specialized pipelines address specific challenges beyond the general framework to improve generation quality.
- Multi-person data utilizes ByteTrack and Active Speaker Detection to isolate non-overlapping single-speaker segments and exclude concurrent activity.
- Silent data requires consensus between Qwen3-Omni and Qwen3-VL models to confirm non-speaking states across all clips in a video.
- Emotion data follows a six-category taxonomy refined by EmotiEffLib with a confidence threshold greater than 0.7.
- Hard exclusion rules remove synthetic content or videos with multiple subjects from the emotion subset.
- Context-aware annotations describe spatial environment and physical movement evolution objectively for the emotion subset.
- Structured metadata is converted into textual conditions that include camera behavior and visual style alongside content descriptions.
- Training stages select samples based on task-specific attributes like body composition or lip visibility rather than coarse source rules.
- This approach enables the model to learn relationships among semantic content, human motion, and camera language through interpretable filtering.
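The emotion and silent-data rules listed above can be illustrated with a minimal sketch. The `ClipAnnotation` fields and both function names are hypothetical, chosen only for illustration; the 0.7 threshold, the exclusion of synthetic and multi-subject videos, and the two-model consensus over all clips come from the description, while the handling of empty or mismatched vote lists is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence

EMOTION_CONF_THRESHOLD = 0.7  # from the text: confidence must be greater than 0.7


@dataclass
class ClipAnnotation:
    # Hypothetical annotation record; field names are illustrative only.
    clip_id: str
    emotion_label: str
    emotion_confidence: float
    is_synthetic: bool
    num_subjects: int


def keep_for_emotion_subset(clip: ClipAnnotation) -> bool:
    """Apply the hard exclusion rules first, then the confidence threshold."""
    if clip.is_synthetic or clip.num_subjects > 1:
        return False
    return clip.emotion_confidence > EMOTION_CONF_THRESHOLD


def is_silent_video(omni_votes: Sequence[bool], vl_votes: Sequence[bool]) -> bool:
    """A video counts as silent only if both annotators mark every clip as non-speaking.

    Each list holds one boolean per clip (True = non-speaking). Empty or
    mismatched inputs are rejected, which is an assumption of this sketch.
    """
    if len(omni_votes) == 0 or len(omni_votes) != len(vl_votes):
        return False
    return all(omni_votes) and all(vl_votes)
```

Keeping the rules as small predicates over precomputed annotations mirrors the offline-annotation idea: the expensive model calls happen once, and each training stage can recombine the stored attributes cheaply.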
Method
The authors leverage a unified DiT-based video diffusion architecture, inheriting the foundation from LongCat-Video-Avatar 1.0. The model is constructed upon a 3D Variational Autoencoder (VAE), where each Diffusion Transformer (DiT) block integrates 3D self-attention, text cross-attention, and a Feed-Forward Network (FFN). Text embeddings are processed via a UMT5 encoder, while 3D Rotary Position Embeddings (RoPE) are applied to visual tokens to capture spatiotemporal positional information. The complete network architecture is illustrated below.
The unified architecture supports multiple audio-driven human animation tasks through flexible input configurations. The network accepts three types of latent sequences: reference latents for identity, motion latents for context, and noise latents for generation. For text-to-video tasks, only noise latents are provided. For text-image-to-video generation, the reference latent is temporally concatenated with noise latents. For video continuation, context latents are concatenated with noise latents to serve as conditioning signals. To enable audio-driven generation, an additional audio cross-attention layer is inserted after the text cross-attention module within each DiT block. An Adaptive Layer Normalization (adaLN) module precedes this layer to function as a gating mechanism, ensuring stable optimization and preventing catastrophic forgetting of visual priors while aligning audio signals with mouth movements.
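The task-dependent input configurations can be sketched as a single assembly function. This is an illustrative sketch, not the authors' implementation: the task names, the `(time, channels, height, width)` latent layout, and placing the conditioning latents before the noise latents are all assumptions, since the description only states that the sequences are concatenated temporally.

```python
import numpy as np


def assemble_latents(task, noise, reference=None, context=None):
    """Build the latent input sequence for one task.

    All latents are assumed to have shape (time, channels, height, width),
    and concatenation happens along the time axis (axis 0).
    """
    if task == "text_to_video":
        return noise  # only noise latents are provided
    if task == "text_image_to_video":
        if reference is None:
            raise ValueError("text_image_to_video needs a reference latent")
        return np.concatenate([reference, noise], axis=0)
    if task == "video_continuation":
        if context is None:
            raise ValueError("video_continuation needs context latents")
        return np.concatenate([context, noise], axis=0)
    raise ValueError(f"unknown task: {task}")
```

For example, with a one-frame reference latent and 20 noise frames, the text-image-to-video sequence has 21 latent frames, and the model would only need to denoise the last 20.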
For audio feature extraction, the system upgrades from Wav2Vec2 to Whisper-large, leveraging its 1.5B parameters and multilingual robustness. To handle audio streams exceeding the 30-second context limit, a sliding window strategy partitions the input spectrogram. The resulting hidden states undergo grouped mean pooling to compress the representation into a compact 5-channel feature set. These features are temporally resampled to 25 FPS and passed through an audio projector to match the video latent sequence length, ensuring strict temporal alignment.
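The windowing, pooling, and resampling steps might look roughly like the following NumPy sketch. Several details are assumptions rather than facts from the report: the function names are hypothetical, grouped mean pooling is interpreted here as averaging the encoder's per-layer hidden states within 5 contiguous layer groups, the windows are indexed in frames with a configurable stride, and resampling uses linear interpolation from an assumed 50 Hz encoder frame rate to 25 FPS. The learned audio projector is omitted.

```python
import numpy as np


def window_bounds(num_frames, window, stride):
    """Return (start, end) frame indices of sliding windows covering the input."""
    starts = range(0, max(num_frames - window, 0) + stride, stride)
    return [(s, min(s + window, num_frames)) for s in starts if s < num_frames]


def grouped_mean_pool(hidden, num_groups=5):
    """hidden: (layers, time, dim) -> (num_groups, time, dim).

    Layers are split into contiguous groups and averaged within each group.
    """
    groups = np.array_split(hidden, num_groups, axis=0)
    return np.stack([g.mean(axis=0) for g in groups], axis=0)


def resample_time(features, src_rate, dst_rate):
    """Linearly interpolate (groups, time, dim) features to a new frame rate."""
    g, t, d = features.shape
    new_t = int(round(t / src_rate * dst_rate))
    src_times = np.arange(t) / src_rate
    dst_times = np.arange(new_t) / dst_rate
    flat = features.transpose(1, 0, 2).reshape(t, g * d)
    cols = [np.interp(dst_times, src_times, flat[:, k]) for k in range(g * d)]
    out = np.stack(cols, axis=1)
    return out.reshape(new_t, g, d).transpose(1, 0, 2)
```

A 70-second clip at 100 spectrogram frames per second would be split by `window_bounds(7000, 3000, 3000)` into two full 30-second windows and one 10-second remainder; each window's hidden states would then be pooled and resampled before projection.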
The training pipeline consists of three progressive stages: Base Model Training, RLHF Training, and Acceleration Training. Base Model Training utilizes a flow matching framework to synthesize temporally coherent and identity-preserving video conditioned on speech. This stage progresses from low-resolution pretraining to high-resolution synthesis, followed by the introduction of reference images for identity preservation and multi-person dialogue datasets for conversational scenarios.
Following base training, the model undergoes Reinforcement Learning from Human Feedback (RLHF) using Group-Relative Policy Optimization (GRPO). This approach shifts from video-level to per-frame reward modeling, decomposing rewards along temporal partitions. The effective relative advantage is calculated as a weighted sum of individual relative advantages, allowing for finer-grained credit assignment to address local motion inconsistencies or structural collapse. The training also incorporates a first-frame hand-presence check and a multi-clip rollout strategy to support long-horizon video continuation.
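A minimal sketch of the partition-wise advantage computation follows, assuming the usual GRPO normalization (subtract the group mean and divide by the group standard deviation) is applied independently to each temporal partition. The function names and the partition weights are hypothetical; the report's summary does not specify how the weights are chosen.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """rewards: (group_size, num_partitions).

    Each column holds one temporal partition's rewards for all rollouts in
    the group; every column is normalized across the group independently.
    """
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    return (rewards - mean) / (std + eps)


def effective_advantage(rewards, weights):
    """Weighted sum of per-partition relative advantages, one value per rollout."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return group_relative_advantages(rewards) @ weights
```

Because each partition is normalized separately, a rollout that collapses in only one segment receives a negative contribution from that segment alone, rather than having the whole video penalized by a single video-level score.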
Finally, Acceleration Training employs Distribution Matching Distillation 2 (DMD2) to distill the multi-step diffusion model into an efficient few-step generator. To overcome VRAM bottlenecks, a parameter-efficient architecture is used where a single base DiT backbone is equipped with multiple LoRA adapters (Generator and Fake Score). The model is distilled to 8 denoising steps, balancing inference speed with generation quality. For multi-person conversations, the L-RoPE mechanism associates specific speaker regions with their audio conditions, while a silent audio track is assigned to background characters to prevent unintended lip movements.
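The memory-saving idea of sharing one backbone between several LoRA adapters can be illustrated on a single linear layer. This is a toy NumPy sketch under stated assumptions, not the authors' DMD2 setup: the class name, the adapter names, the rank, and the standard LoRA initialization (random `A`, zero `B`) are all illustrative choices.

```python
import numpy as np


class SharedBackboneLinear:
    """One frozen base weight plus several named low-rank (LoRA) adapters."""

    def __init__(self, in_dim, out_dim, rank, adapter_names, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base weight, stored once and shared by every adapter.
        self.weight = rng.normal(size=(out_dim, in_dim))
        # Each adapter adds only rank * (in_dim + out_dim) extra parameters.
        self.adapters = {
            name: {
                "A": rng.normal(scale=0.01, size=(rank, in_dim)),
                "B": np.zeros((out_dim, rank)),  # zero init: adapter starts as a no-op
            }
            for name in adapter_names
        }

    def forward(self, x, adapter=None):
        y = x @ self.weight.T
        if adapter is not None:
            a = self.adapters[adapter]
            y = y + (x @ a["A"].T) @ a["B"].T  # low-rank update B @ A applied to x
        return y
```

Calling `forward(x, adapter="generator")` and `forward(x, adapter="fake_score")` then reuses the same base weights for both roles, so only the small adapter matrices differ between the two networks being trained.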
Experiment
The study establishes a human evaluation benchmark comprising 508 image-audio pairs to assess virtual human generation across dimensions such as rationality and harmony. Qualitative results indicate that LongCat-Video-Avatar 1.5 achieves superior stability and identity preservation compared to state-of-the-art methods, although gaps in physical rationality and synchronization persist across the industry. Additionally, comparisons between the base and accelerated versions highlight a trade-off: the faster variant prioritizes visual stability, while the base model offers greater motion diversity and lip alignment accuracy.
The table compares the standard Base model against the accelerated Fast variant, revealing a trade-off between expressive richness and generation stability. The Base model retains a slight edge in single-person human-likeness and audio-visual harmony, whereas the Fast variant shows significantly lower issue rates in stability and physical rationality. The Fast variant also scores marginally higher on multi-person human-likeness, suggesting that it holds up well in more complex interactions.