One year ago

MegActor: Harness the Power of Raw Video for Vivid Portrait Animation

Shurong Yang, Huadong Li, Juhao Wu, Minhao Jing, Linze Li, Renhe Ji, Jiajun Liang, Haoqiang Fan

Abstract

Although raw driving videos carry richer information about facial expressions than intermediate representations such as landmarks, they are rarely studied in portrait animation. Two challenges inherent to portrait animation driven by raw videos explain this: 1) significant identity leakage; 2) irrelevant background and facial details, such as wrinkles, that degrade performance. To harness the power of raw videos for vivid portrait animation, we propose a novel conditional diffusion model named MegActor. First, we introduce a synthetic data generation framework that creates videos with consistent motion and expressions but inconsistent identities (IDs), in order to mitigate identity leakage. Second, we segment the foreground and background of the reference image and use CLIP to encode the background details. This encoded information is then integrated into the network through a text embedding module, which keeps the background stable. Finally, we transfer the appearance style of the reference image to the driving video to remove the influence of facial details present in the driving videos. Our final model was trained exclusively on public datasets and achieves results comparable to those of commercial models. We hope this will help the open-source community.

One-sentence Summary

MegActor is a conditional diffusion model that generates vivid portrait animations from raw driving videos by leveraging a synthetic data framework that creates videos with consistent motion and expressions but inconsistent identities to mitigate identity leakage, integrating CLIP-encoded background segmentation via text embeddings to stabilize the background, and applying appearance style transfer from the reference image to the driving video to eliminate distracting facial details, ultimately achieving performance comparable to commercial models while trained exclusively on public datasets.

Key Contributions

  • This work introduces MegActor, a conditional diffusion model that generates portrait animations from raw driving videos by employing a synthetic data generation framework to decouple motion control from subject identity and mitigate identity leakage.
  • Robustness to irrelevant background and facial details is achieved through a background segmentation and CLIP encoding module, combined with a style transfer process that maps the reference appearance onto the driving frames to filter out visual noise.
  • Trained exclusively on public datasets, the framework achieves animation quality and identity preservation comparable to commercial models, as demonstrated through SOTA comparisons and visual evaluations.

Introduction

Portrait animation transfers motion and facial expressions from a driving video to a target image while preserving identity and background, powering applications such as digital avatars and AI-driven conversations. Recent diffusion-based approaches that use text, image, or audio controls have improved visual quality, but they struggle with subtle facial movements, rely on unstable external pose detectors, or suffer from identity leakage when trained on raw video data. To overcome these bottlenecks, the authors introduce MegActor, a conditional diffusion model that uses raw driving videos directly for expressive portrait animation. They address identity leakage through a synthetic data framework that decouples motion from appearance, stabilize the background with CLIP-encoded background features injected through the text embedding module, and apply style transfer to filter irrelevant facial details out of the driving footage. The resulting animations are reported to be comparable to those of commercial systems while relying solely on publicly available training data.

Dataset

  • Dataset Composition and Sources: The authors train the model using publicly available video datasets, specifically VFHQ and CelebV-HQ. To address identity and background leakage, they supplement these real videos with synthetically generated data created via Face-Fusion and SDXL.
  • Subset Details and Filtering: The real data originates from VFHQ and CelebV-HQ. The synthetic subsets include AI face-swapping videos generated by pairing each driving frame with a source image from a different individual, and stylized videos produced with SDXL. The authors also apply L2CSNet to measure gaze shifts across frames, isolating approximately 5 percent of the data that exhibits significant eye movements for specialized fine-tuning.
  • Training Usage and Mixture Ratios: During the initial training stage, the model consumes a blended mixture of 50 percent real videos, 40 percent AI face-swapping videos, and 10 percent stylized videos. The authors sample frames using a stride of 2 to create 16-frame segments, where one frame serves as the reference and the remaining frames act as both the driving input and ground truth. In the second stage, the model fine-tunes exclusively on the high-gaze subset using a stride of 12 while maintaining the 16-frame segment length.
  • Processing and Augmentation Strategies: To minimize background leakage, the authors use pyFacer to detect faces and mask all non-facial pixels to black. They also apply random augmentations to the driving videos, including grayscale conversion and adjustments to size and aspect ratio, which modify facial structure without altering expressions or head poses. All video frames are resized to 512 by 512 pixels before training.
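The sampling scheme described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the authors' data loader: the names `MIXTURE`, `pick_source`, and `sample_segment` are hypothetical, and choosing the reference frame uniformly at random within the segment is an assumption, since the summary only says that one frame serves as the reference.

```python
import random

# Stage-1 mixture ratios as reported in the summary above.
MIXTURE = {"real": 0.5, "face_swap": 0.4, "stylized": 0.1}


def pick_source(rng):
    """Draw a data source name according to the mixture ratios."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def sample_segment(num_frames, rng, stride=2, length=16):
    """Return (reference_index, driving_indices) for one training segment.

    Picks `length` frame indices spaced `stride` apart, then holds one out
    as the reference; the rest act as driving input and ground truth.
    """
    span = (length - 1) * stride + 1  # frames covered by one segment
    if num_frames < span:
        raise ValueError("video too short for the requested segment")
    start = rng.randrange(num_frames - span + 1)
    indices = [start + i * stride for i in range(length)]
    ref_pos = rng.randrange(length)  # assumption: uniform choice of reference
    reference = indices[ref_pos]
    driving = indices[:ref_pos] + indices[ref_pos + 1:]
    return reference, driving
```

With the second-stage setting (`stride=12`), a 16-frame segment spans 181 source frames, so a clip must be at least that long to be sampled.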

Method

The authors leverage a conditional diffusion model architecture, referred to as MegActor, to achieve vivid portrait animation driven by raw video inputs. The overall framework is designed to address two primary challenges in using raw driving videos: identity leakage and the degradation of performance due to irrelevant background and facial details. The system operates by first processing the reference image and the driving video through distinct pipelines before integrating their features into a unified denoising process.

The reference image is processed to extract identity and background information. A ReferenceNet, based on the UNet architecture of Stable Diffusion 1.5 (SD1.5), is used to encode fine-grained identity and background features. Concurrently, the background region of the reference image is isolated and encoded using CLIP’s image encoder. This encoded background information is then integrated into the model via a text embedding module, replacing the standard text prompt. The extracted global (CLS) and local patch features from CLIP are merged and injected into both the ReferenceNet and the Denoising UNet through cross-attention mechanisms to stabilize the background in the generated output.
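To make the background conditioning concrete, here is a small pure-Python sketch of cross-attention over merged CLIP tokens. It rests on two assumptions not confirmed by the summary: that "merged" means concatenating the CLS vector with the patch vectors along the token axis, and that a single attention head without learned query/key/value projections is enough for illustration. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def merge_clip_tokens(cls_token, patch_tokens):
    """Assumed merge: prepend the global CLS vector to the patch vectors."""
    return [cls_token] + list(patch_tokens)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention without learned projections.

    Each query (a UNet spatial feature) becomes a weighted average of the
    context tokens (background embeddings used in place of a text prompt).
    """
    dim = len(context[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * tok[d] for w, tok in zip(weights, context))
                    for d in range(dim)])
    return out
```

In a real network the queries, keys, and values would each pass through learned linear layers first; the sketch only shows where the background tokens enter the computation.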

For the driving video, a lightweight DrivenEncoder is employed to extract motion features. This encoder consists of four 2D convolutional layers with varying channel sizes and is designed to efficiently process the video frames. The motion features are aligned to the resolution of the noise latents sampled from the diffusion process. To preserve the spatial structure of the pre-trained Denoising UNet, the authors initialize the parameters of the newly added channels in the conv-in layer to zero. The DrivenEncoder is further enhanced by incorporating the reference image as a guide during motion feature extraction. The latent representation of the reference image, obtained via a Variational Autoencoder (VAE), and a foreground mask derived from DensePose are concatenated with the noise latents and motion features before being fed into the Denoising UNet. This ensures that the motion transfer respects the identity of the reference character.
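The effect of zero-initializing the new conv-in channels can be checked with a toy example. The sketch below uses a 1x1 kernel at a single spatial location as a stand-in for the real convolution, and the channel counts and values are made up; the point is only that the widened layer reproduces the pre-trained output exactly at the start of training, whatever the extra inputs contain.

```python
def conv1x1(weight, pixel):
    """Apply a 1x1 convolution at one spatial location: out = W @ pixel."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weight]


def expand_conv_in(weight, extra_in_channels):
    """Widen the input side of a conv weight with zero-filled columns."""
    return [row + [0.0] * extra_in_channels for row in weight]


# Hypothetical sizes: 4 noise-latent channels in, 3 channels out.
pretrained = [[0.1, -0.2, 0.3, 0.4],
              [0.5, 0.6, -0.7, 0.8],
              [-0.9, 1.0, 1.1, 1.2]]
noise_latent = [1.0, 2.0, 3.0, 4.0]
extra = [9.0, -9.0, 5.0]  # stand-in for motion / reference / mask channels

expanded = expand_conv_in(pretrained, len(extra))
before = conv1x1(pretrained, noise_latent)
after = conv1x1(expanded, noise_latent + extra)
print(before == after)  # the concatenated channels have no effect yet
```

As training proceeds, the zero columns receive gradients and the network gradually learns to use the concatenated conditions.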

To improve temporal consistency across generated frames, a temporal module is inserted after each Res-Trans layer of the Denoising UNet. This module performs temporal attention between frames, capturing temporal dependencies and enhancing continuity in the animation. The temporal module is fine-tuned separately to optimize its performance without disrupting the pre-trained image generation capabilities of the base model.
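A rough sketch of what attention along the time axis means is given below. It is an illustration under simplifying assumptions (single head, no learned projections, no residual connection), and `temporal_attention` is a hypothetical name rather than the authors' module.

```python
import math


def temporal_attention(frames):
    """Self-attention along the time axis only.

    frames[t][p] is the feature vector at spatial position p of frame t.
    For each position p, every frame attends to that same position in all
    frames of the segment, so information mixes across time but not space.
    """
    num_frames, num_pos = len(frames), len(frames[0])
    dim = len(frames[0][0])
    out = [[None] * num_pos for _ in range(num_frames)]
    for p in range(num_pos):
        seq = [frames[t][p] for t in range(num_frames)]
        for t, q in enumerate(seq):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                      for k in seq]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            out[t][p] = [sum(e / z * v[d] for e, v in zip(exps, seq))
                         for d in range(dim)]
    return out
```

If all frames are identical, the output equals the input, which is a quick sanity check that the module only blends features over time.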

The driving video undergoes preprocessing to mitigate identity leakage. A synthetic data generation framework is employed, where face-swapping and stylization techniques are applied to create videos with consistent motion and expressions but inconsistent identities. The stylized video is used during training to reduce the influence of facial details such as wrinkles. Additionally, data augmentation methods, including scaling and aspect ratio adjustments, are applied to the driving video. All non-face regions are masked out to focus the model on the facial motion.
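The masking and grayscale steps can be sketched on a tiny image stored as nested lists of RGB tuples. The summary says faces are detected with pyFacer; the sketch does not call that library and instead assumes the detector has already produced a bounding box, which is a simplification. Both function names are hypothetical.

```python
def mask_non_face(image, box):
    """Black out every pixel outside box = (top, left, bottom, right).

    The bottom and right bounds are exclusive. `image` is a list of rows,
    each row a list of (r, g, b) tuples.
    """
    top, left, bottom, right = box
    return [[px if top <= y < bottom and left <= x < right else (0, 0, 0)
             for x, px in enumerate(row)]
            for y, row in enumerate(image)]


def to_grayscale(image):
    """Replace each RGB pixel with its luma (ITU-R BT.601 weights)."""
    gray = []
    for row in image:
        new_row = []
        for r, g, b in row:
            luma = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((luma, luma, luma))
        gray.append(new_row)
    return gray
```

Grayscale conversion removes color cues without touching geometry; the size and aspect-ratio augmentations mentioned above would additionally resize the driving frames.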

Experiment

The evaluation validates cross-identity portrait generation by animating distinct reference frames using driving videos from multiple datasets. Initial tests on independent video sources confirm that the model accurately preserves background details and subject identity while faithfully transferring complex facial expressions and subtle head movements. A comparative assessment against a state-of-the-art baseline further highlights superior clarity in fine anatomical features and demonstrates overall animation quality on par with leading methods. These qualitative results collectively establish the model's robust generalization and competitive standing in cross-identity animation tasks.

The authors compare MegActor with existing methods on cross-identity video generation tasks. The model produces realistic portrait animations that preserve identity and detailed facial expressions, with results comparable to state-of-the-art methods and clearer output than EMO in areas such as teeth. Unlike several of the other models in the comparison, MegActor is released with open code and weights.

Overall, the qualitative assessments indicate that the model preserves subject identity and renders expressive features with clarity that matches or exceeds existing state-of-the-art methods. Because its code and weights are openly released, the framework is also more accessible than several of its competitors.
