il y a 4 heures

Table des matières

Résumé

Le montage d'expressions faciales à grain fin a longtemps été limité par un chevauchement sémantique intrinsèque. Pour y remédier, nous avons construit le jeu de données Flex Facial Expression (FFE), doté d'annotations affectives continues, et établi FFE-Bench pour évaluer la confusion structurelle, la précision du montage, la contrôlabilité linéaire ainsi que le compromis entre le montage de l'expression et la préservation de l'identité. Nous proposons PixelSmile, un cadre basé sur Diffusion qui désenchevêtre les sémantiques de l'expression grâce à un entraînement conjoint pleinement symétrique. PixelSmile associe une supervision par l'intensité à un apprentissage contrastif afin de générer des expressions plus intenses et plus distinctives, permettant un contrôle linéaire précis et stable des expressions par interpolation dans l'espace latent textuel. De nombreuses expériences démontrent que PixelSmile offre un désenchevêtrement supérieur et une préservation robuste de l'identité, confirmant son efficacité pour un montage d'expressions continu, contrôlable et à grain fin, tout en supportant naturellement un mélange fluide des expressions.

One-sentence Summary

Researchers from Fudan University and StepFun introduce PixelSmile, a Diffusion framework that resolves semantic overlap in facial editing by employing symmetric joint training and contrastive learning. This approach enables precise, continuous control over expression intensity and seamless blending while robustly preserving identity across real-world and anime domains.

Key Contributions

The paper introduces the Flex Facial Expression (FFE) dataset with continuous 12-dimensional affective annotations and establishes FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation.
A novel diffusion framework named PixelSmile is presented, which employs fully symmetric joint training and contrastive learning to disentangle overlapping expression semantics and produce stronger, more distinguishable facial expressions.
Experiments demonstrate that the proposed method achieves precise and stable linear expression control through textual latent interpolation while maintaining robust identity preservation and supporting smooth expression blending across real-world and anime domains.

Introduction

Facial expression editing is critical for realistic portrait manipulation in applications ranging from digital avatars to content creation, yet current diffusion-based models struggle with fine-grained control. Prior approaches rely on discrete one-hot labels that force continuous human emotions into rigid categories, causing semantic entanglement where similar expressions like fear and surprise become indistinguishable and leading to identity drift during editing. The authors address these limitations by introducing the Flex Facial Expression (FFE) dataset with continuous affective annotations and proposing PixelSmile, a diffusion framework that uses symmetric joint training and textual latent interpolation to disentangle overlapping emotions for precise, linearly controllable editing.

Dataset

Dataset Composition and Sources: The authors construct the FFE dataset through a four-stage pipeline to support fine-grained facial expression editing across real-world and anime domains. The final collection comprises 60,000 images, split evenly with 30,000 images per domain. The real-world subset draws from public portrait datasets like the Human Images Dataset and Matting Human Dataset, while the anime subset is curated from 207 productions covering 629 characters.
Key Details for Each Subset:
- Real Domain: Contains approximately 6,000 base identities featuring diverse demographics and scene compositions, including close-ups and full-body shots.
- Anime Domain: Includes stylized portraits from various styles such as CG, 2D, chibi, and manga, with a flatter age distribution but higher ambiguity in gender and age labels.
- Expression Variety: Both subsets utilize a structured prompt library for 12 target expressions (six basic and six extended emotions) decomposed into facial attributes to ensure anatomical consistency.
Data Usage and Training Strategy:
- Generation: The authors use the Nano Banana Pro image editing model to synthesize multiple target expressions with varying intensities for each base identity.
- Training Framework: A joint fully symmetric training approach is adopted where the model samples a source image and a confusing expression pair to construct triplets.
- Loss Components: The training objective combines a Flow-Matching loss for intensity alignment, a contrastive loss for expression separation, and an identity preservation loss.
- Ablation Study: For comparison, the authors also process the MEAD dataset by sampling front-view frames and mapping discrete intensity levels to continuous values.
Processing and Annotation Details:
- Continuous Annotation: Instead of one-hot labels, each image receives a 12-dimensional continuous score vector predicted by the Gemini 3 Pro vision-language model to capture semantic overlap between expressions.
- Quality Control: The pipeline includes automated face detection, manual verification for identity clarity, and consistency checks to remove ambiguous or low-confidence samples.
- Metadata Construction: Prompts based on Qwen3-VL-235B-A22B are used to extract structured semantic attributes, while no personal metadata is retained to ensure privacy and ethical compliance.

Method

The authors propose PixelSmile, a diffusion framework designed for fine-grained facial expression editing that addresses intrinsic semantic overlap. As shown in the figure below, the method leverages the Flex Facial Expression (FFE) dataset with continuous affective annotations to establish a benchmark for structural confusion and editing accuracy.

The core architecture builds upon a pretrained Multi-Modal Diffusion Transformer (MMDiT) with LoRA adaptation. To enable continuous intensity control, the authors introduce a textual latent interpolation mechanism. Given a neutral prompt $P_{\text{neu}}$ and a target expression prompt $P_{\text{tgt}}$ , the frozen text encoder maps them to embeddings $e_{\text{neu}}$ and $e_{\text{tgt}}$ . The residual direction $\Delta e = e_{\text{tgt}} - e_{\text{neu}}$ captures the semantic shift from neutral to the target expression. A continuous conditioning embedding is then constructed as $e_{\text{cond}}(\alpha) = e_{\text{neu}} + \alpha \cdot \Delta e$ , where $\alpha \in [0, 1]$ determines the expression strength.

Refer to the framework diagram for the complete training and inference pipeline. The training stage employs a Fully Symmetric Joint Training framework to disentangle expression semantics. The model is fine-tuned under a symmetric dual-branch scheme where a pair of confusing expressions is optimized jointly. The overall objective combines a score-supervised Flow Matching loss, which explicitly couples the interpolation coefficient with visual transformation, and a symmetric contrastive loss. The contrastive loss pulls generated samples toward their target expression while pushing them away from confusing counterparts in the feature space. Additionally, an identity preservation loss based on ArcFace is applied to stabilize biometric features during strong intensity extrapolation.

Experiment

The FFE-Bench benchmark was established to evaluate facial expression editing across structural confusion, the trade-off between expression strength and identity preservation, control linearity, and editing accuracy.
Comparisons with general editing models demonstrate that the proposed method achieves superior expression editing accuracy and significantly reduces semantic confusion between similar expressions while maintaining natural identity fidelity.
Evaluations against linear control models reveal that the method enables smooth, monotonic expression transitions across a wide intensity range, effectively avoiding the identity collapse or unstable responses seen in existing baselines.
Ablation studies confirm that identity loss is essential for preventing facial attribute drift, while contrastive loss is critical for disentangling semantically overlapping expressions, with a symmetric training framework providing the most stable optimization.
User studies and qualitative analyses validate that the method offers the best balance between editing continuity and identity consistency, while also supporting the generation of plausible compound expressions through linear interpolation.

PDF source Voir le code

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

il y a 4 heures

Jiabin Hua Hengyuan Xu Aojie Li Wei Cheng Gang Yu Xingjun Ma Yu-Gang Jiang

Table des matières

Résumé

One-sentence Summary

Key Contributions

The paper introduces the Flex Facial Expression (FFE) dataset with continuous 12-dimensional affective annotations and establishes FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation.
A novel diffusion framework named PixelSmile is presented, which employs fully symmetric joint training and contrastive learning to disentangle overlapping expression semantics and produce stronger, more distinguishable facial expressions.
Experiments demonstrate that the proposed method achieves precise and stable linear expression control through textual latent interpolation while maintaining robust identity preservation and supporting smooth expression blending across real-world and anime domains.

Introduction

Dataset

Dataset Composition and Sources: The authors construct the FFE dataset through a four-stage pipeline to support fine-grained facial expression editing across real-world and anime domains. The final collection comprises 60,000 images, split evenly with 30,000 images per domain. The real-world subset draws from public portrait datasets like the Human Images Dataset and Matting Human Dataset, while the anime subset is curated from 207 productions covering 629 characters.
Key Details for Each Subset:
- Real Domain: Contains approximately 6,000 base identities featuring diverse demographics and scene compositions, including close-ups and full-body shots.
- Anime Domain: Includes stylized portraits from various styles such as CG, 2D, chibi, and manga, with a flatter age distribution but higher ambiguity in gender and age labels.
- Expression Variety: Both subsets utilize a structured prompt library for 12 target expressions (six basic and six extended emotions) decomposed into facial attributes to ensure anatomical consistency.
Data Usage and Training Strategy:
- Generation: The authors use the Nano Banana Pro image editing model to synthesize multiple target expressions with varying intensities for each base identity.
- Training Framework: A joint fully symmetric training approach is adopted where the model samples a source image and a confusing expression pair to construct triplets.
- Loss Components: The training objective combines a Flow-Matching loss for intensity alignment, a contrastive loss for expression separation, and an identity preservation loss.
- Ablation Study: For comparison, the authors also process the MEAD dataset by sampling front-view frames and mapping discrete intensity levels to continuous values.
Processing and Annotation Details:
- Continuous Annotation: Instead of one-hot labels, each image receives a 12-dimensional continuous score vector predicted by the Gemini 3 Pro vision-language model to capture semantic overlap between expressions.
- Quality Control: The pipeline includes automated face detection, manual verification for identity clarity, and consistency checks to remove ambiguous or low-confidence samples.
- Metadata Construction: Prompts based on Qwen3-VL-235B-A22B are used to extract structured semantic attributes, while no personal metadata is retained to ensure privacy and ethical compliance.

Method

Experiment

The FFE-Bench benchmark was established to evaluate facial expression editing across structural confusion, the trade-off between expression strength and identity preservation, control linearity, and editing accuracy.
Comparisons with general editing models demonstrate that the proposed method achieves superior expression editing accuracy and significantly reduces semantic confusion between similar expressions while maintaining natural identity fidelity.
Evaluations against linear control models reveal that the method enables smooth, monotonic expression transitions across a wide intensity range, effectively avoiding the identity collapse or unstable responses seen in existing baselines.
Ablation studies confirm that identity loss is essential for preventing facial attribute drift, while contrastive loss is critical for disentangling semantically overlapping expressions, with a symmetric training framework providing the most stable optimization.
User studies and qualitative analyses validate that the method offers the best balance between editing continuity and identity consistency, while also supporting the generation of plausible compound expressions through linear interpolation.

PDF source Voir le code

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

Command Palette

PixelSmile : Vers une édition fine des expressions faciales

Jiabin Hua Hengyuan Xu Aojie Li Wei Cheng Gang Yu Xingjun Ma Yu-Gang Jiang

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

PixelSmile : Vers une édition fine des expressions faciales

Jiabin Hua Hengyuan Xu Aojie Li Wei Cheng Gang Yu Xingjun Ma Yu-Gang Jiang

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

PixelSmile : Vers une édition fine des expressions faciales

Jiabin Hua Hengyuan Xu Aojie Li Wei Cheng Gang Yu Xingjun Ma Yu-Gang Jiang

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters