CaricatureGS: 3D Gaussian Splatting Face Exaggeration Using Gaussian Curvature
Eldad Matmon, Amit Bracha, Noam Rotstein, Ron Kimmel
Abstract
We present a photorealistic and controllable framework for 3D face caricaturization. Our starting point is a surface exaggeration technique based on intrinsic Gaussian curvature, which tends to produce overly smooth renderings when combined with texture. To address this, we adopt 3D Gaussian Splatting (3DGS), which has recently been shown to enable photorealistic generation of free-viewpoint avatars. Given a multiview video sequence as input, we extract a FLAME mesh and obtain the exaggerated geometry by solving a curvature-weighted Poisson equation. Directly deforming the Gaussians, however, yields inferior results, so we introduce a method that synthesizes pseudo-ground-truth caricature images by warping each frame into the exaggerated 2D representation via local affine transformations. We then design a training scheme that alternates supervision between real and synthesized images, allowing a single set of Gaussians to represent both the natural and the caricatured avatar. This scheme improves reconstruction fidelity, enables local editing, and provides continuous control over caricature intensity. To achieve real-time deformation, we introduce an efficient interpolation between the original and exaggerated surfaces, and we theoretically analyze it, showing that its deviation from the closed-form solution is bounded. In both quantitative and qualitative evaluations, the method outperforms existing approaches, producing geometrically controllable, photorealistic caricature avatars.
One-sentence Summary
The authors from Technion – Israel Institute of Technology propose a photorealistic 3D caricaturization framework using 3D Gaussian Splatting with a curvature-weighted Poisson deformation and alternating real-synthetic supervision, enabling controllable, high-fidelity caricature avatars with real-time interpolation and local editability, outperforming prior methods in both geometry control and visual realism.
Key Contributions
- The paper addresses the challenge of generating photorealistic 3D caricature avatars by combining intrinsic Gaussian curvature-based surface exaggeration with 3D Gaussian Splatting (3DGS), overcoming the over-smoothing issue that arises when applying traditional geometric exaggeration to textured 3D meshes.
- It introduces a novel training scheme that alternates between real multiview images and pseudo-ground-truth caricature images, synthesized via per-triangle local affine transformations, enabling a single Gaussian set to represent both natural and exaggerated facial appearances with high fidelity.
- The method supports real-time, continuous control over caricature intensity through efficient interpolation between the original and exaggerated surfaces (see the sketch after this list), and demonstrates superior performance in both quantitative metrics and qualitative evaluations compared to prior approaches.
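As a minimal illustration of this intensity control, the sketch below linearly blends the original and fully exaggerated vertex positions. Linear blending is an assumption on our part: the paper only states that its efficient interpolant deviates from the closed-form (per-γ Poisson) solution by a bounded amount.

```python
import numpy as np

def blend_surfaces(V_orig, V_exag, t):
    """Continuous caricature-intensity control: t = 0 returns the original
    vertices, t = 1 the fully exaggerated ones. Linear blending is an
    illustrative stand-in for the paper's efficient interpolation."""
    t = float(np.clip(t, 0.0, 1.0))
    return (1.0 - t) * V_orig + t * V_exag
```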
Introduction
The authors leverage 3D Gaussian Splatting (3DGS) to create photorealistic caricature avatars by combining curvature-driven geometric deformation with a mesh-rigged 3DGS representation. This addresses a key limitation of prior work: most methods either focus on photorealistic rendering without exaggeration or apply caricature effects only to appearance, leaving geometry unchanged. Using per-triangle Local Affine Transforms (LAT) derived from the correspondence between the neutral FLAME mesh and its caricatured version, they generate pseudo-ground-truth image pairs that guide joint optimization of both neutral and exaggerated views. The main contribution is a unified framework in which a single set of 3D Gaussians learns to render both natural and exaggerated appearances while preserving identity and expression, enabling controllable, geometry-aware caricatures that remain photorealistic under large deformations.
Dataset
- The dataset is NeRSemble [20], a multi-view facial performance dataset captured using 16 synchronized high-resolution cameras arranged spatially around the subject.
- It includes 10 scripted sequences: 4 emotion-driven (EMO) and 6 expression-driven (EXP), plus one additional free self-reenactment sequence.
- The authors adopt the same train/validation/test split as in [21], ensuring consistency for fair comparison, with a training schedule of 120,000 iterations.
- Data is processed to support multi-view rendering and facial animation modeling; no explicit cropping is mentioned, and the original high-resolution camera captures are used as input.
- Metadata for each sequence is constructed based on the script type (EMO or EXP) and performance context, enabling controlled training and evaluation.
- Sequences are mixed during training in ratios that follow the original EMO/EXP split, with the full set of sequences contributing to both training and validation.
Method
The proposed framework for photorealistic and controllable 3D caricaturization leverages a multi-stage pipeline that integrates geometric deformation, pseudo-ground-truth generation, and a specialized training scheme for 3D Gaussian Splatting (3DGS). The pipeline begins with an input multiview video, from which a temporally consistent FLAME mesh is extracted; this mesh serves as the foundation for the subsequent stages.

The first stage, surface caricaturization, applies a curvature-driven deformation to the extracted FLAME mesh. This process is formulated as a weighted Poisson equation on the surface, where the weights are defined by the Gaussian curvature K(p) raised to a power γ, i.e., w_γ(p) = |K(p)|^γ. This formulation exaggerates facial features according to their intrinsic curvature, amplifying higher-curvature regions more strongly. The solution, S_γ, is obtained by solving a discrete least-squares problem built on the Laplace-Beltrami operator. To enable localized control, the method also supports constrained deformation by imposing boundary conditions on specific vertices, allowing targeted exaggerations.
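A minimal sketch of this step, assuming the libigl Python bindings (`igl.cotmatrix` and `igl.gaussian_curvature` are real igl functions; the specific weighting of the differential coordinates and the soft anchor constraints are our reading of the paper, not the authors' exact discretization):

```python
import igl                      # libigl Python bindings
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def exaggerate_surface(V, F, gamma=0.25, anchor_idx=None):
    """Solve L S = diag(w) L V in least squares: differential coordinates
    are amplified where |Gaussian curvature| is large (w = |K|^gamma)."""
    L = igl.cotmatrix(V, F)                   # sparse Laplace-Beltrami operator
    K = igl.gaussian_curvature(V, F)          # angle-defect curvature per vertex
    w = np.abs(K) ** gamma                    # curvature weights w_gamma(p)
    w /= w.mean()                             # keep the overall scale stable
    rhs = sp.diags(w) @ (L @ V)               # amplified differential coordinates
    A, b = L, rhs
    if anchor_idx is not None:
        # Soft positional constraints pin selected vertices, playing the role
        # of the boundary conditions used for localized, targeted exaggeration.
        I = sp.identity(V.shape[0], format="csr")
        A = sp.vstack([A, I[anchor_idx]])
        b = np.vstack([b, V[anchor_idx]])
    # Solve the least-squares system one coordinate axis at a time.
    return np.column_stack([spla.lsqr(A, b[:, c])[0] for c in range(3)])
```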
The second stage generates pseudo-ground-truth caricature images (GT*) to supervise the 3DGS training. Since real caricature images are unavailable, the authors synthesize GT* by warping the original input frames. This is achieved through Local Affine Transformations (LAT), which exploit the per-triangle correspondence between the original and deformed meshes. For each triangle in the deformed mesh, a unique affine map is computed to warp the corresponding pixels from the original image. To handle occlusions and ensure robustness, a 2D triangle-level mask is generated to identify unreliable regions, and a spatial mask is applied to freeze the parameters of Gaussians corresponding to areas like hair, which are difficult to warp reliably.
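A sketch of this warping step using OpenCV (`cv2.getAffineTransform`, `cv2.warpAffine`, and `cv2.fillConvexPoly` are standard OpenCV calls; the triangle arrays and the reliability mask are illustrative assumptions, and a real implementation would also skip back-facing or occluded triangles and warp only per-triangle bounding boxes for speed):

```python
import cv2
import numpy as np

def warp_frame_lat(image, src_tris, dst_tris):
    """Synthesize a pseudo-ground-truth frame by warping each projected
    triangle of the original mesh onto its deformed counterpart.
    src_tris / dst_tris: (T, 3, 2) arrays of 2D triangle vertices."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    valid = np.zeros((h, w), dtype=np.uint8)   # triangle-level reliability mask
    for src, dst in zip(src_tris, dst_tris):
        # Unique affine map taking the source triangle to the deformed one.
        M = cv2.getAffineTransform(src.astype(np.float32),
                                   dst.astype(np.float32))
        warped = cv2.warpAffine(image, M, (w, h))
        mask = np.zeros((h, w), dtype=np.uint8)
        cv2.fillConvexPoly(mask, np.round(dst).astype(np.int32), 1)
        out[mask == 1] = warped[mask == 1]     # paste the warped triangle
        valid |= mask
    return out, valid                          # valid == 0 marks unreliable pixels
```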
The third stage, CaricatureGS Training, involves optimizing a single set of 3D Gaussian primitives. These Gaussians are rigged to the original FLAME mesh, and their attributes (position, scale, rotation, opacity, and color) are updated based on a photometric loss. The key innovation is an alternating training scheme that stochastically switches between real input frames and the synthesized GT* images. This joint optimization allows the Gaussian set to learn both natural and caricatured appearances simultaneously. The use of masks during GT* steps prevents the propagation of artifacts from unreliable warping, while the real frames provide essential supervision to fill in occluded regions and preserve fine details like hair. This shared representation enables the model to generalize across a continuous range of caricature intensities.
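A minimal sketch of the alternating scheme in PyTorch; the batch keys, the renderer interface, and the plain masked L1 loss are illustrative assumptions rather than the authors' actual code:

```python
import torch

def training_step(gaussians, renderer, batch, optimizer, p_caric=0.5):
    """Stochastically supervise with either the real frame (original FLAME
    rig) or the warped pseudo-ground-truth GT* (exaggerated FLAME rig), so
    a single Gaussian set learns both appearances."""
    optimizer.zero_grad()
    if torch.rand(()) < p_caric:
        target = batch["gt_star"]              # synthesized caricature frame
        mesh = batch["flame_exaggerated"]      # rig Gaussians to deformed mesh
        mask = batch["reliable_mask"]          # drop unreliably warped regions
    else:
        target = batch["frame"]                # real multiview frame
        mesh = batch["flame"]
        mask = torch.ones_like(target[..., :1])  # assumes (H, W, C) layout
    render = renderer(gaussians, mesh, batch["camera"])
    loss = (mask * (render - target).abs()).mean()   # masked photometric loss
    loss.backward()
    optimizer.step()
    return loss.detach()
```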

Experiment
- Evaluated on the NeRSemble dataset along two main axes, photorealistic rendering and identity preservation, comparing against the SurFHead baseline with unconstrained exaggeration (γ_f = 0.25).
- Achieved superior performance across all metrics: CLIP-I, CLIP-D, CLIP-C, DINO, and SD, demonstrating better caricature intent alignment, identity preservation, and multi-view consistency.
- Outperformed diffusion-based mesh-free editing (GaussianEditor) in geometry stability, specular consistency, and multi-view coherence.
- Ablation confirmed that alternating supervision (original and GT* frames) is essential: training on either domain alone leads to overfitting and artifacts, while alternating enables smooth interpolation and high fidelity across caricature intensities.
- On NeRSemble dataset, achieved high-quality caricaturization with 256 test frames, 4 emotions, 6 expressions, and 10 subjects, using 120K training iterations on a single RTX 3090.
Results show that the proposed method outperforms SurFHead across all evaluated metrics, achieving higher CLIP-I, CLIP-D, CLIP-C, DINO, and SD scores. This indicates improved alignment with the intended caricature intent, better identity preservation, and greater consistency across views compared to the baseline.
