il y a 2 mois

Table des matières

Résumé

La création de contenus 3D a connu des progrès significatifs en termes de qualité et de vitesse. Bien que les modèles actuels à propagation avant soient capables de produire des objets 3D en quelques secondes, leur résolution est limitée par les calculs intensifs requis durant l'entraînement. Dans ce papier, nous introduisons le Large Multi-View Gaussian Model (LGM), un cadre novateur conçu pour générer des modèles 3D à haute résolution à partir de promps textuels ou d'images à vue unique. Nos principales contributions s'articulent autour de deux axes : 1) Représentation 3D : nous proposons des caractéristiques gaussiennes multi-vues comme une représentation efficace et puissante, pouvant être fusionnées pour permettre un rendu différentiable. 2) Architecture centrale 3D : nous présentons un U-Net asymétrique agissant comme une architecture à haut débit opérant sur des images multi-vues, pouvant être générées à partir de promps textuels ou d'images à vue unique grâce à des modèles de diffusion multi-vues. Des expériences étendues démontrent la fidélité élevée et l'efficacité de notre approche. Notamment, nous conservons une vitesse rapide de génération d'objets 3D en moins de 5 secondes tout en augmentant la résolution d'entraînement jusqu'à 512, permettant ainsi une génération de contenus 3D à haute résolution.

One-sentence Summary

Peking University, Nanyang Technological University, and Shanghai AI Lab propose LGM, a feed-forward framework that generates high-resolution 3D Gaussians in under 5 seconds by fusing multi-view images from text or single-view inputs via an asymmetric U-Net backbone, overcoming prior limitations in resolution and efficiency by leveraging Gaussian splatting and robust data augmentation, enabling fast, high-fidelity 3D content creation for applications in gaming, VR, and film.

Key Contributions

The paper addresses the challenge of high-resolution 3D content creation by introducing LGM, a feed-forward framework that generates detailed 3D Gaussians from single-view images or text prompts, overcoming the resolution limitations of prior feed-forward methods that rely on low-resolution triplane representations.
LGM proposes a novel asymmetric U-Net backbone operating on multi-view images to efficiently predict and fuse 3D Gaussian features, enabling end-to-end differentiable rendering and training at a resolution of 512—significantly higher than previous methods—while maintaining fast inference speeds of under 5 seconds.
The method demonstrates superior performance on both text-to-3D and image-to-3D tasks, achieving high-fidelity results with up to 65,536 Gaussians, and includes data augmentations to bridge the domain gap between real and synthesized multi-view images, along with a general mesh extraction pipeline for downstream applications.

Introduction

The authors leverage 3D Gaussian splatting as a high-fidelity, efficient representation to enable fast, high-resolution 3D content generation from text or single-view images. Prior feed-forward methods rely on triplane-based NeRF or transformer backbones, which limit resolution due to memory constraints and inefficient rendering, while optimization-based approaches like SDS are slow despite higher detail. The key contribution is a novel framework using an asymmetric U-Net to regress dense 3D Gaussians from multi-view inputs, enabling end-to-end training at 512×512 resolution and generating high-quality 3D models in under 5 seconds. The method integrates multi-view diffusion models for input synthesis, applies targeted data augmentations to handle domain gaps, and includes a general mesh extraction pipeline, achieving state-of-the-art speed and resolution in both text-to-3D and image-to-3D tasks.

Dataset

The dataset is derived from Objaverse, a large-scale 3D object repository, with filtering applied using a predefined list of words to exclude unwanted or low-quality scenes.
The filtering keywords include: flying, mountain, trash, featuring, a set of, a small, numerous, square, collection, broken, group, ceiling, wall, various, elements, splatter, resembling, landscape, stair, silhouette, garbage, debris, room, preview, floor, grass, house, beam, white, background, building, cube, box, frame, roof, structure.
For each scene, 100 camera views are generated along a spiral path on a spherical surface, ensuring diverse and uniform spatial coverage.
The camera radius is fixed at 1.5 units, and the vertical field-of-view is set to 49.1 degrees to maintain consistent perspective and depth.
The authors use these filtered and rendered views as part of the training data, combining them with other datasets in a carefully balanced mixture ratio.
No explicit cropping is applied; instead, the full rendered image is used, with metadata such as camera pose and scene ID constructed during the rendering pipeline to support training and evaluation.

Method

The authors leverage a two-step pipeline for high-resolution 3D content generation, beginning with the synthesis of multi-view images from text or image inputs using off-the-shelf models. Specifically, MVDream is employed for text-to-multi-view generation, while ImageDream handles image (and optionally text) inputs, both producing four orthogonally oriented views at a fixed elevation. These multi-view images serve as input to a core U-Net-based model designed to predict and fuse 3D Gaussians. The overall framework is illustrated in the pipeline diagram, where the initial multi-view generation step is followed by Gaussian generation and optional mesh extraction.

The central component of the framework is an asymmetric U-Net architecture, as shown in the detailed architecture diagram. This network takes four input images and their corresponding camera ray embeddings as input. The RGB values and ray embeddings are concatenated to form a 9-channel feature map for each pixel, which is then processed through the U-Net. The architecture incorporates residual layers and self-attention mechanisms, with self-attention applied at deeper layers to reduce memory consumption. To enable cross-view information propagation, the features from the four input views are flattened and concatenated before self-attention is applied. The output of the U-Net is a feature map with 14 channels, which is interpreted as the parameters of 3D Gaussians. The network is designed to be asymmetric, with a lower output resolution than the input, allowing for higher-resolution inputs while limiting the number of output Gaussians. The predicted Gaussian parameters, including position, scale, rotation, opacity, and color, are fused across the four views to form the final 3D Gaussian representation.

To ensure robustness during training, the authors implement a data augmentation strategy to bridge the domain gap between training data (rendered from the Objverse dataset) and inference data (synthesized by diffusion models). This includes grid distortion, where three of the four input views are randomly distorted to simulate the inconsistencies inherent in multi-view diffusion outputs, and orbital camera jitter, where the camera poses of the last three views are randomly rotated around the scene center to account for inaccuracies in ray embeddings. The model is trained using a differentiable renderer to supervise the generated Gaussians. At each training step, images are rendered from eight views—four input views and four novel views—and the loss is computed using mean square error on both the RGB and alpha images. Additionally, a VGG-based LPIPS loss is applied to the RGB images to improve perceptual quality.

Experiment

Image-to-3D: Compared with [47, 62], our method generates 3D Gaussians with higher visual quality and better content preservation, enabling smooth textured meshes with minimal quality loss. Multi-view setting effectively reduces blur in back views and flat geometry, improving detail in unseen views.
Text-to-3D: Outperforms [16, 47] in realism and efficiency, achieving better text alignment and avoiding multi-face issues due to multi-view diffusion modeling.
Diversity: Demonstrates high diversity in 3D generation from ambiguous text or single-view images, producing varied plausible objects across different random seeds.
User Study (Table 1): On 30 images, 20 volunteers rated 600 samples; our method scored highest in image consistency and overall quality, outperforming DreamGaussian [47] and TriplaneGaussian [62].
Ablation Study: Single-view input leads to poor back-view reconstruction and blurriness; data augmentation improves 3D consistency and camera pose accuracy; higher training resolution (512×512) yields finer details than 256×256.
Meshing: Our meshing method produces smoother surfaces than DreamGaussian [47], independent of underlying Gaussians, and is advantageous for relighting.
Limitations: Failures mainly stem from low-resolution (256×256) multi-view images, leading to inaccuracies in slender structures and issues with high elevation angles.

Results show that the proposed method, LGM, achieves the highest ratings in both image consistency and overall quality compared to DreamGaussian and TriplaneGaussian. The authors use a user study to evaluate the generated 3D Gaussians, where LGM outperforms the baselines, demonstrating superior alignment with the input image content and better overall model quality.

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

il y a 2 mois

Jiaxiang Tang Zhaoxi Chen Xiaokang Chen Tengfei Wang Gang Zeng Ziwei Liu

Table des matières

Résumé

One-sentence Summary

Key Contributions

The paper addresses the challenge of high-resolution 3D content creation by introducing LGM, a feed-forward framework that generates detailed 3D Gaussians from single-view images or text prompts, overcoming the resolution limitations of prior feed-forward methods that rely on low-resolution triplane representations.
LGM proposes a novel asymmetric U-Net backbone operating on multi-view images to efficiently predict and fuse 3D Gaussian features, enabling end-to-end differentiable rendering and training at a resolution of 512—significantly higher than previous methods—while maintaining fast inference speeds of under 5 seconds.
The method demonstrates superior performance on both text-to-3D and image-to-3D tasks, achieving high-fidelity results with up to 65,536 Gaussians, and includes data augmentations to bridge the domain gap between real and synthesized multi-view images, along with a general mesh extraction pipeline for downstream applications.

Introduction

Dataset

The dataset is derived from Objaverse, a large-scale 3D object repository, with filtering applied using a predefined list of words to exclude unwanted or low-quality scenes.
The filtering keywords include: flying, mountain, trash, featuring, a set of, a small, numerous, square, collection, broken, group, ceiling, wall, various, elements, splatter, resembling, landscape, stair, silhouette, garbage, debris, room, preview, floor, grass, house, beam, white, background, building, cube, box, frame, roof, structure.
For each scene, 100 camera views are generated along a spiral path on a spherical surface, ensuring diverse and uniform spatial coverage.
The camera radius is fixed at 1.5 units, and the vertical field-of-view is set to 49.1 degrees to maintain consistent perspective and depth.
The authors use these filtered and rendered views as part of the training data, combining them with other datasets in a carefully balanced mixture ratio.
No explicit cropping is applied; instead, the full rendered image is used, with metadata such as camera pose and scene ID constructed during the rendering pipeline to support training and evaluation.

Method

Experiment

Image-to-3D: Compared with [47, 62], our method generates 3D Gaussians with higher visual quality and better content preservation, enabling smooth textured meshes with minimal quality loss. Multi-view setting effectively reduces blur in back views and flat geometry, improving detail in unseen views.
Text-to-3D: Outperforms [16, 47] in realism and efficiency, achieving better text alignment and avoiding multi-face issues due to multi-view diffusion modeling.
Diversity: Demonstrates high diversity in 3D generation from ambiguous text or single-view images, producing varied plausible objects across different random seeds.
User Study (Table 1): On 30 images, 20 volunteers rated 600 samples; our method scored highest in image consistency and overall quality, outperforming DreamGaussian [47] and TriplaneGaussian [62].
Ablation Study: Single-view input leads to poor back-view reconstruction and blurriness; data augmentation improves 3D consistency and camera pose accuracy; higher training resolution (512×512) yields finer details than 256×256.
Meshing: Our meshing method produces smoother surfaces than DreamGaussian [47], independent of underlying Gaussians, and is advantageous for relighting.
Limitations: Failures mainly stem from low-resolution (256×256) multi-view images, leading to inaccuracies in slender structures and issues with high elevation angles.

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

Command Palette

LGM : Modèle Gaussien Multi-Vue de Grande Taille pour la Création de Contenus 3D Haute Résolution

Jiaxiang Tang Zhaoxi Chen Xiaokang Chen Tengfei Wang Gang Zeng Ziwei Liu

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

LGM : Modèle Gaussien Multi-Vue de Grande Taille pour la Création de Contenus 3D Haute Résolution

Jiaxiang Tang Zhaoxi Chen Xiaokang Chen Tengfei Wang Gang Zeng Ziwei Liu

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

LGM : Modèle Gaussien Multi-Vue de Grande Taille pour la Création de Contenus 3D Haute Résolution

Jiaxiang Tang Zhaoxi Chen Xiaokang Chen Tengfei Wang Gang Zeng Ziwei Liu

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters