2ヶ月前

概要

3Dコンテンツの作成は、品質と速度の両面で顕著な進展を遂げている。現在の前向き型（feed-forward）モデルは、数秒で3Dオブジェクトを生成可能であるが、トレーニング時に要する膨大な計算負荷により、解像度に制限が生じている。本論文では、テキストプロンプトまたは単一ビュー画像から高解像度の3Dモデルを生成することを目的とした、新たなフレームワーク「Large Multi-View Gaussian Model（LGM）」を提案する。本研究の核心的な洞察は以下の2点である：1）3D表現：マルチビュー・ガウス特徴量を、効率的かつ強力な表現として提案し、これを微分可能なレンダリングのために統合可能とする。2）3Dバックボーン：テキストまたは単一ビュー画像入力からマルチビュー拡散モデルを活用して生成されるマルチビュー画像を処理可能な、非対称U-Netを高スループットのバックボーンとして提案する。広範な実験により、本手法の高い忠実性と効率性が実証された。特に、5秒以内で3Dオブジェクトを生成する高速性を維持しつつ、トレーニング解像度を512まで向上させることで、高解像度3Dコンテンツの生成を実現した。

One-sentence Summary

Peking University, Nanyang Technological University, and Shanghai AI Lab propose LGM, a feed-forward framework that generates high-resolution 3D Gaussians in under 5 seconds by fusing multi-view images from text or single-view inputs via an asymmetric U-Net backbone, overcoming prior limitations in resolution and efficiency by leveraging Gaussian splatting and robust data augmentation, enabling fast, high-fidelity 3D content creation for applications in gaming, VR, and film.

Key Contributions

The paper addresses the challenge of high-resolution 3D content creation by introducing LGM, a feed-forward framework that generates detailed 3D Gaussians from single-view images or text prompts, overcoming the resolution limitations of prior feed-forward methods that rely on low-resolution triplane representations.
LGM proposes a novel asymmetric U-Net backbone operating on multi-view images to efficiently predict and fuse 3D Gaussian features, enabling end-to-end differentiable rendering and training at a resolution of 512—significantly higher than previous methods—while maintaining fast inference speeds of under 5 seconds.
The method demonstrates superior performance on both text-to-3D and image-to-3D tasks, achieving high-fidelity results with up to 65,536 Gaussians, and includes data augmentations to bridge the domain gap between real and synthesized multi-view images, along with a general mesh extraction pipeline for downstream applications.

Introduction

The authors leverage 3D Gaussian splatting as a high-fidelity, efficient representation to enable fast, high-resolution 3D content generation from text or single-view images. Prior feed-forward methods rely on triplane-based NeRF or transformer backbones, which limit resolution due to memory constraints and inefficient rendering, while optimization-based approaches like SDS are slow despite higher detail. The key contribution is a novel framework using an asymmetric U-Net to regress dense 3D Gaussians from multi-view inputs, enabling end-to-end training at 512×512 resolution and generating high-quality 3D models in under 5 seconds. The method integrates multi-view diffusion models for input synthesis, applies targeted data augmentations to handle domain gaps, and includes a general mesh extraction pipeline, achieving state-of-the-art speed and resolution in both text-to-3D and image-to-3D tasks.

Dataset

The dataset is derived from Objaverse, a large-scale 3D object repository, with filtering applied using a predefined list of words to exclude unwanted or low-quality scenes.
The filtering keywords include: flying, mountain, trash, featuring, a set of, a small, numerous, square, collection, broken, group, ceiling, wall, various, elements, splatter, resembling, landscape, stair, silhouette, garbage, debris, room, preview, floor, grass, house, beam, white, background, building, cube, box, frame, roof, structure.
For each scene, 100 camera views are generated along a spiral path on a spherical surface, ensuring diverse and uniform spatial coverage.
The camera radius is fixed at 1.5 units, and the vertical field-of-view is set to 49.1 degrees to maintain consistent perspective and depth.
The authors use these filtered and rendered views as part of the training data, combining them with other datasets in a carefully balanced mixture ratio.
No explicit cropping is applied; instead, the full rendered image is used, with metadata such as camera pose and scene ID constructed during the rendering pipeline to support training and evaluation.

Method

The authors leverage a two-step pipeline for high-resolution 3D content generation, beginning with the synthesis of multi-view images from text or image inputs using off-the-shelf models. Specifically, MVDream is employed for text-to-multi-view generation, while ImageDream handles image (and optionally text) inputs, both producing four orthogonally oriented views at a fixed elevation. These multi-view images serve as input to a core U-Net-based model designed to predict and fuse 3D Gaussians. The overall framework is illustrated in the pipeline diagram, where the initial multi-view generation step is followed by Gaussian generation and optional mesh extraction.

The central component of the framework is an asymmetric U-Net architecture, as shown in the detailed architecture diagram. This network takes four input images and their corresponding camera ray embeddings as input. The RGB values and ray embeddings are concatenated to form a 9-channel feature map for each pixel, which is then processed through the U-Net. The architecture incorporates residual layers and self-attention mechanisms, with self-attention applied at deeper layers to reduce memory consumption. To enable cross-view information propagation, the features from the four input views are flattened and concatenated before self-attention is applied. The output of the U-Net is a feature map with 14 channels, which is interpreted as the parameters of 3D Gaussians. The network is designed to be asymmetric, with a lower output resolution than the input, allowing for higher-resolution inputs while limiting the number of output Gaussians. The predicted Gaussian parameters, including position, scale, rotation, opacity, and color, are fused across the four views to form the final 3D Gaussian representation.

To ensure robustness during training, the authors implement a data augmentation strategy to bridge the domain gap between training data (rendered from the Objverse dataset) and inference data (synthesized by diffusion models). This includes grid distortion, where three of the four input views are randomly distorted to simulate the inconsistencies inherent in multi-view diffusion outputs, and orbital camera jitter, where the camera poses of the last three views are randomly rotated around the scene center to account for inaccuracies in ray embeddings. The model is trained using a differentiable renderer to supervise the generated Gaussians. At each training step, images are rendered from eight views—four input views and four novel views—and the loss is computed using mean square error on both the RGB and alpha images. Additionally, a VGG-based LPIPS loss is applied to the RGB images to improve perceptual quality.

Experiment

Image-to-3D: Compared with [47, 62], our method generates 3D Gaussians with higher visual quality and better content preservation, enabling smooth textured meshes with minimal quality loss. Multi-view setting effectively reduces blur in back views and flat geometry, improving detail in unseen views.
Text-to-3D: Outperforms [16, 47] in realism and efficiency, achieving better text alignment and avoiding multi-face issues due to multi-view diffusion modeling.
Diversity: Demonstrates high diversity in 3D generation from ambiguous text or single-view images, producing varied plausible objects across different random seeds.
User Study (Table 1): On 30 images, 20 volunteers rated 600 samples; our method scored highest in image consistency and overall quality, outperforming DreamGaussian [47] and TriplaneGaussian [62].
Ablation Study: Single-view input leads to poor back-view reconstruction and blurriness; data augmentation improves 3D consistency and camera pose accuracy; higher training resolution (512×512) yields finer details than 256×256.
Meshing: Our meshing method produces smoother surfaces than DreamGaussian [47], independent of underlying Gaussians, and is advantageous for relighting.
Limitations: Failures mainly stem from low-resolution (256×256) multi-view images, leading to inaccuracies in slender structures and issues with high elevation angles.

Results show that the proposed method, LGM, achieves the highest ratings in both image consistency and overall quality compared to DreamGaussian and TriplaneGaussian. The authors use a user study to evaluate the generated 3D Gaussians, where LGM outperforms the baselines, demonstrating superior alignment with the input image content and better overall model quality.

ソースPDF

AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助

すぐに使える GPU

最適な料金体系

開始する料金を見る

HyperAI Newsletters

最新情報を購読する

北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします

メール配信サービスは MailChimp によって提供されています

2ヶ月前

Jiaxiang Tang Zhaoxi Chen Xiaokang Chen Tengfei Wang Gang Zeng Ziwei Liu

概要

One-sentence Summary

Key Contributions

The paper addresses the challenge of high-resolution 3D content creation by introducing LGM, a feed-forward framework that generates detailed 3D Gaussians from single-view images or text prompts, overcoming the resolution limitations of prior feed-forward methods that rely on low-resolution triplane representations.
LGM proposes a novel asymmetric U-Net backbone operating on multi-view images to efficiently predict and fuse 3D Gaussian features, enabling end-to-end differentiable rendering and training at a resolution of 512—significantly higher than previous methods—while maintaining fast inference speeds of under 5 seconds.
The method demonstrates superior performance on both text-to-3D and image-to-3D tasks, achieving high-fidelity results with up to 65,536 Gaussians, and includes data augmentations to bridge the domain gap between real and synthesized multi-view images, along with a general mesh extraction pipeline for downstream applications.

Introduction

Dataset

The dataset is derived from Objaverse, a large-scale 3D object repository, with filtering applied using a predefined list of words to exclude unwanted or low-quality scenes.
The filtering keywords include: flying, mountain, trash, featuring, a set of, a small, numerous, square, collection, broken, group, ceiling, wall, various, elements, splatter, resembling, landscape, stair, silhouette, garbage, debris, room, preview, floor, grass, house, beam, white, background, building, cube, box, frame, roof, structure.
For each scene, 100 camera views are generated along a spiral path on a spherical surface, ensuring diverse and uniform spatial coverage.
The camera radius is fixed at 1.5 units, and the vertical field-of-view is set to 49.1 degrees to maintain consistent perspective and depth.
The authors use these filtered and rendered views as part of the training data, combining them with other datasets in a carefully balanced mixture ratio.
No explicit cropping is applied; instead, the full rendered image is used, with metadata such as camera pose and scene ID constructed during the rendering pipeline to support training and evaluation.

Method

Experiment

Image-to-3D: Compared with [47, 62], our method generates 3D Gaussians with higher visual quality and better content preservation, enabling smooth textured meshes with minimal quality loss. Multi-view setting effectively reduces blur in back views and flat geometry, improving detail in unseen views.
Text-to-3D: Outperforms [16, 47] in realism and efficiency, achieving better text alignment and avoiding multi-face issues due to multi-view diffusion modeling.
Diversity: Demonstrates high diversity in 3D generation from ambiguous text or single-view images, producing varied plausible objects across different random seeds.
User Study (Table 1): On 30 images, 20 volunteers rated 600 samples; our method scored highest in image consistency and overall quality, outperforming DreamGaussian [47] and TriplaneGaussian [62].
Ablation Study: Single-view input leads to poor back-view reconstruction and blurriness; data augmentation improves 3D consistency and camera pose accuracy; higher training resolution (512×512) yields finer details than 256×256.
Meshing: Our meshing method produces smoother surfaces than DreamGaussian [47], independent of underlying Gaussians, and is advantageous for relighting.
Limitations: Failures mainly stem from low-resolution (256×256) multi-view images, leading to inaccuracies in slender structures and issues with high elevation angles.

ソースPDF

AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助

すぐに使える GPU

最適な料金体系

開始する料金を見る

HyperAI Newsletters

最新情報を購読する

北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします

メール配信サービスは MailChimp によって提供されています

Command Palette

LGM：高解像度3Dコンテンツ作成のための大型マルチビューGaussianモデル

Jiaxiang Tang Zhaoxi Chen Xiaokang Chen Tengfei Wang Gang Zeng Ziwei Liu

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters

Command Palette

LGM：高解像度3Dコンテンツ作成のための大型マルチビューGaussianモデル

Jiaxiang Tang Zhaoxi Chen Xiaokang Chen Tengfei Wang Gang Zeng Ziwei Liu

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters

Command Palette

LGM：高解像度3Dコンテンツ作成のための大型マルチビューGaussianモデル

Jiaxiang Tang Zhaoxi Chen Xiaokang Chen Tengfei Wang Gang Zeng Ziwei Liu

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters