UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement
Abstract
In this report, we present UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach follows a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To enable reliable 3D generation, we develop a complete data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures while preserving fine geometric details. To enable precise geometric refinement, we decouple spatial localization from geometric detail synthesis within the diffusion process. This is achieved through voxel-based refinement performed at fixed spatial locations, where voxel queries derived from the coarse geometry provide explicit positional anchors encoded via RoPE (Rotary Position Embedding), allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, yet achieves remarkable geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 compares favorably with existing open-source methods in both data processing quality and geometry generation. The source code and trained models will be released to support future research.
One-sentence Summary
The authors from Peking University, HKUST (Guangzhou), HKUST, NUS, and NTU propose UltraShape 1.0, a scalable 3D diffusion framework that decouples spatial localization from detail synthesis using RoPE-encoded voxel queries at fixed positions, enabling fine-grained geometry refinement within a structured solution space and achieving high-fidelity 3D generation from public datasets without proprietary data.
Key Contributions
- UltraShape 1.0 addresses the challenge of scalable, high-fidelity 3D geometry generation by introducing a two-stage diffusion framework that first synthesizes a coarse global structure and then refines it with detailed geometry, overcoming limitations in resolution and fine-grained detail modeling common in existing methods.
- The method decouples spatial localization from geometric detail synthesis by performing voxel-based refinement at fixed spatial locations, using RoPE-encoded positional anchors derived from coarse geometry to guide the diffusion process within a structured, reduced solution space.
- A novel watertight data processing pipeline improves the quality of public 3D datasets by removing low-quality samples, filling holes, and thickening thin structures while preserving fine details, and UltraShape 1.0 achieves competitive performance against state-of-the-art open-source and commercial methods on benchmark datasets.
Introduction
3D content generation is critical for applications ranging from entertainment and gaming to robotics and industrial design, yet producing high-fidelity, scalable 3D geometry remains challenging due to the scarcity of high-quality data and the computational demands of 3D representations. Prior methods face limitations in handling non-watertight inputs, suffer from geometric artifacts like double-layered surfaces or missing components, and struggle with scalability and fine-grained detail preservation—especially when using dense voxel grids or vector set representations that either lack spatial resolution or incur prohibitive memory costs. The authors introduce UltraShape 1.0, a two-stage diffusion framework that combines robust data curation with a scalable generative pipeline. It leverages a novel watertight geometry processing strategy to resolve topological ambiguities and ensure clean, high-quality training data, while employing a coarse-to-fine approach with voxel-conditioned diffusion to enable stable, fine-grained refinement. By decoupling spatial localization from geometric detail synthesis through structured voxel queries, UltraShape 1.0 achieves superior geometric fidelity and scalability, addressing key bottlenecks in existing 3D generation methods.
Dataset
- The dataset is constructed from 120K filtered samples drawn from Objaverse, serving as the primary source for training and evaluation.
- For each object, approximately 600K surface points are sampled uniformly across the mesh, with increased density in high-curvature regions to preserve fine geometric details; these points are used as input to the VAE encoder.
- Supervision points total around 1M per object and include: uniformly sampled points near the surface, curvature-aware sharp points, and random samples in free space; signed distance function (SDF) values are computed for all supervision points to define the reconstruction loss (a sampling sketch follows this list).
- Image rendering is performed using Blender’s Cycles renderer with orthographic projection, generating 16 images per object: eight from near-frontal viewpoints and eight from randomly sampled orientations to ensure viewpoint diversity.
- All images are rendered at 1024×1024 resolution, with random environment maps selected during rendering to augment lighting conditions and improve visual robustness (a rendering sketch follows this list).
- The VAE for refinement is initialized from Hunyuan3D-2.1 and fine-tuned for 55K steps with uniform query perturbations in the range [-1/128, 1/128]. Training proceeds in two phases: 40K steps with 4096 tokens, followed by 15K steps with 8192 tokens to improve stability and support higher token counts (see the perturbation sketch after this list).
- The diffusion transformer (DiT) for geometry refinement is also initialized from Hunyuan3D-2.1 and trained on this dataset using a progressive multi-stage strategy: starting with 4096 tokens at 518 resolution for 10K steps, then 8192 tokens at 1022 resolution for 15K steps, and finally 10240 tokens at 1022 resolution for 60K steps.
- Training is conducted on 8 NVIDIA H20 GPUs with a batch size of 32, using a voxel resolution of 128 for both training and inference.
- During inference, the model uses 32,768 tokens and 1022×1022 image resolution, with token masking applied unless otherwise specified.
- The authors emphasize that high-quality input RGBA images are critical—accurate foreground segmentation and clean backgrounds without shadows are essential to avoid degradation in generated geometry, underscoring the importance of robust image pre-processing in image-conditioned 3D generation.
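The point-sampling bullets above can be approximated with off-the-shelf mesh tooling. The sketch below is a minimal illustration using trimesh, assuming a watertight mesh normalized to [-1, 1]³; the exact sampling densities, curvature heuristic, and SDF sign convention used by the authors are not specified, so the helper and its parameters are hypothetical.

```python
import numpy as np
import trimesh
from trimesh.curvature import discrete_gaussian_curvature_measure
from trimesh.proximity import signed_distance

def sample_training_points(mesh: trimesh.Trimesh,
                           n_surface=600_000,
                           n_supervision=1_000_000,
                           near_std=0.01,
                           curvature_radius=0.02):
    """Hypothetical sampler mirroring the dataset bullets above."""
    # 1) Surface samples (VAE encoder input): oversample uniformly, then keep
    #    points with a bias toward high-curvature (sharp) regions.
    cand, _ = trimesh.sample.sample_surface(mesh, 2 * n_surface)
    curv = np.abs(discrete_gaussian_curvature_measure(mesh, cand, curvature_radius))
    weights = 1.0 + curv / (curv.mean() + 1e-8)
    idx = np.random.choice(len(cand), n_surface, p=weights / weights.sum())
    surface_pts = cand[idx]

    # 2) Supervision points: near-surface jitter around the (curvature-biased)
    #    surface samples plus random free-space samples.
    near = surface_pts[:n_supervision // 2] + np.random.normal(0, near_std, (n_supervision // 2, 3))
    free = np.random.uniform(-1.0, 1.0, (n_supervision // 2, 3))
    supervision_pts = np.concatenate([near, free], axis=0)

    # 3) SDF labels (trimesh convention: positive inside, negative outside).
    #    Note: querying ~1M points this way is slow and only meant as a sketch.
    sdf = signed_distance(mesh, supervision_pts)
    return surface_pts, supervision_pts, sdf
```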
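Rendering with Blender's Python API roughly follows the recipe in the rendering bullets. The snippet below is a hedged sketch, not the authors' actual script: camera placement, the near-frontal/random split, and environment-map handling are placeholder logic.

```python
import math
import random
import bpy

def render_views(output_dir, env_maps, n_views=16):
    scene = bpy.context.scene
    scene.render.engine = "CYCLES"
    scene.render.resolution_x = 1024
    scene.render.resolution_y = 1024
    scene.render.film_transparent = True               # RGBA output, clean background
    scene.render.image_settings.color_mode = "RGBA"

    cam = bpy.data.objects["Camera"]
    cam.data.type = "ORTHO"                             # orthographic projection

    # Random environment map per render to vary lighting.
    scene.world.use_nodes = True
    nodes = scene.world.node_tree.nodes
    env_node = nodes.new("ShaderNodeTexEnvironment")
    scene.world.node_tree.links.new(env_node.outputs["Color"],
                                    nodes["Background"].inputs["Color"])

    for i in range(n_views):
        env_node.image = bpy.data.images.load(random.choice(env_maps))
        if i < 8:   # near-frontal viewpoints (placeholder angular ranges)
            azim = math.radians(random.uniform(-30, 30))
            elev = math.radians(random.uniform(-15, 15))
        else:       # randomly sampled orientations
            azim = random.uniform(0, 2 * math.pi)
            elev = random.uniform(-math.pi / 3, math.pi / 3)
        r = 2.0
        cam.location = (r * math.cos(elev) * math.cos(azim),
                        r * math.cos(elev) * math.sin(azim),
                        r * math.sin(elev))
        cam.rotation_euler = (math.pi / 2 - elev, 0.0, azim + math.pi / 2)
        scene.render.filepath = f"{output_dir}/view_{i:02d}.png"
        bpy.ops.render.render(write_still=True)
```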
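For the VAE fine-tuning described above, the uniform query perturbation in [-1/128, 1/128] amounts to jittering each query point within roughly one cell of a 128³ grid. A minimal PyTorch sketch, assuming query coordinates normalized to [-1, 1]³:

```python
import torch

def perturb_queries(queries: torch.Tensor, voxel_res: int = 128) -> torch.Tensor:
    """Jitter surface query points uniformly within +/- 1/voxel_res.

    queries: (B, N, 3) coordinates assumed to lie in [-1, 1]^3.
    """
    eps = 1.0 / voxel_res
    noise = torch.empty_like(queries).uniform_(-eps, eps)
    return (queries + noise).clamp_(-1.0, 1.0)
```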
Method
The authors leverage a two-stage coarse-to-fine framework for 3D geometry generation, designed to balance global structural coherence with fine-grained geometric detail synthesis. This approach is structured to first generate a coarse global shape and then refine it to produce high-fidelity geometry. The overall pipeline is illustrated in the framework diagram, which shows the progression from input image tokens through two distinct stages of generation.

In the first stage, the model generates a coarse representation of the object's overall structure. This is achieved using a DiT-based 3D generation model operating on a vector set representation, which provides a compact and expressive encoding of global object geometry. The output of this stage is a coarse mesh, which serves as a semantically meaningful geometric prior for the subsequent refinement stage. The coarse mesh is then voxelized and sampled to generate voxel queries that define fixed spatial locations for the refinement process.
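One way to obtain fixed-location voxel queries from the coarse mesh is sketched below with trimesh; the grid resolution of 128 matches the training setup, but the authors' exact voxelization, normalization, and query-sampling procedure may differ.

```python
import numpy as np
import trimesh

def voxel_queries_from_coarse(mesh: trimesh.Trimesh, resolution: int = 128,
                              max_queries: int = 32768) -> np.ndarray:
    """Return center coordinates of occupied voxels on a fixed grid."""
    # Normalize the coarse mesh into [-1, 1]^3 so the grid is shared across shapes.
    mesh = mesh.copy()
    mesh.apply_translation(-mesh.bounding_box.centroid)
    mesh.apply_scale(2.0 / max(mesh.extents))

    # Voxelize with pitch 2 / resolution; occupied voxel centers become the
    # fixed spatial anchors ("voxel queries") for the refinement stage.
    grid = mesh.voxelized(pitch=2.0 / resolution)
    centers = grid.points                               # (K, 3) occupied voxel centers

    if len(centers) > max_queries:                      # subsample dense shapes
        idx = np.random.choice(len(centers), max_queries, replace=False)
        centers = centers[idx]
    return centers
```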
The second stage focuses on geometric detail refinement. To address the limitations of vector set-based methods, which often struggle with fine-grained detail due to large, unstructured latent spaces and the coupling of positional and geometric information, the authors decouple spatial localization from geometric detail synthesis. This is accomplished by performing diffusion-based refinement on voxel queries defined over a fixed-resolution grid. The coarse geometry provides explicit spatial anchors for refinement, with voxel queries derived from the coarse shape defining fixed spatial locations. These coordinates are encoded using rotary positional embeddings (RoPE), which inject spatial information into the model at each layer. By explicitly specifying spatial localization, the diffusion model is able to focus on synthesizing local geometric details rather than jointly modeling global positioning and shape, leading to improved convergence and finer geometric refinement.
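A hedged sketch of how rotary embeddings can inject the fixed voxel coordinates into attention: split the channel dimension across the x/y/z axes and rotate query/key features by angles proportional to each coordinate. The authors' exact RoPE formulation, frequency schedule, and channel split are not given here, so the function below is illustrative.

```python
import torch

def rope_3d(x: torch.Tensor, coords: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply axis-wise rotary position embedding to attention features.

    x:      (B, N, D) query or key features; D must be divisible by 6.
    coords: (B, N, 3) voxel-query coordinates (e.g. grid indices or positions).
    """
    B, N, D = x.shape
    d_axis = D // 3                                        # channels per spatial axis
    freqs = base ** (-torch.arange(0, d_axis, 2, device=x.device) / d_axis)

    out = []
    for axis in range(3):
        angles = coords[..., axis:axis + 1] * freqs        # (B, N, d_axis/2)
        cos, sin = angles.cos(), angles.sin()
        xa = x[..., axis * d_axis:(axis + 1) * d_axis]
        x1, x2 = xa[..., 0::2], xa[..., 1::2]              # interleaved channel pairs
        rot = torch.stack([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], dim=-1).flatten(-2)
        out.append(rot)
    return torch.cat(out, dim=-1)
```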
To support voxel-based refinement, the shape VAE is extended to decode geometry at off-surface locations. During training, surface queries are augmented with bounded spatial perturbations, enabling the decoder to predict valid volumetric geometry. At inference time, voxel queries sampled from the coarse geometry are aligned with latent tokens and refined through a diffusion process. The denoised latent representation is decoded into an SDF field on a regular grid, from which the final surface is extracted using marching cubes. The refinement stage employs a DiT architecture with self-attention over latent tokens, and image conditioning is incorporated through cross-attention using DINOv2 features. An image token masking strategy is applied to suppress irrelevant background information, ensuring robust and semantically aligned geometry refinement.
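The final extraction step described above (decoding an SDF on a regular grid and running marching cubes) looks roughly like the following. Here `vae_decoder` is a placeholder for the shape VAE's decoder, and the grid resolution and chunk size are illustrative choices rather than the authors' settings.

```python
import torch
from skimage.measure import marching_cubes

@torch.no_grad()
def extract_mesh(vae_decoder, latents, resolution=256, chunk=262144):
    """Decode an SDF on a regular grid from refined latents, then run marching cubes."""
    # Regular grid of query points covering the normalized volume [-1, 1]^3.
    axis = torch.linspace(-1.0, 1.0, resolution)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    queries = grid.reshape(-1, 3).to(latents.device)

    # Query the decoder in chunks to keep memory bounded.
    sdf = torch.cat([vae_decoder(queries[i:i + chunk].unsqueeze(0), latents).squeeze(0)
                     for i in range(0, queries.shape[0], chunk)])
    sdf = sdf.reshape(resolution, resolution, resolution).cpu().numpy()

    # Zero level set -> triangle mesh; rescale vertices back to [-1, 1]^3.
    verts, faces, _, _ = marching_cubes(sdf, level=0.0)
    verts = verts / (resolution - 1) * 2.0 - 1.0
    return verts, faces
```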
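The image token masking strategy can be illustrated as follows: derive a per-patch foreground mask from the RGBA alpha channel and mask out background patch tokens before cross-attention. This is a hedged reconstruction; the patch size of 14 is an assumption based on DINOv2 and the 518/1022 input resolutions, and the threshold is illustrative.

```python
import torch
import torch.nn.functional as F

def mask_background_tokens(patch_tokens: torch.Tensor, alpha: torch.Tensor,
                           patch_size: int = 14, keep_thresh: float = 0.05) -> torch.Tensor:
    """Keep only image tokens whose patch overlaps the foreground.

    patch_tokens: (B, num_patches, C) DINOv2 patch features (no CLS token).
    alpha:        (B, 1, H, W) foreground alpha in [0, 1] from the RGBA input.
    Returns a boolean mask (B, num_patches) usable as a cross-attention key mask.
    """
    # Average alpha over each patch; a patch counts as foreground if enough of
    # it is covered by the object.
    patch_alpha = F.avg_pool2d(alpha, kernel_size=patch_size, stride=patch_size)
    keep = patch_alpha.flatten(1) > keep_thresh            # (B, num_patches)
    assert keep.shape[1] == patch_tokens.shape[1]
    return keep
```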
Experiment
- Evaluated test-time scalability of the model, showing that increasing the number of latent tokens at inference improves reconstruction quality and enables high-fidelity geometry reconstruction.
- Demonstrated superior 3D generation performance compared to open-source state-of-the-art methods, producing detailed, sharp geometries with strong alignment to input condition images.
- Achieved generation quality comparable to commercial 3D models despite training on public data and limited resources, highlighting strong competitiveness.
- Showed favorable test-time scalability of the DiT in geometry generation, with increased shape and image tokens during inference leading to significantly improved geometric details and surface quality.