DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

Recent Diffusion Transformers (e.g., DiT) have demonstrated their effectiveness in generating high-quality 2D images. However, it remains unclear whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which can directly operate the denoising process on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces much higher quality generations. Specifically, DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. To reduce the computational cost of self-attention in 3D shape generation, we incorporate 3D window attention into the Transformer blocks, since the increased token length resulting from the additional voxel dimension would otherwise make attention expensive. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, our Transformer architecture supports efficient fine-tuning from 2D to 3D, where a DiT-2D checkpoint pre-trained on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy of the state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance.
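
To make the forward path described above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a voxelized point cloud is split into non-overlapping 3D patches, combined with a learnable 3D positional embedding, processed by plain Transformer blocks, and mapped to per-patch predictions by a linear head. All module names and sizes are illustrative assumptions; diffusion timestep conditioning, 3D window attention, and the devoxelization step are omitted for brevity.

import torch
import torch.nn as nn

class DiT3DSketch(nn.Module):
    # Illustrative stand-in for the DiT-3D backbone (hypothetical class name).
    def __init__(self, voxel_size=32, patch_size=4, dim=384, depth=6, heads=6):
        super().__init__()
        n_patches = (voxel_size // patch_size) ** 3
        # 3D patch embedding: non-overlapping voxel patches -> tokens
        self.patchify = nn.Conv3d(1, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable 3D positional embedding, one vector per patch token
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches, dim))
        # Plain Transformer blocks (full attention; 3D window attention omitted here)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               dim_feedforward=4 * dim,
                                               batch_first=True)
        self.blocks = nn.TransformerEncoder(enc_layer, num_layers=depth)
        # Linear head predicting per-patch voxel values, to be devoxelized into points
        self.head = nn.Linear(dim, patch_size ** 3)

    def forward(self, voxels):
        # voxels: (B, 1, V, V, V) voxelized point-cloud features
        tok = self.patchify(voxels)            # (B, dim, V/p, V/p, V/p)
        tok = tok.flatten(2).transpose(1, 2)   # (B, N, dim) token sequence
        tok = self.blocks(tok + self.pos_emb)  # plain Transformer blocks
        return self.head(tok)                  # per-patch prediction (devoxelization omitted)

if __name__ == "__main__":
    model = DiT3DSketch()
    x = torch.randn(2, 1, 32, 32, 32)
    print(model(x).shape)  # torch.Size([2, 512, 64])

Because the backbone is a plain token Transformer, the 2D-to-3D fine-tuning mentioned above amounts to reusing pre-trained block weights while replacing only the patch, positional, and output layers with their 3D counterparts.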