Uni3D: Exploring Unified 3D Representation at Scale

Scaling up representations for images and text has been extensively investigated in the past few years and has led to revolutions in learning vision and language. However, scalable representations for 3D objects and scenes remain relatively unexplored. In this work, we present Uni3D, a 3D foundation model that explores unified 3D representation at scale. Uni3D uses a 2D-initialized ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features. Through this simple architecture and pretext task, Uni3D can leverage abundant 2D pretrained models as initialization and image-text aligned models as targets, unlocking the great potential of 2D models and scaling-up strategies for the 3D world. We efficiently scale up Uni3D to one billion parameters and set new records on a broad range of 3D tasks, such as zero-shot classification, few-shot classification, open-world understanding, and part segmentation. We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild. We believe that Uni3D provides a new direction for exploring both the scaling and the efficiency of representations in the 3D domain.
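To make the pretext task concrete, the sketch below illustrates a CLIP-style contrastive (InfoNCE) objective of the kind the abstract describes: point cloud embeddings produced by the 3D ViT are aligned with pre-computed image-text embeddings, so that the i-th point cloud matches its own image/text pair and repels all others. This is a minimal, pure-Python illustration under our own assumptions (toy 2-D embeddings, a hypothetical `info_nce` helper, temperature 0.07), not the paper's actual implementation.

```python
import math

def l2_normalize(v):
    """Normalize a vector to unit length, as contrastive losses assume."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(pc_embeds, target_embeds, temperature=0.07):
    """CLIP-style contrastive loss (illustrative): the i-th point-cloud
    embedding should match the i-th image/text embedding and repel the rest."""
    pc = [l2_normalize(v) for v in pc_embeds]
    tg = [l2_normalize(v) for v in target_embeds]
    loss = 0.0
    for i, p in enumerate(pc):
        logits = [dot(p, t) / temperature for t in tg]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_sum - logits[i]  # -log softmax over the matched pair
    return loss / len(pc)

# Toy batch: matched pairs are identical embeddings, so the loss is near zero.
pc = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(pc, targets))
```

In the actual training setup described above, the target embeddings would come from a frozen image-text aligned model (e.g. CLIP), and only the 3D encoder producing `pc_embeds` would be updated.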