OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding
Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, Hao Su
Abstract
We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds. We adopt the commonly used multi-modal contrastive learning framework for representation alignment, but with a specific focus on scaling up 3D representations to enable open-world 3D shape understanding. To achieve this, we scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions. We also explore and compare strategies for scaling 3D backbone networks and introduce a novel hard negative mining module for more efficient training. We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than 10% for existing methods. OpenShape also achieves an accuracy of 85.3% on ModelNet40, outperforming previous zero-shot baseline methods by 20% and performing on par with some fully-supervised methods. Furthermore, we show that our learned embeddings encode a wide range of visual and semantic concepts (e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D and image-3D interactions. Due to their alignment with CLIP embeddings, our learned shape representations can also be integrated with off-the-shelf CLIP-based models for various applications, such as point cloud captioning and point cloud-conditioned image generation.
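To make the training and evaluation recipe concrete, the sketch below illustrates the general pattern the abstract describes: a trainable point-cloud encoder is aligned to frozen CLIP text/image embeddings with a symmetric contrastive loss, after which zero-shot classification reduces to cosine similarity against CLIP text embeddings of category prompts. This is a minimal illustration, not the authors' code; the `PointCloudEncoder` architecture, the 512-dimensional embedding size, and the temperature value are illustrative assumptions, and the hard negative mining module is omitted (only in-batch negatives are used here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointCloudEncoder(nn.Module):
    """Toy stand-in for a scaled-up 3D backbone; an assumption, not the paper's network."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) -> per-point features, max-pooled to one (B, D) shape embedding
        return self.mlp(points).max(dim=1).values


def contrastive_loss(shape_emb, clip_emb, temperature=0.07):
    """Symmetric InfoNCE loss: pull each shape toward its paired (frozen)
    CLIP text or image embedding, push it away from in-batch negatives."""
    shape_emb = F.normalize(shape_emb, dim=-1)
    clip_emb = F.normalize(clip_emb, dim=-1)
    logits = shape_emb @ clip_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits))               # diagonal = positive pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


@torch.no_grad()
def zero_shot_classify(encoder, points, text_embs):
    """Zero-shot classification: compare shape embeddings against CLIP text
    embeddings of category prompts (assumed precomputed) by cosine similarity."""
    shape = F.normalize(encoder(points), dim=-1)      # (B, D)
    text = F.normalize(text_embs, dim=-1)             # (C, D), C categories
    return (shape @ text.t()).argmax(dim=-1)          # (B,) predicted class indices


if __name__ == "__main__":
    enc = PointCloudEncoder()
    pts = torch.randn(4, 1024, 3)                     # batch of 4 point clouds
    clip_targets = torch.randn(4, 512)                # stand-in frozen CLIP embeddings
    print("loss:", contrastive_loss(enc(pts), clip_targets).item())
    print("preds:", zero_shot_classify(enc, pts, torch.randn(10, 512)))
```

Because the shape encoder is trained into CLIP's embedding space rather than a new one, the same `zero_shot_classify` routine works for any category list expressed as text prompts, which is what enables the open-world evaluation on the 1,156-category Objaverse-LVIS benchmark without 3D labels.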