Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng
Abstract

We introduce Point-Bind, a 3D multi-modality model that aligns point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D data and the other modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) that follows 3D multi-modal instructions. Using parameter-efficient fine-tuning, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA; it requires no 3D instruction data, yet exhibits superior 3D and multi-modal question-answering capacity. We hope our work sheds light on extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
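To make the "3D embedding arithmetic" idea concrete, the sketch below shows how composition and retrieval work once every modality lives in one shared embedding space. The encoders here (`point_encoder`, `audio_encoder`) are random stand-ins for illustration only, not the actual Point-Bind API; the point is the pattern of normalizing, adding, and retrieving by cosine similarity.

```python
# Hypothetical sketch of 3D embedding arithmetic in a joint embedding
# space. Encoder names and dimensions are illustrative assumptions,
# not the real Point-Bind interface.
import torch
import torch.nn.functional as F

EMBED_DIM = 1024  # assumed shared embedding width

# Stand-in encoders: in Point-Bind these would be the trained 3D encoder
# and the frozen ImageBind encoders, all mapping into one space.
point_encoder = torch.nn.Linear(3 * 2048, EMBED_DIM)  # 2048 xyz points, flattened
audio_encoder = torch.nn.Linear(128, EMBED_DIM)       # e.g., a pooled spectrogram

def embed(x: torch.Tensor, encoder: torch.nn.Module) -> torch.Tensor:
    """Encode and L2-normalize so embeddings compare by cosine similarity."""
    return F.normalize(encoder(x.flatten(1)), dim=-1)

# Embed a point cloud (say, a car) and a sound (say, an engine).
z_car = embed(torch.randn(1, 2048, 3), point_encoder)
z_engine = embed(torch.randn(1, 128), audio_encoder)

# Embedding arithmetic: add normalized embeddings to compose semantics,
# renormalize, then retrieve the nearest candidate 3D shape.
z_query = F.normalize(z_car + z_engine, dim=-1)

gallery = F.normalize(torch.randn(100, EMBED_DIM), dim=-1)  # 100 candidate shapes
best = (gallery @ z_query.T).squeeze(-1).argmax()
print(f"closest shape index: {best.item()}")
```

Because all embeddings are unit-normalized, addition acts like averaging semantic directions, which is what lets a 3D query be steered by an audio or text embedding before retrieval.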
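The abstract also says Point-LLM injects Point-Bind semantics into a frozen pre-trained LLM with parameter-efficient fine-tuning. A common recipe for this is to train only a small projector that maps the 3D embedding into a few prompt tokens in the LLM's input space; the sketch below follows that recipe under assumed dimensions and is not necessarily the exact Point-LLM architecture.

```python
# Hypothetical PEFT-style sketch: a small trainable projector feeds
# Point-Bind embeddings into a frozen LLM as prefix tokens. All names
# and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PointPrefixProjector(nn.Module):
    """Map one Point-Bind embedding to n_tokens prompt vectors that are
    prepended to the frozen LLM's text token embeddings."""
    def __init__(self, embed_dim: int = 1024, llm_dim: int = 4096, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.llm_dim = llm_dim
        self.proj = nn.Linear(embed_dim, llm_dim * n_tokens)  # the only trainable part

    def forward(self, point_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, embed_dim) -> (batch, n_tokens, llm_dim)
        return self.proj(point_embedding).view(-1, self.n_tokens, self.llm_dim)

projector = PointPrefixProjector()

# In practice the LLM (e.g., LLaMA) would be loaded and frozen, so only
# the projector receives gradients:
# for p in llm.parameters():
#     p.requires_grad = False

prefix = projector(torch.randn(1, 1024))  # tokens carrying 3D semantics
# inputs = torch.cat([prefix, text_token_embeddings], dim=1)  # fed to the frozen LLM
print(prefix.shape)  # torch.Size([1, 4, 4096])
```

This keeps the trainable parameter count tiny relative to the LLM, which is consistent with the abstract's claim that no 3D instruction data is needed: the projector aligns modalities while the language ability stays in the frozen weights.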
