
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, Jun Zhu
Published: 6/4/2025
Abstract

Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: https://github.com/JAMESYJL/ShapeLLM-Omni
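
To make the discrete-token idea concrete, the following is a minimal PyTorch sketch of the vector-quantization step such a 3D VQVAE relies on: continuous latents produced by a 3D shape encoder are snapped to their nearest codebook entries, yielding discrete shape-token ids that a language model can read and emit. The class name, codebook size, and latent dimensions here are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the paper's implementation): nearest-neighbor
# vector quantization, the core of a VQVAE's discrete latent space.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 8192, code_dim: int = 256):  # sizes are assumptions
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # z: (batch, num_latents, code_dim) continuous latents from a 3D encoder
        flat = z.reshape(-1, z.size(-1))                   # (batch * num_latents, code_dim)
        dists = torch.cdist(flat, self.codebook.weight)    # distance to every codebook entry
        token_ids = dists.argmin(dim=-1).view(z.shape[:-1])  # discrete "shape tokens"
        z_q = self.codebook(token_ids)                     # quantized latents for the decoder
        # straight-through estimator so gradients still reach the encoder
        z_q = z + (z_q - z).detach()
        return z_q, token_ids


if __name__ == "__main__":
    vq = VectorQuantizer()
    latents = torch.randn(2, 1024, 256)        # e.g. 1024 latent cells per shape
    z_q, ids = vq(latents)
    print(z_q.shape, ids.shape)                # (2, 1024, 256) and (2, 1024)
```

The resulting token ids can then be interleaved with ordinary text tokens, which is what allows a single autoregressive model to both understand and generate 3D assets in sequence.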