
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, Jun Zhu
Published: 6/4/2025
Abstract

Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: https://github.com/JAMESYJL/ShapeLLM-Omni
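
To make the discrete-token idea concrete, the following is a minimal PyTorch sketch of the vector-quantization step such a 3D VQVAE relies on: continuous latents produced by a 3D shape encoder are snapped to their nearest codebook entries, yielding discrete shape-token ids that a language model can read and emit. The class name, codebook size, and latent dimensions here are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the paper's implementation): nearest-neighbor
# vector quantization, the core of a VQVAE's discrete latent space.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 8192, code_dim: int = 256):  # sizes are assumptions
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # z: (batch, num_latents, code_dim) continuous latents from a 3D encoder
        flat = z.reshape(-1, z.size(-1))                   # (batch * num_latents, code_dim)
        dists = torch.cdist(flat, self.codebook.weight)    # distance to every codebook entry
        token_ids = dists.argmin(dim=-1).view(z.shape[:-1])  # discrete "shape tokens"
        z_q = self.codebook(token_ids)                     # quantized latents for the decoder
        # straight-through estimator so gradients still reach the encoder
        z_q = z + (z_q - z).detach()
        return z_q, token_ids


if __name__ == "__main__":
    vq = VectorQuantizer()
    latents = torch.randn(2, 1024, 256)        # e.g. 1024 latent cells per shape
    z_q, ids = vq(latents)
    print(z_q.shape, ids.shape)                # (2, 1024, 256) and (2, 1024)
```

The resulting token ids can then be interleaved with ordinary text tokens, which is what allows a single autoregressive model to both understand and generate 3D assets in sequence.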