Phantom of Latent for Large Language and Vision Models

The success of visual instruction tuning has accelerated the development of large language and vision models (LLVMs). Following the scaling laws of instruction-tuned large language models (LLMs), LLVMs have further increased their sizes, reaching 26B, 34B, and even 80B parameters. While this increase in model size has yielded significant performance gains, it demands substantially more hardware resources for both training and inference. Consequently, there is a strong need for efficient LLVMs that achieve the performance of larger models while being smaller in size. To address this need, we present Phantom, a new efficient LLVM family with model sizes of 0.5B, 1.8B, 3.8B, and 7B parameters, which significantly enhances learning capabilities within limited structures. By temporarily increasing the latent hidden dimension during multi-head self-attention (MHSA), we prepare LLVMs to look at and understand much more vision-language knowledge in the latent space, without substantially increasing their physical model sizes. To maximize this advantage, we introduce Phantom Optimization (PO), which combines autoregressive supervised fine-tuning (SFT) with a direct preference optimization (DPO)-like concept, effectively following correct answers while eliminating incorrect and ambiguous ones. Phantom outperforms numerous larger open- and closed-source LLVMs, positioning itself as a leading solution in the landscape of efficient LLVMs.
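To make the core idea concrete, below is a minimal PyTorch sketch of what "temporarily increasing the latent hidden dimension during MHSA" might look like: the query/key/value projections widen the hidden dimension, attention runs in that enlarged latent, and the output projection shrinks back to the original width. The class name, the expansion factor, and all parameter choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PhantomAttention(nn.Module):
    """Sketch: self-attention in a temporarily enlarged latent dimension.

    The residual stream keeps its original width (hidden_dim), so the
    surrounding model does not grow; only the attention computation
    happens in a wider latent space. Names/factors are assumptions.
    """

    def __init__(self, hidden_dim: int, num_heads: int, expand: float = 1.5):
        super().__init__()
        latent_dim = int(hidden_dim * expand)   # temporarily enlarged latent
        latent_dim -= latent_dim % num_heads    # keep divisible by head count
        self.num_heads = num_heads
        self.head_dim = latent_dim // num_heads
        self.qkv = nn.Linear(hidden_dim, 3 * latent_dim)  # expand on the way in
        self.out = nn.Linear(latent_dim, hidden_dim)      # shrink on the way out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):  # (b, t, latent) -> (b, heads, t, head_dim)
            return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim**0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)  # back to the original hidden width

# Usage: input and output widths match, despite the wider internal latent.
x = torch.randn(2, 16, 512)
print(PhantomAttention(512, num_heads=8)(x).shape)  # torch.Size([2, 16, 512])
```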
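Similarly, one plausible reading of Phantom Optimization is a joint objective: an autoregressive SFT loss that follows the correct answer, plus a DPO-style pairwise term that pushes probability mass away from incorrect or ambiguous answers. The sketch below is an assumption about how such a combination could be written; the function name, the `beta`/`lam` weights, and the exact combination are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def po_loss(logp_correct: torch.Tensor, logp_wrong: torch.Tensor,
            logp_ref_correct: torch.Tensor, logp_ref_wrong: torch.Tensor,
            sft_nll: torch.Tensor, beta: float = 0.1, lam: float = 1.0):
    """Sketch of an SFT + DPO-like objective (illustrative, not the paper's).

    Arguments are summed per-sequence log-probs of the correct and the
    incorrect/ambiguous answers under the policy and a frozen reference
    model, plus the standard SFT negative log-likelihood.
    """
    # DPO-like margin: reward following correct answers relative to the
    # reference, penalize mass on incorrect/ambiguous ones.
    margin = beta * ((logp_correct - logp_ref_correct)
                     - (logp_wrong - logp_ref_wrong))
    pref = -F.logsigmoid(margin).mean()  # pairwise preference term
    return sft_nll + lam * pref          # joint training objective
```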