VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun
Publication date: 5/13/2025
Abstract

With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention, as speech is one of the most common forms of daily communication. However, existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
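To make the core idea concrete, the sketch below illustrates multi-token prediction from a single backbone forward pass: a small head maps one hidden state to logits for several future audio tokens, so the expensive backbone does not have to run once per token. All names here (MCTPHead, hidden_dim, audio_vocab, k) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of multiple-token prediction from one forward pass.
# Names and shapes are assumed for illustration, not taken from VITA-Audio.
import torch
import torch.nn as nn


class MCTPHead(nn.Module):
    """Lightweight head mapping one hidden state to k audio-token logits."""

    def __init__(self, hidden_dim: int, audio_vocab: int, k: int = 4):
        super().__init__()
        self.k = k
        # One small projection per predicted position; far cheaper than
        # running the full backbone k extra times.
        self.proj = nn.ModuleList(
            nn.Linear(hidden_dim, audio_vocab) for _ in range(k)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) -- last hidden state from the backbone.
        # Returns (batch, k, audio_vocab): logits for k future audio tokens.
        return torch.stack([p(h) for p in self.proj], dim=1)


# Usage: one backbone pass yields k audio tokens instead of one, which is
# what shrinks time-to-first-audio in a streaming setting.
backbone_hidden = torch.randn(1, 4096)           # stand-in for an LLM hidden state
head = MCTPHead(hidden_dim=4096, audio_vocab=1024, k=4)
audio_tokens = head(backbone_hidden).argmax(-1)  # (1, 4) predicted token ids
```

Because the head is only a few linear layers, its cost is negligible next to the backbone, which is how a scheme like this can trade a small amount of per-token modeling capacity for a large reduction in first-token latency.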