VoiceAssistant-400K Voice Assistant Optimization Dataset
VoiceAssistant-400K is a dataset optimized for voice-assistant use. It aims to reduce the model's generation of code and other text symbols that are unsuitable for spoken output, improving the model's practicality in real applications. The dataset was developed to train and optimize the speech output of the Mini-Omni model and was released by a research team from Tsinghua University in 2024, together with the paper "Mini-Omni: Language Models Can Hear, Talk while Thinking in Streaming". Mini-Omni is an open-source multimodal large language model with real-time conversation capabilities and end-to-end speech input and output. Through a unique text-instructed parallel generation method, it achieves speech reasoning output consistent with its text capabilities using minimal additional data and modules.
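The "reduce code symbols" goal can be illustrated with a simple filter that flags responses awkward for text-to-speech. This is a hypothetical heuristic for illustration only, not part of the dataset's actual construction pipeline:

```python
import re

def is_tts_friendly(text: str) -> bool:
    """Heuristic check that a response avoids symbols awkward for TTS.

    Illustrative only: the patterns below (code fences, markdown
    bullets/headers, bare URLs) are assumed examples of content a
    voice assistant should not speak aloud.
    """
    patterns = [
        r"```",            # fenced code blocks
        r"^\s*[#*>-]\s",   # markdown headers, bullets, quotes
        r"https?://",      # bare URLs
    ]
    return not any(re.search(p, text, flags=re.M) for p in patterns)
```

A spoken-style reply like "Sure, here is the weather today." passes, while a reply containing a fenced code block or a raw URL is flagged.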
The VoiceAssistant-400K dataset supports a three-stage training process that optimizes the model's speech-to-text and text-to-speech adapters for voice-assistant service: modality alignment, adaptation training, and multimodal fine-tuning. In the modality alignment stage, the model's speech recognition and synthesis capabilities are trained on speech recognition and speech synthesis data. The adaptation training stage focuses on the model's text capabilities given audio input. The final multimodal fine-tuning stage fine-tunes the entire model on synthetic data to ensure the quality of multimodal output.
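The three stages above can be sketched as a schedule over which model components are updated. The component names and the freeze/unfreeze schedule here are illustrative assumptions, not the exact Mini-Omni training recipe:

```python
# Illustrative sketch: which components are trained in each stage.
# Component names and the schedule are assumptions for illustration,
# not the exact Mini-Omni configuration.

ALL_COMPONENTS = {"speech_adapters", "language_model", "audio_decoder"}

STAGES = {
    # Stage 1: modality alignment -- train the speech adapters on
    # ASR/TTS data while the core language model stays frozen.
    "modality_alignment": {"speech_adapters"},
    # Stage 2: adaptation training -- train the model's text output
    # given audio input.
    "adaptation_training": {"language_model"},
    # Stage 3: multimodal fine-tuning -- fine-tune the entire model
    # on synthetic data such as VoiceAssistant-400K.
    "multimodal_finetuning": ALL_COMPONENTS,
}

def trainable_components(stage: str) -> dict:
    """Return a component -> requires_grad map for the given stage."""
    active = STAGES[stage]
    return {name: name in active for name in sorted(ALL_COMPONENTS)}
```

In a real training loop, this map would drive `requires_grad` toggling per stage; the point is only that each stage widens or shifts the set of trainable components.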