Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.