Google Launches Gemini 3.1 Flash TTS Supporting Over 70 Languages
On April 15, 2026, Google officially unveiled Gemini 3.1 Flash TTS, a new text-to-speech model designed to deliver superior speech quality, expressiveness, and granular control for AI-generated audio. This latest iteration aims to empower developers, enterprises, and everyday users to build next-generation applications with more natural and nuanced voice interactions.
The model distinguishes itself through significant improvements in audio fidelity, achieving an Elo score of 1,211 on the Artificial Analysis TTS leaderboard, a benchmark driven by thousands of human preferences. Artificial Analysis has placed Gemini 3.1 Flash TTS in its most attractive quadrant, citing an ideal balance of high-quality output and cost efficiency. Key features include native support for multi-speaker dialogues and compatibility with over 70 languages, facilitating localized and immersive experiences on a global scale.
A defining innovation in this release is the introduction of audio tags, which allow users to direct vocal style, pacing, and delivery using natural language commands embedded directly within the text input. This functionality places developers in what the team describes as the "director's chair," enabling precise adjustments for specific scenarios. The system offers three primary control levels: scene direction, where environment and dialogue instructions set the context for natural character interactions; speaker-level specificity, which allows for unique audio profiles and real-time tone adjustments via director's notes; and seamless export capabilities that translate these settings into reusable Gemini API code for consistent voice performance across projects.
Google is rolling out the model in preview phases across its platforms. Developers can access the model through the Gemini API and experiment with it in Google AI Studio starting today. Enterprise users can test the technology on Vertex AI, while Google Workspace users can use the new capabilities through Google Vids.
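To make the control levels described above more concrete, the following is a minimal Python sketch of how scene direction, per-speaker director's notes, and inline delivery cues might be assembled into a single text input. It does not call the Gemini API. The bracketed cue syntax, the "Scene:" and "Director's note" labels, and every class and function name here are illustrative assumptions and not the documented audio tag format.

```python
from dataclasses import dataclass


@dataclass
class Speaker:
    name: str
    notes: str  # hypothetical director's notes describing voice and tone


@dataclass
class Line:
    speaker: str
    text: str
    tag: str = ""  # optional inline delivery cue, e.g. "whispering"


def build_script(scene: str, speakers: list[Speaker], lines: list[Line]) -> str:
    """Assemble scene direction, speaker notes, and tagged dialogue into one prompt."""
    known = {s.name for s in speakers}
    parts = [f"Scene: {scene}", ""]
    for s in speakers:
        parts.append(f"Director's note for {s.name}: {s.notes}")
    parts.append("")
    for line in lines:
        if line.speaker not in known:
            raise ValueError(f"unknown speaker: {line.speaker}")
        # ASSUMPTION: bracketed cues stand in for audio tags; the real syntax may differ.
        cue = f"[{line.tag}] " if line.tag else ""
        parts.append(f"{line.speaker}: {cue}{line.text}")
    return "\n".join(parts)


script = build_script(
    scene="A quiet library at closing time.",
    speakers=[Speaker("Ana", "warm, unhurried"), Speaker("Ben", "nervous, fast")],
    lines=[
        Line("Ana", "We close in five minutes.", tag="whispering"),
        Line("Ben", "I just need one more page!"),
    ],
)
print(script)
```

Keeping the scene, the speaker notes, and the dialogue as separate structured values means the same direction can be reused across projects, which loosely mirrors the export idea described above. The resulting string would still need to be sent to the model through whatever request format the official Gemini API documentation specifies.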
Early testers have reported that these audio tags transform simple text into high-fidelity vocal performances, offering a new level of creative precision.
Addressing the critical need for transparency in the age of generative AI, all audio generated by Gemini 3.1 Flash TTS is watermarked with SynthID. This imperceptible watermark is embedded directly into the audio output, enabling reliable detection of AI-generated content to help prevent misinformation.
Google emphasizes that generative AI tools remain experimental and encourages users to exercise caution while leveraging these advanced capabilities. The launch marks a significant step forward in making AI speech more human-like and controllable, setting a new standard for digital audio applications.
