AI Powers Real-Time Language Translation in Google Meet, Enabling Seamless Multilingual Conversations
Fredric, who leads the audio engineering team for Google Meet, has witnessed firsthand how AI has transformed what his team can build. Two years ago, they began developing real-time speech translation for live Google Meet calls. At the time, existing systems could only translate offline; the goal was to make translation instantaneous, which is essential for natural, flowing conversations. Determined to get there, Fredric’s team partnered with Google DeepMind. “When we started, we thought, ‘Maybe this will take five years,’” he recalls. Just two years later, the technology is live. “With AI, things just keep accelerating,” he says. Today, a broad coalition of Google teams, including those from Pixel, Cloud, and Chrome, is collaborating with DeepMind to bring real-time speech translation to life.

The key limitation of earlier translation systems was their reliance on a multi-step process: first transcribing speech into text, then translating the text, and finally converting it back into spoken language. This approach introduced delays of 10 to 20 seconds, making real-time conversation nearly impossible. What’s more, the synthesized voices sounded artificial and lacked the speaker’s unique tone, pitch, and rhythm.

The breakthrough came with large, end-to-end models capable of “one-shot” translation. Instead of passing speech through multiple stages, these models take audio input and directly generate translated audio output. “You send audio in, and the model starts producing the translated audio almost immediately,” explains Huib, who leads product management for audio quality. This reduced latency to just two or three seconds, which the team identified as the “sweet spot”: any faster and the translation became hard to follow, while anything slower broke the flow of conversation. Striking that balance made simultaneous, natural-sounding multilingual conversations possible.

Developing the feature wasn’t without obstacles.
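To see why the end-to-end approach changes the latency picture so dramatically, here is a minimal back-of-the-envelope sketch. All stage timings below are illustrative assumptions, not measurements from Google Meet: the point is only that cascaded latency stacks sequential stages on top of the full utterance length, while a streaming end-to-end model begins emitting translated audio after a short, roughly fixed lag.

```python
def cascaded_latency(utterance_sec, asr=4.0, mt=3.0, tts=5.0):
    """Cascaded pipeline: wait for the whole utterance, then run
    transcription (asr), text translation (mt), and speech synthesis
    (tts) one after another. Stage times here are made-up placeholders;
    the article cites 10-20 s overall for such systems."""
    return utterance_sec + asr + mt + tts

def streaming_latency(model_lag=2.5):
    """End-to-end streaming model: translated audio starts after a short
    fixed lag (the article's 2-3 s 'sweet spot'), regardless of how long
    the speaker keeps talking."""
    return model_lag

for utterance in (5.0, 15.0):
    print(f"{utterance:>4.0f}s utterance: "
          f"cascaded {cascaded_latency(utterance):.1f}s, "
          f"streaming {streaming_latency():.1f}s")
```

The key structural difference is visible in the signatures: cascaded latency grows with utterance length, while the streaming lag does not, which is what makes back-and-forth conversation feasible.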
Ensuring high-quality translation across diverse conditions—such as varying accents, background noise, or unstable network connections—was a major challenge. The Meet and DeepMind teams worked closely together, continuously testing and refining models based on real-world performance. They also collaborated with linguists and language experts to better understand regional dialects, pronunciation differences, and cultural nuances. Languages with similar roots—like Spanish, Italian, Portuguese, and French—were easier to integrate due to shared grammar and vocabulary. In contrast, structurally different languages like German posed greater challenges, particularly in handling complex sentence structures, idiomatic expressions, and subtle tonal shifts. Currently, the system translates many phrases literally, sometimes leading to humorous or awkward results. However, the team is optimistic that future updates powered by advanced large language models will improve contextual understanding, enabling more accurate translations that capture tone, irony, and intent.
