Qwen3-Omni Technical Report

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
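To illustrate why a causal ConvNet permits streaming from the first codec frame, the sketch below shows a minimal PyTorch decoder in which every convolution is left-padded only, so the output for frame t never depends on frames after t. This is not the released implementation; the module names, codebook count, embedding dimension, and samples-per-frame upsampling factor are illustrative assumptions.

```python
# Minimal sketch (assumed names and sizes, not the Qwen3-Omni implementation):
# a causal ConvNet that maps multi-codebook codec frames to audio samples,
# emitting output as soon as the first frame arrives.
import torch
import torch.nn as nn


class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so frame t never sees frame t+1."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, frames)
        x = nn.functional.pad(x, (self.pad, 0))    # left-pad only -> causal
        return self.conv(x)


class StreamingCodecDecoder(nn.Module):
    """Maps discrete multi-codebook codec frames to waveform samples, frame by frame."""
    def __init__(self, num_codebooks=8, codebook_size=1024, dim=256, upsample=320):
        super().__init__()
        # One embedding table per codebook; their embeddings are summed per frame.
        self.embed = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_codebooks)
        )
        self.net = nn.Sequential(
            CausalConv1d(dim, dim, kernel_size=3), nn.GELU(),
            CausalConv1d(dim, dim, kernel_size=3, dilation=2), nn.GELU(),
            CausalConv1d(dim, upsample, kernel_size=3),  # samples emitted per frame
        )

    def forward(self, codes):                      # codes: (batch, num_codebooks, frames)
        h = sum(emb(codes[:, i]) for i, emb in enumerate(self.embed))  # (B, T, dim)
        wav = self.net(h.transpose(1, 2))          # (B, upsample, T)
        return wav.transpose(1, 2).reshape(codes.size(0), -1)  # (B, T * upsample)


# Because every layer is causal, audio for frame 0 can be produced as soon as the
# Talker emits frame 0's codes, rather than waiting for a full diffusion block.
decoder = StreamingCodecDecoder()
first_frame = torch.randint(0, 1024, (1, 8, 1))    # a single codec frame
print(decoder(first_frame).shape)                  # torch.Size([1, 320])
```

Under these assumptions, first-packet latency is bounded by the time to generate one codec frame plus one forward pass of the lightweight decoder, rather than by a block of diffusion steps.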