Qwen3-Omni Technical Report

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
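To illustrate why a causal ConvNet permits streaming from the first codec frame, the sketch below shows a minimal PyTorch decoder in which every convolution is left-padded only, so the output for frame t never depends on frames after t. This is not the released implementation; the module names, codebook count, embedding dimension, and samples-per-frame upsampling factor are illustrative assumptions.

```python
# Minimal sketch (assumed names and sizes, not the Qwen3-Omni implementation):
# a causal ConvNet that maps multi-codebook codec frames to audio samples,
# emitting output as soon as the first frame arrives.
import torch
import torch.nn as nn


class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so frame t never sees frame t+1."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, frames)
        x = nn.functional.pad(x, (self.pad, 0))    # left-pad only -> causal
        return self.conv(x)


class StreamingCodecDecoder(nn.Module):
    """Maps discrete multi-codebook codec frames to waveform samples, frame by frame."""
    def __init__(self, num_codebooks=8, codebook_size=1024, dim=256, upsample=320):
        super().__init__()
        # One embedding table per codebook; their embeddings are summed per frame.
        self.embed = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_codebooks)
        )
        self.net = nn.Sequential(
            CausalConv1d(dim, dim, kernel_size=3), nn.GELU(),
            CausalConv1d(dim, dim, kernel_size=3, dilation=2), nn.GELU(),
            CausalConv1d(dim, upsample, kernel_size=3),  # samples emitted per frame
        )

    def forward(self, codes):                      # codes: (batch, num_codebooks, frames)
        h = sum(emb(codes[:, i]) for i, emb in enumerate(self.embed))  # (B, T, dim)
        wav = self.net(h.transpose(1, 2))          # (B, upsample, T)
        return wav.transpose(1, 2).reshape(codes.size(0), -1)  # (B, T * upsample)


# Because every layer is causal, audio for frame 0 can be produced as soon as the
# Talker emits frame 0's codes, rather than waiting for a full diffusion block.
decoder = StreamingCodecDecoder()
first_frame = torch.randint(0, 1024, (1, 8, 1))    # a single codec frame
print(decoder(first_frame).shape)                  # torch.Size([1, 320])
```

Under these assumptions, first-packet latency is bounded by the time to generate one codec frame plus one forward pass of the lightweight decoder, rather than by a block of diffusion steps.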