Step-Audio 2 Technical Report

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
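
As a rough illustration of what "incorporating the generation of discrete audio tokens into language modeling" can look like, the sketch below interleaves text tokens and audio codec tokens in one shared id space so a single autoregressive model can predict both. It is a minimal sketch under assumed conventions, not the report's implementation: the vocabulary sizes, the BOA/EOA special tokens, and the build_target_sequence helper are all hypothetical.

```python
# Minimal sketch (not the authors' implementation): flatten text and discrete
# audio tokens into a single autoregressive target sequence. All vocabulary
# sizes and special tokens below are illustrative assumptions.

from dataclasses import dataclass
from typing import List

TEXT_VOCAB_SIZE = 32_000        # hypothetical text vocabulary size
AUDIO_CODEBOOK_SIZE = 4_096     # hypothetical discrete audio codec vocabulary
AUDIO_OFFSET = TEXT_VOCAB_SIZE  # audio ids shifted into a shared id space

BOA = AUDIO_OFFSET + AUDIO_CODEBOOK_SIZE  # assumed <begin_of_audio> marker
EOA = BOA + 1                             # assumed <end_of_audio> marker

@dataclass
class Turn:
    text_ids: List[int]   # tokenized response text
    audio_ids: List[int]  # discrete codec tokens for the same response

def build_target_sequence(turn: Turn) -> List[int]:
    """Flatten one assistant turn into the token stream the LM predicts."""
    audio = [AUDIO_OFFSET + a for a in turn.audio_ids]  # shift into shared space
    return turn.text_ids + [BOA] + audio + [EOA]

# Example: a short reply whose text and audio tokens are modeled jointly.
seq = build_target_sequence(Turn(text_ids=[17, 902, 5], audio_ids=[12, 7, 3009]))
print(seq)
```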