Step-Audio 2 Technical Report
Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu

Abstract
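This paper presents Step-Audio 2, an end-to-end multimodal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 performs strongly in automatic speech recognition (ASR) and audio understanding. To enable genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, which significantly improves its responsiveness to paralinguistic information such as speaking style and emotion. To make effective use of the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and can call external tools: web search to reduce hallucination, and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results show that Step-Audio 2 achieves state-of-the-art performance on a range of audio understanding and conversational benchmarks compared with other open-source and commercial solutions. For more information, please visit https://github.com/stepfun-ai/Step-Audio2.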