MiMo-VL Technical Report
Xiaomi LLM-Core Team, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia
Release date: 6/5/2025

Abstract
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models that achieve state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 of the 40 evaluated tasks and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it achieves 56.1 on OSWorld-G, even outperforming models specialized for this task such as UI-TARS. Our training approach combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL), which integrates diverse reward signals. We find that incorporating high-quality data with long chain-of-thought reasoning into the pre-training stages is crucial, and that mixed reinforcement learning yields substantial benefits despite the challenges of optimizing multiple domains simultaneously. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
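The abstract states that MORL integrates diverse reward signals into a single training objective. A minimal sketch of one way such signals could be aggregated per sample; the function names, signal set, and weights below are illustrative assumptions, not the report's actual implementation:

```python
# Hypothetical sketch: combining multiple reward signals into a single
# scalar reward, as a mixed-RL setup might do per training sample.
# The signal names and weights are assumptions, not from the report.

def mixed_reward(sample, reward_fns, weights):
    """Weighted sum of several scalar reward signals for one sample."""
    return sum(weights[name] * fn(sample) for name, fn in reward_fns.items())

# Toy stand-in reward signals (e.g. answer correctness and output format):
reward_fns = {
    "correctness": lambda s: 1.0 if s["answer"] == s["gold"] else 0.0,
    "format": lambda s: 1.0 if s["answer"].strip() else 0.0,
}
weights = {"correctness": 0.8, "format": 0.2}

sample = {"answer": "42", "gold": "42"}
print(mixed_reward(sample, reward_fns, weights))  # 1.0
```

In practice, a mixed-RL recipe would also have to balance signals whose scales and difficulties differ across domains, which is the optimization challenge the abstract alludes to.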