MiMo-VL Technical Report

We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
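
To give a concrete sense of what "integrating diverse reward signals" can mean in practice, below is a minimal Python sketch of mixing heterogeneous rewards into a single scalar with a group-normalized advantage. The reward functions, weights, and normalization scheme here are illustrative assumptions for exposition, not the report's actual MORL implementation.

```python
# Hypothetical sketch of mixing diverse reward signals for on-policy RL.
# Weights, reward functions, and the group-normalized baseline are
# illustrative assumptions, not MiMo-VL's actual MORL implementation.
from dataclasses import dataclass
from typing import Callable, List
import statistics


@dataclass
class RewardSource:
    name: str
    weight: float
    fn: Callable[[str, str], float]  # (model_response, reference) -> reward


def mixed_reward(sources: List[RewardSource], response: str, reference: str) -> float:
    """Combine heterogeneous reward signals into one scalar per response."""
    return sum(s.weight * s.fn(response, reference) for s in sources)


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within a group of sampled responses to one prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


# Example: a verifiable-correctness reward plus a lighter format reward.
sources = [
    RewardSource("accuracy", 1.0, lambda resp, ref: float(resp.split()[-1] == ref)),
    RewardSource("format", 0.2, lambda resp, ref: float(resp.startswith("<think>"))),
]

# Several sampled responses to the same prompt, reference answer "42".
responses = ["<think>...</think> 42", "41", "<think>...</think> 41"]
rewards = [mixed_reward(sources, r, "42") for r in responses]
print(group_advantages(rewards))
```

In an on-policy loop, these advantages would weight the policy-gradient update for each sampled response; the sketch only shows how multiple reward signals can be collapsed into the scalar that such an update consumes.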