
MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing
Abstract

Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models (RLMs). Among these domains, mathematical reasoning serves as a representative benchmark, as it requires precise multi-step logic and abstract reasoning that generalize to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness, omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: supervised fine-tuning (SFT) on a carefully curated corpus of 719K math-reasoning problems with verified chain-of-thought (CoT) trajectories, followed by reinforcement learning with verifiable reward (RLVR) on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization (CAMPO), an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our models achieve state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.
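The abstract names CAMPO's two ingredients, length-progressive training and an adaptive repetition penalty, but not how they are implemented. The sketch below is a minimal, hypothetical illustration of how such pieces could fit into an RLVR reward loop: a staged response-length schedule and an n-gram repetition penalty subtracted from the verifier's 0/1 reward. All constants (stage lengths, n-gram size, penalty scale) and function names are illustrative assumptions, not details from the paper, and the "adaptive" behavior of the real penalty is simplified to a fixed scale here.

```python
# Hypothetical sketch of two CAMPO-style components; not the paper's code.
from collections import Counter


def max_length_for_stage(stage: int, schedule=(8192, 16384, 32768)) -> int:
    """Length-progressive training: later RL stages permit longer responses.

    The stage budgets in `schedule` are assumed values for illustration.
    """
    return schedule[min(stage, len(schedule) - 1)]


def repetition_penalty(token_ids: list[int], n: int = 4, scale: float = 0.1) -> float:
    """Score how repetitive a sampled trajectory is via duplicated n-grams.

    Returns a non-negative penalty, normalized by trajectory length, so that
    degenerate looping outputs are discouraged during RL training.
    """
    if len(token_ids) < n:
        return 0.0
    ngrams = Counter(tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1))
    duplicated = sum(count - 1 for count in ngrams.values())
    return scale * duplicated / max(1, len(token_ids))


def shaped_reward(verifier_reward: float, token_ids: list[int]) -> float:
    """Combine the verifiable-answer reward with the repetition penalty."""
    return verifier_reward - repetition_penalty(token_ids)


# Example: a correct but highly repetitive trajectory earns less than 1.0.
tokens = [5, 6, 7, 8] * 6
print(shaped_reward(1.0, tokens))  # ~0.93 under these illustrative settings
```

The design intuition, as the abstract frames it, is that the penalty keeps long-context RL stable (no reward for padding answers with loops), while the growing length budget lets the model earn longer chains of thought only as training progresses.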