MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by a Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
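To make the sampling idea concrete, the minimal Python sketch below scores each candidate prompt by a combination of outcome variance and trajectory diversity over its sampled rollouts, then keeps the highest-scoring prompts for the next RL batch. This is an illustrative sketch, not the paper's exact formulation: the binary-reward assumption, the Jaccard-style diversity proxy, the `alpha`/`beta` weighting, and all function names are assumptions made for this example.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Rollout:
    tokens: list[str]   # generated reasoning trajectory
    reward: float       # e.g. 1.0 if the final answer is correct, else 0.0


def outcome_variance(rollouts: list[Rollout]) -> float:
    """Variance of rollout rewards; for 0/1 rewards this equals p * (1 - p)."""
    rewards = [r.reward for r in rollouts]
    mean = sum(rewards) / len(rewards)
    return sum((x - mean) ** 2 for x in rewards) / len(rewards)


def trajectory_diversity(rollouts: list[Rollout]) -> float:
    """Illustrative diversity proxy: mean pairwise (1 - Jaccard) over token sets."""
    if len(rollouts) < 2:
        return 0.0
    dists = []
    for a, b in itertools.combinations(rollouts, 2):
        sa, sb = set(a.tokens), set(b.tokens)
        union = sa | sb
        if not union:
            dists.append(0.0)
        else:
            dists.append(1.0 - len(sa & sb) / len(union))
    return sum(dists) / len(dists)


def variance_promotion_score(rollouts: list[Rollout],
                             alpha: float = 1.0,
                             beta: float = 1.0) -> float:
    """Hypothetical VPS: weighted sum of outcome variance and trajectory diversity."""
    return alpha * outcome_variance(rollouts) + beta * trajectory_diversity(rollouts)


def variance_aware_sample(prompt_rollouts: dict[str, list[Rollout]], k: int) -> list[str]:
    """Keep the k prompts with the highest VPS for the next RL training batch."""
    ranked = sorted(prompt_rollouts,
                    key=lambda p: variance_promotion_score(prompt_rollouts[p]),
                    reverse=True)
    return ranked[:k]
```

The intuition follows the abstract's argument: prompts whose rollouts are all correct or all incorrect have near-zero reward variance and therefore contribute vanishing gradients under GRPO's group-relative advantages, so preferentially sampling high-VPS prompts keeps reward variance, and hence the optimization signal, from collapsing.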