
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
Release Date: 5/29/2025
Abstract

Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data, an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math), using standard datasets without ground truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
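The core idea, replacing a ground-truth reward with majority voting over a group of sampled responses, can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that); the `extract_answer` helper and the input strings below are hypothetical stand-ins for whatever answer parsing the actual code uses, and the resulting 0/1 rewards would feed GRPO's group-relative advantage computation in place of label-based checking.

```python
import re
from collections import Counter

def extract_answer(response: str) -> str:
    """Toy answer extractor: take the last number in the response text."""
    numbers = re.findall(r"-?\d+\.?\d*", response)
    return numbers[-1] if numbers else response.strip()

def majority_vote_reward(responses: list[str]) -> list[float]:
    """Self-reward without labels: a response gets reward 1.0 if its final
    answer agrees with the most common answer in the sampled group, else 0.0."""
    answers = [extract_answer(r) for r in responses]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# Example: 8 responses sampled for one unlabeled multi-modal question
samples = ["... so the area is 12", "... answer: 12", "... answer: 15",
           "... gives 12", "... 12", "... roughly 10", "... 12", "... final: 12"]
print(majority_vote_reward(samples))
# [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
```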