MM-Zero: Self-Evolving Multi-Model Vision-Language Models from Zero Data
Abstract
Self-evolution has emerged as a key paradigm for improving foundation models such as large language models (LLMs) and vision-language models (VLMs) with minimal human intervention. While recent approaches have shown that LLM agents can self-evolve from nearly zero data, VLMs, owing to their additional visual modality, have generally required at least some seed data (e.g., images) to bootstrap the self-evolution process. In this work, we propose Multi-model Multimodal Zero (MM-Zero), the first reinforcement learning-based framework to achieve zero-data self-evolution for VLM reasoning. Going beyond the conventional dual-role (proposer and solver) setup, MM-Zero introduces a multi-role self-evolving training framework with three specialized roles: a Proposer that generates abstract visual concepts and formulates questions, a Coder that translates these concepts into executable code (e.g., Python or SVG) to render visual images, and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained with Group Relative Policy Optimization (GRPO), using a carefully designed reward mechanism that integrates execution feedback, visual verification, and difficulty balancing. Experimental results show that MM-Zero improves VLM reasoning performance across diverse multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
One-sentence Summary
Researchers from the University of Maryland, Brown University, and NVIDIA introduce MM-Zero, the first reinforcement learning framework enabling vision-language models to self-evolve without external data by employing a novel tri-role system of Proposer, Coder, and Solver to generate and reason over synthetic visual content.
Key Contributions
- MM-Zero addresses the bottleneck of requiring seed image data for Vision Language Model self-evolution by introducing the first framework to achieve zero-data training through autonomous visual content generation.
- The method replaces traditional dual-role setups with a novel tri-role pipeline where a Proposer creates abstract concepts, a Coder renders them into executable code, and a Solver performs reasoning, all optimized via Group Relative Policy Optimization.
- Experiments on Qwen3-VL and MiMo-VL models demonstrate that this approach yields consistent performance improvements across diverse multimodal benchmarks without relying on any external human-annotated datasets.
Introduction
Self-evolving paradigms offer a scalable path to improve Vision Language Models (VLMs) by reducing reliance on costly human-annotated data, yet existing methods remain bottlenecked by their dependence on static seed image datasets. Prior approaches typically adapt dual-role proposer-solver frameworks that can only iterate within the fixed distribution of pre-collected images, limiting the diversity and complexity of generated training scenarios. The authors introduce MM-Zero, a tri-role reinforcement learning framework that achieves true zero-data self-evolution by adding a specialized Coder role to programmatically render visual content from abstract concepts. This system enables a Proposer, Coder, and Solver to interact in a closed loop where the model generates its own visual training data and reasoning tasks without any external inputs, significantly expanding the frontier of autonomous multimodal learning.
Method
The authors present MM-Zero, a self-evolving framework for Multimodal Large Language Models (MLLMs) that utilizes Reinforcement Learning with Verifiable Rewards (RLVR). The system is composed of three distinct model agents evolved from the same base model: a Proposer (πP), a Coder (πD), and a Solver (πS). These agents operate in a closed training loop where each role is optimized sequentially via Group Relative Policy Optimization (GRPO) while the others remain frozen.
Refer to the framework diagram to understand the interaction between these components. The Proposer generates a quadruple consisting of a fine-grained textual description, an easy question, its known answer, and a hard question requiring multi-step reasoning. The Coder converts the textual description into executable code (specifically SVG) to render a figure. The Solver then processes the rendered image. It first answers the easy question to verify semantic correctness, providing a reward signal to update the Coder. Subsequently, it answers the hard question using majority voting to generate pseudo-labels for its own training, while providing a difficulty reward to optimize the Proposer.
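The closed loop above can be sketched in pseudocode. All class interfaces, function names, and the sampling count below are illustrative placeholders, not the authors' actual implementation:

```python
def rollout(proposer, coder, solver, renderer, n_samples=8):
    """One hypothetical pass of the Proposer -> Coder -> Solver loop."""
    # Proposer emits a quadruple: description, easy Q, easy answer, hard Q.
    desc, easy_q, easy_a, hard_q = proposer.generate()

    # Coder turns the description into executable code (e.g., SVG),
    # which is rendered into an image; `ok` flags execution success.
    code = coder.generate(desc)
    image, ok = renderer.render(code)
    if not ok:
        return {"coder_reward": 0.0, "hard_answers": []}

    # Solver first checks semantic correctness via the easy question,
    # which supplies the Coder's reward signal...
    coder_reward = float(solver.answer(image, easy_q) == easy_a)

    # ...then samples multiple attempts at the hard question; their
    # self-consistency later feeds the Proposer's difficulty reward.
    hard_answers = [solver.answer(image, hard_q) for _ in range(n_samples)]
    return {"coder_reward": coder_reward, "hard_answers": hard_answers}
```

In the real pipeline each role is a full VLM policy and only one role is unfrozen per training stage; this sketch only shows the data flow of a single rollout.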

The training pipeline involves an iterative evolution of the models. As shown in the figure below, the Coder and Proposer improve over iterations (Iter 1 to Iter 3), generating increasingly complex visual content and questions. For instance, the Coder evolves from rendering simple stacked bar charts to complex geometric constructions with multiple overlapping circles. The Proposer evolves to generate more detailed captions and harder questions that push the Solver's reasoning capabilities. To ensure training quality, the authors apply stage-specific data filters. For the Coder, they retain examples where the rendering success rate falls within a specific range, excluding trivially simple or impossible tasks. For the Solver, they keep examples where easy-question accuracy is high but hard-question accuracy remains in a challenging range, ensuring the model is trained on data of appropriate difficulty.
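The stage-specific filters described above admit a compact sketch. The threshold values here are assumptions for illustration, not the paper's reported hyperparameters:

```python
def keep_for_coder(render_success_rate, lo=0.2, hi=0.8):
    """Retain Coder examples whose rendering success rate falls in a
    middle band, excluding trivially simple or impossible tasks."""
    return lo <= render_success_rate <= hi

def keep_for_solver(easy_acc, hard_acc,
                    easy_min=0.8, hard_lo=0.2, hard_hi=0.8):
    """Retain Solver examples where the image is faithful (easy-question
    accuracy is high) but the hard question stays challenging."""
    return easy_acc >= easy_min and hard_lo <= hard_acc <= hard_hi
```

Both filters implement the same idea from opposite directions: data the model always gets right teaches nothing, and data it always gets wrong provides no learning signal either.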

The reward formulation is central to the self-evolving process. The Proposer receives a hierarchical reward Rp(x) that validates formatting, solvability, and difficulty. This includes a code execution indicator, a solvability score based on the Solver's accuracy on the easy question, and a difficulty score based on the Solver's self-consistency on the hard question. The difficulty score follows the Goldilocks principle, peaking when the Solver is maximally uncertain. Additionally, penalties are applied for easy-hard mismatches and lack of content diversity.
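The Goldilocks-style difficulty score can be sketched as a simple function of the Solver's self-consistency; the exact functional form below is an assumption, chosen only to show the peak-at-uncertainty shape:

```python
def difficulty_score(consistency):
    """Illustrative Goldilocks difficulty reward.

    consistency: fraction of Solver samples agreeing with the majority
    answer on the hard question, in [0, 1]. The score peaks at 0.5
    (maximal Solver uncertainty) and vanishes at the extremes, so the
    Proposer is pushed away from both trivial and unsolvable questions.
    """
    return 1.0 - abs(2.0 * consistency - 1.0)
```

A question every sample answers identically (consistency 1.0) scores 0, as does one where no consensus forms; a question splitting the Solver evenly scores highest.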
The Coder is rewarded based on execution status, semantic correctness (solvability of the easy question), and task feasibility (difficulty of the hard question). The Solver, trained on hard questions without ground truth labels, utilizes Test-Time Reinforcement Learning (TTRL). It generates multiple reasoning paths and identifies a silver answer via majority vote. The reward for the Solver is a weighted sum of answer accuracy against this consensus and structural validity, ensuring the model adheres to a Chain-of-Thought format followed by a boxed final answer.
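The Solver's TTRL-style reward, combining majority-vote accuracy with a format check, might look like the following sketch; the weights and the boolean format flag are illustrative assumptions:

```python
from collections import Counter

def solver_rewards(answers, w_acc=0.9, w_fmt=0.1):
    """Illustrative TTRL reward for a group of Solver rollouts.

    answers: list of (final_answer, has_valid_format) pairs, one per
    sampled reasoning path. The majority answer serves as the 'silver'
    pseudo-label; each rollout is rewarded for matching it and for
    producing a well-formed chain-of-thought with a boxed final answer.
    """
    silver, _ = Counter(a for a, _ in answers).most_common(1)[0]
    return [w_acc * float(a == silver) + w_fmt * float(fmt)
            for a, fmt in answers]
```

Because the pseudo-label is the group's own consensus, no ground-truth answer is ever needed, which is what makes hard-question training possible in the zero-data setting.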
The authors adopt Group Relative Policy Optimization (GRPO) to update the policies. Given a prompt p, the current policy generates a group of N responses with corresponding rewards. These rewards are normalized within the group to yield response-level advantages A^i, which are used to maximize a clipped surrogate objective regularized with a KL divergence term. This approach allows the system to improve reasoning and generation quality without requiring a learned value function.
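The group-relative advantage computation at the heart of GRPO is small enough to show directly; this minimal sketch omits the clipped surrogate objective and the KL regularizer:

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of N responses to one prompt.

    Each response's advantage is its reward standardized against the
    group mean and standard deviation, so no learned value function
    (critic) is required to estimate a baseline.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that beat the group average get positive advantages and are reinforced; below-average responses are suppressed, all relative to siblings sampled from the same prompt.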
Experiment
- Solver evaluation across general visual reasoning, mathematical visual reasoning, and hallucination detection benchmarks validates that the proposed framework improves model performance without external data, with the most significant gains observed in complex visual math tasks.
- Experiments on multiple model sizes demonstrate that the method generalizes effectively, though models with stronger base capabilities and higher image rendering success rates achieve greater improvements.
- Qualitative analysis of training iterations reveals a clear evolution where generated images transition from cluttered and unreadable to polished and faithful, while questions progress from trivial value extraction to requiring genuine multi-step compositional reasoning.
- Ablation studies confirm that capping solvability rewards prevents the model from exploiting shortcuts by embedding answers directly in images, while enforcing content diversity avoids overfitting to narrow visual types like histograms.
- Continued training beyond initial iterations shows that performance does not saturate, indicating a promising path for self-evolving multimodal models to improve reasoning capabilities autonomously.