From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye

Abstract

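As large multimodal models (LMMs) scale up and reinforcement learning (RL) methods advance, LMMs have made notable progress in complex reasoning and decision making. However, training still relies on static data and fixed recipes, which makes it difficult to diagnose capability blind spots or to provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop in which diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides the agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, suggesting that DPE is a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.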

One-sentence Summary

Hongrui Jia, Chaoya Jiang, et al. propose Diagnostic-driven Progressive Evolution (DPE), a self-improving loop that diagnoses LMM weaknesses and generates targeted multimodal data for reinforcement, outperforming static training across eleven benchmarks and enabling scalable, continual LMM evolution under open-ended tasks.

Key Contributions

  • DPE introduces a diagnostic-driven training loop for Large Multimodal Models that identifies capability blind spots and dynamically generates targeted, weakness-focused data using multi-agent tool-augmented annotation, overcoming limitations of static datasets and heuristic-based evolution.
  • Applied to Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct, DPE achieves stable, continual improvements across eleven multimodal reasoning benchmarks using only 1000 training examples per iteration, demonstrating efficiency and scalability under open task distributions.
  • Systematic analysis confirms that DPE’s diagnosis mechanism enhances training stability and mitigates long-tail performance degradation, offering a principled approach to continual model improvement without relying on expensive human annotations or fixed data recipes.

Introduction

The authors use diagnostic feedback to address key limitations in training Large Multimodal Models (LMMs). Prior self-evolution methods rely on heuristic signals and static visual data, which leads to unstable training and poor long-tail performance. Existing frameworks lack interpretable failure attribution and struggle to generate diverse, targeted multimodal samples, so models plateau or regress on complex tasks such as math or OCR. Their main contribution is Diagnostic-driven Progressive Evolution (DPE), a closed-loop training paradigm that diagnoses model weaknesses, dynamically generates tailored multimodal data using multi-agent tool use, and reinforces improvements iteratively, yielding stable, broad gains across benchmarks with minimal data.

Method

The authors leverage Diagnostic-driven Progressive Evolution (DPE), a closed-loop training framework designed to enhance large multimodal models (LMMs) under conditions of scarce supervision and long-tail coverage gaps. Unlike prior self-evolution methods that rely on static image sets and heuristic signals, DPE iteratively executes diagnosis, targeted generation, and reinforcement-based updating. Each iteration explicitly controls both the category composition and question emphasis of the training data, aligning resources with the model’s current capability blind spots to mitigate instability and diminishing returns on long-tail skills.

At iteration $k$, the policy is denoted as $\pi_{\theta^{(k)}}$. The framework constructs a training set $\mathcal{T}^{(k)}$ and updates parameters to $\theta^{(k+1)}$ via reinforcement learning with verifiable rewards:

$$\theta^{(k+1)} = \mathcal{A}_{\mathrm{RL}}\Big(\theta^{(k)};\, \mathcal{T}^{(k)}\Big), \quad \mathcal{T}^{(k)} = \mathcal{A}_{\mathrm{gen}}\Big(\mathcal{R}^{(k)}\Big), \quad \mathcal{R}^{(k)} = \mathcal{A}_{\mathrm{diag}}\Big(\pi_{\theta^{(k)}}\Big),$$

where $\mathcal{A}_{\text{diag}}$, $\mathcal{A}_{\text{gen}}$, and $\mathcal{A}_{\text{RL}}$ represent the diagnosis, generation, and RL-update operators, respectively, and $\mathcal{R}^{(k)}$ is a structured diagnostic report.
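The three-operator recursion can be sketched as a plain loop. This is only an illustrative skeleton: the function name `dpe_loop`, the callables passed in, and the `Report`/`Sample` type aliases are hypothetical placeholders, not the authors' API.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases; the source does not specify concrete data structures.
Report = Dict[str, object]
Sample = Dict[str, object]


def dpe_loop(
    theta: object,
    diagnose: Callable[[object], Report],
    generate: Callable[[Report], List[Sample]],
    rl_update: Callable[[object, List[Sample]], object],
    num_iterations: int,
) -> object:
    """Run diagnosis -> generation -> RL update for a fixed number of iterations."""
    for _ in range(num_iterations):
        report = diagnose(theta)             # diagnostic report for the current policy
        train_set = generate(report)         # training set built from the report
        theta = rl_update(theta, train_set)  # parameters for the next iteration
    return theta


# Toy stand-ins: "parameters" are an int and each update adds the batch size.
final = dpe_loop(
    0,
    diagnose=lambda th: {"weak": th},
    generate=lambda rep: [{"id": 0}, {"id": 1}],
    rl_update=lambda th, ts: th + len(ts),
    num_iterations=3,
)
print(final)  # 6
```

Passing the operators as callables keeps the loop independent of how diagnosis, generation, and the RL update are actually implemented.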

The diagnostic mechanism initiates each iteration by performing explicit failure attribution and capability decomposition. It maps multimodal reasoning into a 12-dimensional capability space $C = \{c_1, c_2, \ldots, c_K\}$, including categories such as geometry images, medical images, statistical charts, and natural scenes. From a diagnostic pool $\mathcal{D}_{\text{diag}}$, the system samples $N = 200$ instances $\{(I_n, q_n, a_n, c_n)\}_{n=1}^{N}$, and the model generates responses $\hat{y}_n \sim \pi_{\theta^{(k)}}(\cdot \mid I_n, q_n)$. Diagnostic agents score each response using a function $v(\cdot)$ that evaluates both reasoning steps and final results, producing a scalar correctness signal $z_n$. For each category $c$, the system computes counts and accuracy:

$$N_c = \sum_{n=1}^{N} \mathbb{I}[c_n = c], \qquad \mathrm{Acc}_c = \frac{1}{N_c} \sum_{n=1}^{N} \mathbb{I}[c_n = c] \cdot z_n.$$
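As a small illustration of the per-category counts and accuracy, the following sketch aggregates (category, correctness) pairs. The function name and the input layout are assumptions made for the example, and the correctness signal is treated as 0 or 1.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def category_accuracy(results: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[int, float]]:
    """Map each category to (count, accuracy) from (category, correctness) pairs."""
    counts: Dict[str, int] = defaultdict(int)
    correct: Dict[str, float] = defaultdict(float)
    for category, z in results:
        counts[category] += 1   # number of diagnostic instances in this category
        correct[category] += z  # sum of correctness signals in this category
    return {c: (counts[c], correct[c] / counts[c]) for c in counts}


stats = category_accuracy([("chart", 1), ("chart", 0), ("geometry", 1)])
print(stats)  # {'chart': (2, 0.5), 'geometry': (1, 1.0)}
```

Only categories that appear in the diagnostic sample receive an entry, which avoids dividing by a zero count.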

Beyond accuracy, agents analyze the error set $\mathcal{E}_c = \{n \mid c_n = c,\ z_n = 0\}$ to summarize recurring failure patterns $\mathcal{F}_c$, such as OCR misalignments or chart legend mismatches. These patterns are injected into the generation phase as executable prompts. The system then derives a category proportion vector $\alpha^{(k)}$ by assigning unnormalized weights $\tilde{\alpha}_c$ based on segmented accuracy ranges and normalizing:

$$\alpha_c^{(k)} = \frac{\tilde{\alpha}_c}{\sum_{c'=1}^{C} \tilde{\alpha}_{c'}}.$$
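The normalization step might look like the sketch below. The summary does not give the accuracy segments or their weights, so the thresholds (0.4, 0.7) and weights (3, 2, 1) here are purely hypothetical, chosen only so that weaker categories receive a larger share.

```python
from typing import Dict


def unnormalized_weight(acc: float) -> float:
    """Assign a raw weight from an accuracy segment (segments are assumed, not from the source)."""
    if acc < 0.4:
        return 3.0  # weak category: emphasize heavily
    if acc < 0.7:
        return 2.0  # intermediate category
    return 1.0      # strong category: keep a small share


def category_proportions(accuracy: Dict[str, float]) -> Dict[str, float]:
    """Normalize the raw weights so the proportions sum to 1."""
    raw = {c: unnormalized_weight(a) for c, a in accuracy.items()}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}


props = category_proportions({"ocr": 0.3, "chart": 0.6, "scene": 0.9})
print(props["ocr"])  # 0.5
```

With these assumed segments, the raw weights are 3, 2, and 1, so the weakest category receives half of the data mixture.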

The final diagnostic report $\mathcal{R}^{(k)}$ includes $\alpha^{(k)}$, $\{\mathcal{F}_c^{(k)}\}$, and $\{\mathcal{H}_c^{(k)}\}$, where $\mathcal{H}_c^{(k)}$ provides actionable generation instructions such as enforcing stricter answer formats or longer reasoning chains.
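One way to picture the report is as a small record with three fields. The class name, field names, and the `directives_for` helper below are hypothetical; the source only states which three pieces of information the report contains.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DiagnosticReport:
    """Hypothetical container for the three parts of the diagnostic report."""
    alpha: Dict[str, float]                                                # category proportions
    failure_patterns: Dict[str, List[str]] = field(default_factory=dict)  # recurring failure patterns
    instructions: Dict[str, List[str]] = field(default_factory=dict)      # generation instructions

    def directives_for(self, category: str) -> List[str]:
        """Collect everything a generator would need to target one category."""
        return self.failure_patterns.get(category, []) + self.instructions.get(category, [])


report = DiagnosticReport(
    alpha={"ocr": 1.0},
    failure_patterns={"ocr": ["OCR misalignment"]},
    instructions={"ocr": ["enforce a stricter answer format"]},
)
print(report.directives_for("ocr"))
```

Keeping failure patterns and instructions keyed by category makes it straightforward for a downstream planner to look up guidance for the category it is filling.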

The Multiple Agents Questioner System translates $\mathcal{R}^{(k)}$ into a training dataset $\mathcal{T}^{(k)} = \{(I_j, q_j, a_j, c_j)\}_{j=1}^{M}$ with controllable distribution and verifiable answers. Given a target budget $M$, the system enforces a hard category quota constraint: for each category $c$, $m_c = \left\lfloor M \cdot \alpha_c^{(k)} \right\rfloor$, and the final dataset must satisfy:

$$\sum_{(I, q, a, c) \in \mathcal{T}^{(k)}} \mathbb{I}[c = c'] = m_{c'}, \quad \forall c' \in \{1, \ldots, C\}.$$
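A worked example of the floor-based quota may help. The helper name below is illustrative. Because each quota is rounded down, the quotas can sum to slightly less than the budget; the summary does not say how any shortfall is handled, so the sketch simply reports it.

```python
import math
from typing import Dict


def category_quotas(alpha: Dict[str, float], budget: int) -> Dict[str, int]:
    """Per-category quota: floor of budget times the category proportion."""
    return {c: math.floor(budget * a) for c, a in alpha.items()}


# A budget of 1000 matches the per-iteration sample count mentioned in the summary.
quotas = category_quotas({"ocr": 0.5, "chart": 0.25, "scene": 0.25}, budget=1000)
print(quotas)  # {'ocr': 500, 'chart': 250, 'scene': 250}

# Flooring can leave part of the budget unassigned.
thirds = category_quotas({"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}, budget=10)
print(sum(thirds.values()))  # 9
```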

The system comprises four agents: Planner, Image Selector, Question Generator, and Validation. The Planner Agent outputs a plan for each sample $j$:

$$\mathrm{plan}_j = \big(c_j,\ \mathrm{req}_j^I,\ \mathrm{req}_j^Q,\ \mathrm{dir}_j\big),$$

where $c_j$ is the target category, $\mathrm{req}_j^I$ specifies image requirements, $\mathrm{req}_j^Q$ specifies question requirements, and $\mathrm{dir}_j$ targets weaknesses derived from $\mathcal{F}_{c_j}^{(k)}$ and $\mathcal{H}_{c_j}^{(k)}$. The Image Selector Agent retrieves or composes images $I_j$ from an external pool $\mathcal{P}_{\text{ext}}$ using a pipeline $\phi(\cdot)$ that includes search, filtering, and editing capabilities. The Question Generator Agent produces $(q_j, a_j)$ given $I_j$ and planning instructions:

$$(q_j, a_j) = \psi\big(I_j,\ \mathrm{req}_j^Q,\ \mathcal{H}_{c_j}^{(k)}\big).$$

The Validation Agent gates sample quality using four checks: category consistency, solvability, answer verifiability, and format compliance. The final acceptance condition is:

$$g(s_j) = g_{\mathrm{cat}} \cdot g_{\mathrm{sol}} \cdot g_{\mathrm{ver}} \cdot g_{\mathrm{fmt}}.$$

If $g(s_j) = 1$, the sample is added to $\mathcal{T}^{(k)}$ and the quota state is updated; otherwise, it is discarded and regenerated.
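The accept-or-regenerate behavior can be sketched as a product of binary checks inside a retry loop. Everything named here (`gate`, `fill_quota`, the `propose` callable, and the `max_attempts` safety cap) is an assumption for illustration; the real checks are performed by the Validation Agent rather than by simple predicates.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]
Check = Callable[[Sample], bool]


def gate(sample: Sample, checks: List[Check]) -> int:
    """Product of binary checks: 1 only if every check passes, else 0."""
    return int(all(check(sample) for check in checks))


def fill_quota(propose: Callable[[], Sample], checks: List[Check],
               quota: int, max_attempts: int = 1000) -> List[Sample]:
    """Keep proposing samples until the quota is met; rejected samples are dropped."""
    accepted: List[Sample] = []
    attempts = 0
    while len(accepted) < quota and attempts < max_attempts:  # cap is an added safeguard
        sample = propose()
        attempts += 1
        if gate(sample, checks) == 1:
            accepted.append(sample)
    return accepted


# Toy example: two stand-in checks that only ids divisible by 6 satisfy.
ids = iter(range(100))
toy_checks = [lambda s: s["id"] % 2 == 0, lambda s: s["id"] % 3 == 0]
accepted = fill_quota(lambda: {"id": next(ids)}, toy_checks, quota=3)
print([s["id"] for s in accepted])  # [0, 6, 12]
```

Because the gate is a product, a single failed check is enough to reject a sample, which matches the all-or-nothing acceptance condition.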

Training proceeds via GRPO. For each prompt $x$, the old policy $\pi_{\theta_{\text{old}}}$ generates $G$ trajectories $y_i = (o_{i,1}, \ldots, o_{i,|y_i|}) \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$. Each trajectory receives a scalar reward $r_i = r(x, y_i)$. GRPO optimizes the clipped surrogate objective:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\} \sim \pi_{\theta_{\mathrm{old}}}} \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\bigg( \rho_{i,t} A_{i,t},\ \mathrm{clip}(\rho_{i,t}, 1 - \varepsilon, 1 + \varepsilon)\, A_{i,t} \bigg) - \beta\, \mathrm{KL}\big(\pi_\theta \parallel \pi_{\mathrm{init}}\big) \Bigg]$$

where $\rho_{i,t} = \frac{\pi_{\theta}(o_{i,t} \mid x, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid x, o_{i,<t})}$ is the token-level importance ratio, $\varepsilon$ is the clipping threshold, $\beta > 0$ controls KL regularization, and $\pi_{\text{init}}$ is a reference policy. A key component is the group-normalized advantage, which in standard GRPO is shared by every token of trajectory $i$ (so $A_{i,t} = \hat{A}_i$):

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}.$$
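The advantage normalization and the clipped term for a single token can be sketched in plain Python. Several details are assumptions rather than facts from the source: the population standard deviation is used, a small `eps` is added to the denominator to avoid division by zero when all rewards in a group are equal, and the default `clip_eps=0.2` is only a commonly seen value.

```python
import statistics
from typing import List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of trajectories for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; the source does not say which variant
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-token surrogate: min of the unclipped and clipped ratio-weighted advantage."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


advantages = group_normalized_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, 1.0, -1.0]
print(clipped_term(1.5, 1.0))             # 1.2
print(clipped_term(0.5, -1.0))            # -0.8
```

When every trajectory in a group receives the same reward, all advantages are zero and the prompt contributes no policy gradient through the clipped term.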

From a maximum-entropy perspective, the optimal policy satisfies $\pi^{*}(y \mid x) \propto \pi_{\mathrm{init}}(y \mid x) \exp\big(r(x, y) / \beta\big)$, and the KL divergence admits a lower bound:

$$\mathrm{KL}\big(\pi_{\mathrm{init}} \parallel \pi^{*}\big) \geq \frac{p(x)\big(1 - p(x)\big)}{2\beta^{2}},$$

where $p(x)$ is the pass rate under $\pi_{\text{init}}$. This bound is maximized near $p = 0.5$, explaining why DPE retains only moderately difficult samples to improve learning efficiency.
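To make the role of the pass rate concrete, the sketch below evaluates the bound and applies a simple difficulty filter. The filter thresholds (`low=0.2`, `high=0.8`) and the function names are hypothetical; the summary only says that moderately difficult samples are retained.

```python
from typing import Dict, List


def kl_lower_bound(pass_rate: float, beta: float) -> float:
    """Value of the bound p(1 - p) / (2 * beta^2) for a given pass rate."""
    return pass_rate * (1.0 - pass_rate) / (2.0 * beta ** 2)


def keep_moderate(pass_rates: Dict[str, float], low: float = 0.2, high: float = 0.8) -> List[str]:
    """Keep sample ids whose estimated pass rate lies in an assumed moderate band."""
    return [sid for sid, p in pass_rates.items() if low <= p <= high]


print(kl_lower_bound(0.5, beta=1.0))  # 0.125
print(kl_lower_bound(0.0, beta=1.0))  # 0.0
print(keep_moderate({"a": 0.0, "b": 0.5, "c": 1.0, "d": 0.25}))  # ['b', 'd']
```

Samples that the model always solves or never solves make the bound vanish, so a filter of this kind discards them and keeps prompts near the middle of the pass-rate range.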

At iteration $k$, DPE generates and validates $\mathcal{T}^{(k)}$, applies difficulty-aware filtering to obtain $\mathcal{T}_{\text{train}}^{(k)}$, and performs GRPO to update the model: $\theta^{(k+1)} = \mathcal{A}_{\text{RL}}\big(\theta^{(k)}; \mathcal{T}_{\text{train}}^{(k)}\big)$. The system then repeats the diagnostic round, progressively strengthening weak capabilities and expanding visual coverage through external image sources.

Experiment

  • DPE outperforms VisPlay in capability enhancement, training stability, and cross-model transferability, particularly excelling in STEM, OCR, and hallucination mitigation through a closed-loop diagnostic mechanism.
  • DPE achieves state-of-the-art results with parameter efficiency, surpassing larger models like Qwen2.5-VL-72B and GPT-4o in complex visual math and grounding tasks, highlighting the value of data quality over scale.
  • Ablation studies confirm DPE’s diagnostic module is essential for sustained improvement, preventing performance oscillation and guiding data generation toward true capability gaps.
  • DPE’s image retrieval and editing tools significantly expand visual diversity, preventing early plateaus and improving performance on OCR and math reasoning by covering long-tail visual patterns.
  • Generated data from DPE shows higher and more stable text and image diversity across iterations, avoiding template collapse and maintaining broad semantic and visual coverage.
  • Quality evaluations reveal DPE consistently produces high-quality, solvable, and visually grounded questions, while VisPlay’s output degrades over time, especially in correctness and structure.
  • Case studies illustrate DPE’s ability to generate complete, well-structured, and semantically grounded questions, unlike VisPlay’s incomplete or unanswerable examples.

The authors use a diagnostic-guided data evolution framework to iteratively improve vision-language models under low-data conditions, achieving consistent gains across diverse benchmarks including STEM, OCR, and hallucination mitigation. Results show that their method sustains stable performance growth across iterations while outperforming self-evolving baselines and larger state-of-the-art models, particularly in complex reasoning and grounding tasks. The approach proves effective across model scales and relies on targeted data generation rather than volume, with diagnostic feedback ensuring continuous alignment with model weaknesses.

The authors use a multi-agent system to generate training data iteratively, with DPE consistently producing higher-quality questions than VisPlay across all iterations, particularly in solvability and correctness. Results show DPE maintains stable, near-ceiling quality scores while VisPlay’s quality degrades over time, indicating DPE’s diagnostic guidance effectively sustains data reliability. This quality advantage directly supports more stable and effective model evolution compared to self-evolving baselines.

The authors use DPE to generate training data with higher and more stable text and image diversity compared to VisPlay, as measured by mean pairwise cosine distance across iterations. Results show DPE sustains diversity gains over time while VisPlay exhibits degradation, particularly in later iterations, indicating DPE’s mechanisms better prevent distribution collapse and template reversion. This enhanced diversity supports broader semantic and visual coverage, contributing to more robust model performance.
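The diversity metric named above, mean pairwise cosine distance, can be computed as in the following sketch. It assumes each question or image has already been mapped to a non-zero embedding vector; the summary does not say which embedding model is used, and the function names are illustrative.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_cosine_distance(embeddings: List[Sequence[float]]) -> float:
    """Average cosine distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Two identical vectors and one orthogonal vector: distances are 1, 0, and 1.
score = mean_pairwise_cosine_distance([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(round(score, 4))  # 0.6667
```

A higher value means the generated samples point in more varied directions in embedding space, which is the sense in which the comparison treats a dataset as more diverse.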

The authors use DPE to enhance Qwen3-VL-8B-Instruct under low-data conditions, achieving state-of-the-art performance across multiple benchmarks including visual math and hallucination mitigation. Results show DPE outperforms larger models like Qwen2.5-VL-72B and GPT-4o in key areas, demonstrating that targeted data generation and diagnostic feedback yield stronger gains than parameter scale alone. The method sustains stable improvements across iterations by focusing on model weaknesses and maintaining high data quality and diversity.

The authors use DPE to iteratively generate high-quality training data from a small seed set, achieving performance gains over static training despite using only 3K samples. Results show consistent improvements across multiple benchmarks, including MMMU, HallusionBench, MathVista, and RealWorldQA, indicating that targeted data generation based on diagnostic feedback enhances model capabilities more effectively than larger static datasets. The method demonstrates stable training dynamics and superior data efficiency, with gains sustained across iterations without performance regression.

