
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

Li, Yunxin; Chen, Xinyu; Li, Zitao; Liu, Zhenyu; Wang, Longyue; Luo, Wenhan; Hu, Baotian; Zhang, Min
Publication date: 5/28/2025
Abstract

Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chains of thought (CoTs) and in downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve Video-LLMs' capacity for generating deep, long-term reasoning chains. The core component is a Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) significantly faster and more effective optimization than standard GRPO variants, yielding superior performance; 2) our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
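
The abstract only outlines the GRPO-Verifier-DPO loop at a high level, so the following is a minimal Python sketch of how one iteration could be structured under those assumptions. All names (grpo_rollout_and_update, verifier_judge, dpo_update, veripo_training_loop, Rollout) are hypothetical placeholders for illustration, not the authors' released code or API.

```python
# Hypothetical sketch of one GRPO -> Rollout-Aware Verifier -> DPO iteration,
# based only on the loop structure described in the abstract.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    question: str
    chain_of_thought: str
    answer: str
    reward: float  # outcome-based reward used by GRPO


def grpo_rollout_and_update(policy, batch) -> List[Rollout]:
    """Assumed stage 1: sample grouped rollouts per question and apply an
    outcome-based GRPO update (the 'expansive search' phase)."""
    raise NotImplementedError  # placeholder; details are not in the abstract


def verifier_judge(rollouts: List[Rollout]) -> List[Tuple[Rollout, Rollout]]:
    """Assumed stage 2: a small LLM judge assesses the reasoning logic of each
    rollout and pairs reflective, contextually consistent CoTs (chosen) with
    weaker ones (rejected), yielding contrastive preference data."""
    raise NotImplementedError  # placeholder for the Rollout-Aware Verifier


def dpo_update(policy, preference_pairs: List[Tuple[Rollout, Rollout]]) -> None:
    """Assumed stage 3: Direct Preference Optimization on the curated pairs
    (the abstract reports this stage is ~7x faster than GRPO)."""
    raise NotImplementedError  # placeholder for the targeted DPO phase


def veripo_training_loop(policy, data_loader, num_iterations: int = 1) -> None:
    """Repeat the GRPO-Verifier-DPO cycle for the requested number of iterations."""
    for _ in range(num_iterations):
        for batch in data_loader:
            rollouts = grpo_rollout_and_update(policy, batch)
            preference_pairs = verifier_judge(rollouts)
            dpo_update(policy, preference_pairs)
```

The sketch is only meant to make the division of labor concrete: GRPO supplies broad exploration over rollouts, the verifier filters them into preference pairs, and DPO performs the cheaper, targeted update on that curated data.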