TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

Abstract
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment of on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 introduces a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1
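To make the sampling idea concrete, the sketch below illustrates one plausible reading of the abstract: a GRPO-style group advantage computation that mixes on-policy rollout rewards with a single off-policy reward derived from the ground-truth annotation, and applies an asymmetric, saturating transform to that reward before group normalization. This is a minimal illustration under our own assumptions, not the authors' implementation; the function names (`soft_shape`, `mixed_group_advantages`) and the specific tanh-based transform are hypothetical.

```python
# Hypothetical sketch of hybrid on-/off-policy advantage computation.
# Names and the exact asymmetric transform are illustrative assumptions,
# not the TempSamp-R1 implementation.
import numpy as np

def soft_shape(reward: float, group_mean: float, tau: float = 0.5) -> float:
    """Asymmetric non-linear reshaping (assumed form): rewards above the
    on-policy group mean are compressed with a saturating tanh so the
    ground-truth sample does not dominate the advantage estimates."""
    gap = reward - group_mean
    if gap <= 0:
        return reward
    return group_mean + tau * np.tanh(gap / tau)

def mixed_group_advantages(on_policy_rewards, gt_reward, tau=0.5, eps=1e-6):
    """Combine G-1 on-policy rollout rewards with one off-policy reward from
    the ground-truth annotation, reshape it, then normalize within the group
    (GRPO-style: subtract the group mean, divide by the group std)."""
    rewards = np.asarray(list(on_policy_rewards), dtype=np.float64)
    shaped_gt = soft_shape(float(gt_reward), rewards.mean(), tau)
    group = np.append(rewards, shaped_gt)
    return (group - group.mean()) / (group.std() + eps)

# Example: sparse on-policy rewards plus a high-reward ground-truth sample.
print(mixed_group_advantages([0.1, 0.0, 0.3, 0.2], gt_reward=0.9))
```

The point of the reshaping step in this sketch is variance control: without it, a single near-perfect off-policy reward would produce one very large positive advantage and push all on-policy samples strongly negative, which is the instability the abstract attributes to naive reward-based updates.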