
Abstract

Weakly supervised violence detection refers to the technique of training models to identify violent segments in videos using only video-level labels. Among these approaches, multimodal violence detection, which integrates modalities such as audio and optical flow, holds great potential. Existing methods in this domain primarily focus on designing multimodal fusion models to address modality discrepancies. In contrast, we take a different approach: we leverage the inherent discrepancies across modalities in violence event representation to propose a novel multimodal semantic feature alignment method. This method sparsely maps the semantic features of local, transient, and less informative modalities (such as audio and optical flow) into the more informative RGB semantic feature space. Through an iterative process, the method identifies a suitable non-zero feature matching subspace and aligns the modality-specific event representations based on this subspace, enabling the full exploitation of information from all modalities during the subsequent modality fusion stage. Building on this, we design a new weakly supervised violence detection framework that consists of unimodal multiple-instance learning for extracting unimodal semantic features, multimodal alignment, multimodal fusion, and final detection. Experimental results on benchmark datasets demonstrate the effectiveness of our method, achieving an average precision (AP) of 86.07% on the XD-Violence dataset. Our code is available at https://github.com/xjpp2016/MAVD.
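
To make the weakly supervised setting concrete, the following is a minimal sketch of how multiple-instance learning can train on video-level labels alone. It is not taken from the released code: the top-k pooling rule, the function names `video_score` and `bce_loss`, the `k_ratio` parameter, and the snippet scores are all illustrative assumptions. The idea is that per-snippet violence scores are pooled into one video-level score, which is then compared against the video-level label.

```python
import math


def video_score(snippet_scores, k_ratio=0.25):
    """Pool per-snippet scores into one video-level score.

    Assumption: top-k mean pooling, a common choice in multiple-instance
    learning; the paper's exact pooling rule may differ.
    """
    if not snippet_scores:
        raise ValueError("snippet_scores must be non-empty")
    k = max(1, int(len(snippet_scores) * k_ratio))
    top_k = sorted(snippet_scores, reverse=True)[:k]
    return sum(top_k) / k


def bce_loss(score, label, eps=1e-7):
    """Binary cross-entropy between a video-level score and a 0/1 label."""
    p = min(max(score, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical per-snippet violence probabilities (not real data).
violent = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1, 0.2, 0.1]
normal = [0.1, 0.2, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1]

# Only the video-level labels (1 = violent, 0 = normal) enter the loss;
# no snippet-level annotation is used.
loss = bce_loss(video_score(violent), 1) + bce_loss(video_score(normal), 0)
```

Because the loss depends only on the highest-scoring snippets, gradients in a trained model would concentrate on the segments most likely to contain violence, which is what allows segment-level detection to emerge from video-level supervision.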
