Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection

Weakly supervised multimodal violence detection aims to learn a violence detection model by leveraging multiple modalities such as RGB, optical flow, and audio, while only video-level annotations are available. In the pursuit of effective multimodal violence detection (MVD), information redundancy, modality imbalance, and modality asynchrony are identified as three key challenges. In this work, we propose a new weakly supervised MVD method that explicitly addresses these challenges. Specifically, we introduce a multi-scale bottleneck transformer (MSBT) based fusion module that employs a reduced number of bottleneck tokens to gradually condense information and fuse each pair of modalities, and utilizes a bottleneck token-based weighting scheme to highlight more important fused features. Furthermore, we propose a temporal consistency contrast loss to semantically align pairwise fused features. Experiments on the largest-scale XD-Violence dataset demonstrate that the proposed method achieves state-of-the-art performance. Code is available at https://github.com/shengyangsun/MSBT.
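To make the bottleneck-token idea concrete, here is a minimal sketch of generic pairwise fusion through a small set of bottleneck tokens, in the spirit of the abstract's description: the tokens first condense one modality, then the other modality reads from them. All names, shapes, and the single-head attention are illustrative assumptions; the random tokens stand in for parameters that a real model (such as the proposed MSBT) would learn, and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head dot-product attention: queries gather info from keys_values."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores, axis=-1) @ keys_values   # (n_q, d)

def bottleneck_fuse(feat_a, feat_b, num_tokens=4, rng=None):
    """Fuse two modalities through a reduced number of bottleneck tokens:
    the tokens condense modality A, then modality B reads from the tokens,
    so all cross-modal information flows through the narrow bottleneck."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = feat_a.shape[-1]
    tokens = rng.standard_normal((num_tokens, d))   # learned in a real model
    tokens = attend(tokens, feat_a)                 # condense modality A
    return attend(feat_b, tokens)                   # inject into modality B

# Hypothetical snippet-level features for one video: 10 snippets, dim 16.
rng = np.random.default_rng(42)
rgb   = rng.standard_normal((10, 16))
audio = rng.standard_normal((10, 16))
fused = bottleneck_fuse(rgb, audio)
print(fused.shape)  # (10, 16)
```

Because only `num_tokens` vectors carry information between the two streams, the bottleneck forces the fusion to keep a compact summary of each modality, which is one way to counter the information redundancy the abstract identifies.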