Rethinking the Two-Stage Framework for Grounded Situation Recognition

Grounded Situation Recognition (GSR), i.e., recognizing the salient activity (or verb) category in an image (e.g., buying) and detecting all corresponding semantic roles (e.g., agent and goods), is an essential step towards "human-like" event understanding. Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage. However, there are obvious drawbacks in both stages: 1) The widely-used cross-entropy (XE) loss for object recognition is insufficient for verb classification due to the large intra-class variation and high inter-class similarity among daily activities. 2) All semantic roles are detected in an autoregressive manner, which fails to model the complex semantic relations between different roles. To this end, we propose a novel SituFormer for GSR, which consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM). CFVM is a two-step verb prediction model: a coarse-grained model trained with XE loss first proposes a set of verb candidates, and then a fine-grained model trained with triplet loss re-ranks these candidates using enhanced verb features (not only separable but also discriminative). TNM is a transformer-based semantic role detection model that detects all roles in parallel. Owing to the global relation modeling ability and flexibility of the transformer decoder, TNM can fully explore the statistical dependency of the roles. Extensive validation on the challenging SWiG benchmark shows that SituFormer achieves new state-of-the-art performance with significant gains under various metrics. Code is available at https://github.com/kellyiss/SituFormer.
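The coarse-to-fine verb scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (see the linked repository for that): the function names, the Euclidean distance metric, and the prototype-based re-ranking are all simplifying assumptions made here for clarity.

```python
import math

def topk_candidates(logits, k=3):
    # Coarse stage: an XE-trained classifier proposes the top-k verb candidates.
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Fine-stage training objective: pull the anchor image embedding toward the
    # correct verb's embedding and push it from an incorrect one by >= margin.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def rerank(image_emb, verb_protos, candidates):
    # Fine-stage inference: re-rank the coarse candidates by the distance from
    # the image embedding to each candidate verb's prototype embedding.
    return min(candidates, key=lambda c: euclidean(image_emb, verb_protos[c]))
```

The intuition is that XE alone yields separable but not necessarily discriminative features; restricting the fine, triplet-trained model to a small candidate set lets a metric-learning objective resolve the confusable verbs.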