Grounded Situation Recognition with Transformers

Grounded Situation Recognition (GSR) is the task of not only classifying a salient action (verb), but also predicting the entities (nouns) associated with its semantic roles and their locations in the given image. Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture. The attention mechanism of our model enables accurate verb classification by effectively capturing high-level semantic features of an image, and allows the model to flexibly handle the complicated, image-dependent relations between entities for improved noun classification and localization. Our model is the first Transformer architecture for GSR, and achieves the state of the art in every evaluation metric on the SWiG benchmark. Our code is available at https://github.com/jhcho99/gsrtr .
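
To make the task definition concrete, the following is a minimal sketch of what a single GSR prediction might look like as a data structure: one verb, a noun for each semantic role, and an optional bounding box per noun. The class names, the box convention (x_min, y_min, x_max, y_max), and the example verb, roles, and values are illustrative assumptions; they are not taken from the released code or from the SWiG annotations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedNoun:
    """An entity filling one semantic role, with an optional location."""
    noun: str
    box: Optional[Box] = None  # None if the entity has no box in the image


@dataclass
class GroundedSituation:
    """One GSR output: a salient verb plus a grounded noun per role."""
    verb: str
    roles: Dict[str, GroundedNoun] = field(default_factory=dict)

    def grounded_roles(self) -> List[str]:
        """Return the names of roles whose noun has a bounding box."""
        return [name for name, entity in self.roles.items()
                if entity.box is not None]


# Hypothetical prediction for an image of a person feeding a horse.
example = GroundedSituation(
    verb="feeding",
    roles={
        "agent": GroundedNoun("person", (10.0, 20.0, 110.0, 220.0)),
        "food": GroundedNoun("hay", None),
        "eater": GroundedNoun("horse", (120.0, 40.0, 300.0, 230.0)),
    },
)
```

In this sketch, verb classification fills `verb`, noun classification fills each `noun`, and localization fills each `box`, which mirrors the three outputs the task asks for.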