ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition

Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb and the semantic roles played by actors and objects. In this task, the same activity verb can describe a diverse set of situations, and the same actor or object category can play a diverse set of semantic roles depending on the situation depicted in the image. Hence, a situation recognition model needs to understand the context of the image and the visual-linguistic meaning of semantic roles. Therefore, we leverage the CLIP foundation model, which has learned the context of images via language descriptions. We show that deeper-and-wider multi-layer perceptron (MLP) blocks obtain noteworthy results for the situation recognition task by using CLIP image and text embedding features, even outperforming the state-of-the-art CoFormer, a Transformer-based model, thanks to the external implicit visual-linguistic knowledge encapsulated by CLIP and the expressive power of modern MLP block designs. Motivated by this, we design a cross-attention-based Transformer using CLIP visual tokens that models the relation between textual roles and visual entities. Our cross-attention-based Transformer, known as ClipSitu XTF, outperforms the existing state-of-the-art by a large margin of 14.1\% in top-1 accuracy on semantic role labelling (value) on the imSitu dataset. Similarly, ClipSitu XTF obtains state-of-the-art situation localization performance. We will make the code publicly available.
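To make the cross-attention idea concrete, the sketch below shows one way role-conditioned cross-attention over CLIP visual tokens could look: semantic-role text embeddings act as queries and CLIP image patch tokens serve as keys and values. All dimensions, layer choices, and the class name RoleVisualCrossAttention are illustrative assumptions for this sketch, not the authors' exact ClipSitu XTF configuration.

```python
import torch
import torch.nn as nn

class RoleVisualCrossAttention(nn.Module):
    """Hedged sketch of a role-to-visual cross-attention block in the spirit
    of ClipSitu XTF: role text embeddings attend over CLIP patch tokens.
    Hyperparameters here are assumptions, not the paper's settings."""

    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, role_text_emb: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # role_text_emb: (B, num_roles, D) CLIP text embeddings of verb-role prompts
        # visual_tokens: (B, num_patches, D) CLIP image patch tokens
        attn_out, _ = self.cross_attn(query=role_text_emb,
                                      key=visual_tokens,
                                      value=visual_tokens)
        x = self.norm1(role_text_emb + attn_out)   # residual + norm
        x = self.norm2(x + self.ffn(x))            # feed-forward refinement
        return x  # per-role features, later classified into nouns (values)


# Toy usage with random tensors standing in for CLIP outputs.
roles = torch.randn(2, 6, 512)     # e.g. up to 6 semantic roles per verb in imSitu
patches = torch.randn(2, 50, 512)  # e.g. 49 patch tokens + CLS from a ViT backbone
block = RoleVisualCrossAttention()
out = block(roles, patches)
print(out.shape)  # torch.Size([2, 6, 512])
```

In such a design, each semantic role queries the image patches directly, so the per-role features are grounded in the visual regions most relevant to that role rather than in a single pooled image embedding.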