
LoCATe-GAT: Modeling Multi-Scale Local Context and Action Relationships for Zero-Shot Action Recognition

Arijit Sur, Divyam Singal, Sandipan Sarma
Abstract

The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient “zero-shot” scene understanding. Pairing such models with transformers for temporal modeling has been rewarding for zero-shot action recognition (ZSAR). However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a ZSAR framework called LoCATe-GAT, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT). Specifically, image and text encodings extracted from a pretrained I-VL model are used as inputs for LoCATe-GAT. Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multi-scale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on four widely used benchmarks (UCF101, HMDB51, ActivityNet, and Kinetics) show that we achieve state-of-the-art results. Specifically, we obtain relative gains of 3.8% and 4.8% on UCF101 and HMDB51 in the conventional ZSAR setting, and a 16.6% gain on UCF101 in the generalized setting. On the large-scale ActivityNet and Kinetics datasets, our method achieves relative gains of 31.8% and 27.9%, respectively, over previous methods. Additionally, we gain 25.3% on UCF101 and 18.4% on HMDB51 under the recent “TruZe” evaluation protocol.
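To make the described architecture concrete, here is a minimal PyTorch sketch of the first idea: enriching per-frame embeddings from a pretrained I-VL encoder (e.g., CLIP) with multi-scale local temporal context via parallel dilated 1-D convolutions, followed by a transformer for temporal modeling. All module names, dimensions, and dilation rates are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleLocalContext(nn.Module):
    """Parallel 1-D convolutions at increasing dilation rates, so each
    frame embedding is enriched with local context at several temporal
    scales (a LoCATe-style block; the dilation rates are assumed)."""

    def __init__(self, dim: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
             for d in dilations]
        )
        self.proj = nn.Linear(dim * len(dilations), dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame embeddings from the I-VL image encoder
        h = x.transpose(1, 2)                       # (batch, dim, frames)
        ctx = torch.cat([b(h) for b in self.branches], dim=1)
        return self.proj(ctx.transpose(1, 2))       # (batch, frames, dim)


class TemporalEncoder(nn.Module):
    """Local-context aggregation followed by a transformer encoder;
    mean-pools frames into a single video embedding."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        self.local = MultiScaleLocalContext(dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        h = self.temporal(self.local(frames))       # (batch, frames, dim)
        return h.mean(dim=1)                        # (batch, dim)
```

The second idea, modeling semantic relationships between action classes, can be sketched as a graph-attention layer over the text embeddings of the class names. The adjacency matrix below is a hypothetical semantic graph (e.g., thresholded text-embedding similarity); the abstract does not specify how edges are constructed.

```python
class GraphAttentionLayer(nn.Module):
    """Single-head graph attention: each class embedding is refined by
    attending over its semantic neighbors."""

    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, z: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # z: (classes, dim) text embeddings; adj: (classes, classes) 0/1 edges
        h = self.w(z)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )
        scores = F.leaky_relu(self.a(pairs).squeeze(-1))     # (n, n)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        att = torch.softmax(scores, dim=-1)
        return F.elu(att @ h)                                # refined embeddings


# Zero-shot scoring: cosine similarity between video embeddings and the
# GAT-refined class embeddings; the highest score gives the prediction.
# Shapes and the random graph below are placeholders for illustration.
videos = torch.randn(4, 16, 512)            # 4 clips, 16 frames each
classes = torch.randn(10, 512)              # 10 action-class text embeddings
adj = (torch.rand(10, 10) > 0.5).float()
adj.fill_diagonal_(1)                       # self-loops keep softmax defined

video_emb = F.normalize(TemporalEncoder()(videos), dim=-1)
class_emb = F.normalize(GraphAttentionLayer(512)(classes, adj), dim=-1)
scores = video_emb @ class_emb.t()          # (4, 10) compatibility scores
```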
