
Temporal and cross-modal attention for audio-visual zero-shot learning

Mercea, Otniel-Bogdan; Hummel, Thomas; Koepke, A. Sophia; Akata, Zeynep
Abstract

Audio-visual generalised zero-shot learning for video classification requires understanding the relations between audio and visual information in order to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in videos can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-attention Framework (TCaF) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time, rather than on self-attention within each modality, boosts performance significantly. We show that our proposed framework, which ingests temporal features, yields state-of-the-art performance on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning. Code for reproducing all results is available at https://github.com/ExplainableML/TCAF-GZSL.
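The key architectural idea in the abstract, restricting attention to cross-modal correspondences instead of within-modality self-attention, can be illustrated with a masked attention layer. The sketch below is a minimal, assumption-laden illustration written in PyTorch (the framework is an assumption; the linked repository's code may differ), not the authors' exact TCaF layer: the class name `CrossModalAttentionBlock`, the feature dimension, and the head count are all hypothetical choices for the example.

```python
import torch
import torch.nn as nn

class CrossModalAttentionBlock(nn.Module):
    """Attention over concatenated audio and visual token sequences where
    each token may only attend to tokens of the *other* modality.

    A simplified sketch of cross-modal attention, not the TCaF architecture.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (B, T_a, dim), visual: (B, T_v, dim) -- temporally ordered
        # features, e.g. from pre-trained audio / visual backbones.
        tokens = torch.cat([audio, visual], dim=1)  # (B, T_a + T_v, dim)
        t_a, n = audio.size(1), tokens.size(1)

        # Boolean mask: True marks pairs that are NOT allowed to attend.
        # Blocking the within-modality sub-blocks forces every query to
        # gather information exclusively from the other modality.
        mask = torch.zeros(n, n, dtype=torch.bool, device=tokens.device)
        mask[:t_a, :t_a] = True   # audio -> audio blocked
        mask[t_a:, t_a:] = True   # visual -> visual blocked

        out, _ = self.attn(tokens, tokens, tokens, attn_mask=mask)
        tokens = self.norm(tokens + out)  # residual connection + layer norm
        return tokens[:, :t_a], tokens[:, t_a:]

# Usage with dummy temporally aligned features (shapes are illustrative):
block = CrossModalAttentionBlock()
a = torch.randn(2, 10, 512)   # e.g. 10 audio segments per clip
v = torch.randn(2, 12, 512)   # e.g. 12 video frames per clip
a_out, v_out = block(a, v)
```

Because the mask only removes within-modality key positions, every query still has valid keys in the other modality, so the softmax remains well defined; stacking such blocks lets temporal audio-visual correspondences propagate without ever routing information through same-modality self-attention.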