
Temporal and cross-modal attention for audio-visual zero-shot learning

Mercea, Otniel-Bogdan; Hummel, Thomas; Koepke, A. Sophia; Akata, Zeynep
Abstract

Audio-visual generalised zero-shot learning for video classification requires understanding the relations between audio and visual information in order to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in videos can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-attention Framework (TCaF) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time, rather than on self-attention within each modality, boosts performance significantly. We show that our proposed framework, which ingests temporal features, yields state-of-the-art performance on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning. Code for reproducing all results is available at https://github.com/ExplainableML/TCAF-GZSL.
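The key architectural idea in the abstract, restricting attention to cross-modal correspondences instead of within-modality self-attention, can be illustrated with a masked attention layer. The sketch below is a minimal, assumption-laden illustration written in PyTorch (the framework is an assumption; the linked repository's code may differ), not the authors' exact TCaF layer: the class name `CrossModalAttentionBlock`, the feature dimension, and the head count are all hypothetical choices for the example.

```python
import torch
import torch.nn as nn

class CrossModalAttentionBlock(nn.Module):
    """Attention over concatenated audio and visual token sequences where
    each token may only attend to tokens of the *other* modality.

    A simplified sketch of cross-modal attention, not the TCaF architecture.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (B, T_a, dim), visual: (B, T_v, dim) -- temporally ordered
        # features, e.g. from pre-trained audio / visual backbones.
        tokens = torch.cat([audio, visual], dim=1)  # (B, T_a + T_v, dim)
        t_a, n = audio.size(1), tokens.size(1)

        # Boolean mask: True marks pairs that are NOT allowed to attend.
        # Blocking the within-modality sub-blocks forces every query to
        # gather information exclusively from the other modality.
        mask = torch.zeros(n, n, dtype=torch.bool, device=tokens.device)
        mask[:t_a, :t_a] = True   # audio -> audio blocked
        mask[t_a:, t_a:] = True   # visual -> visual blocked

        out, _ = self.attn(tokens, tokens, tokens, attn_mask=mask)
        tokens = self.norm(tokens + out)  # residual connection + layer norm
        return tokens[:, :t_a], tokens[:, t_a:]

# Usage with dummy temporally aligned features (shapes are illustrative):
block = CrossModalAttentionBlock()
a = torch.randn(2, 10, 512)   # e.g. 10 audio segments per clip
v = torch.randn(2, 12, 512)   # e.g. 12 video frames per clip
a_out, v_out = block(a, v)
```

Because the mask only removes within-modality key positions, every query still has valid keys in the other modality, so the softmax remains well defined; stacking such blocks lets temporal audio-visual correspondences propagate without ever routing information through same-modality self-attention.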