Hyperbolic Audio-visual Zero-shot Learning

Audio-visual zero-shot learning aims to classify samples consisting of a pair of corresponding audio and video sequences from classes that are not present during training. An analysis of the audio-visual data reveals a large degree of hyperbolicity, indicating the potential benefit of using a hyperbolic transformation to achieve curvature-aware geometric learning, with the aim of exploring more complex hierarchical data structures for this task. The proposed approach employs a novel loss function that incorporates cross-modality alignment between video and audio features in the hyperbolic space. Additionally, we explore the use of multiple adaptive curvatures for hyperbolic projections. The experimental results on this challenging task demonstrate that our proposed hyperbolic approach for zero-shot learning outperforms the state-of-the-art (SOTA) method on three datasets, VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL, achieving harmonic mean (HM) improvements of around 3.0%, 7.0%, and 5.3%, respectively.
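To make the notion of a hyperbolic projection with a tunable curvature concrete, the following minimal sketch maps Euclidean feature vectors onto a Poincaré ball and measures geodesic distance there. It is an illustration under stated assumptions, not the method of this work: the Poincaré-ball model, the function names `expmap0`, `mobius_add`, and `poincare_dist`, and the positive curvature parameter `c` (the ball has curvature -c) are all choices made for this example, and the cross-modal loss itself is not reproduced.

```python
import math

def expmap0(v, c):
    """Exponential map at the origin: Euclidean vector -> Poincare ball (curvature -c)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def mobius_add(x, y, c):
    """Mobius addition, the hyperbolic analogue of vector addition on the ball."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / denom
            for a, b in zip(x, y)]

def poincare_dist(x, y, c):
    """Geodesic distance between two points on the Poincare ball."""
    diff = mobius_add([-a for a in x], y, c)
    n = math.sqrt(sum(d * d for d in diff))
    # Clamp the argument so atanh stays finite for points near the boundary.
    return 2 / math.sqrt(c) * math.atanh(min(math.sqrt(c) * n, 1 - 1e-15))

# Hypothetical audio and video feature vectors, for illustration only.
audio_feat = [0.3, 0.4]
video_feat = [0.5, -0.2]
c = 1.0  # assumed curvature parameter; an adaptive variant would learn this value
a_h = expmap0(audio_feat, c)
v_h = expmap0(video_feat, c)
print(poincare_dist(a_h, v_h, c))
```

A cross-modal alignment term could, for instance, penalize the distance between the projected audio and video features of the same sample; treating `c` as a learnable parameter, possibly one per projection, is one way to read "multiple adaptive curvatures".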