
ViTGaze: Gaze Following with Interaction Features in Vision Transformers

Song, Yuehao; Wang, Xinggang; Yao, Jingfeng; Liu, Wenyu; Zhang, Jinglin; Xu, Xiangmin
Abstract

Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often adopt a two-stage framework in which multi-modality information is extracted in the initial stage for gaze target prediction. Consequently, the efficacy of these methods highly depends on the precision of the preceding modality extraction. Other approaches use a single-modality pipeline with complex decoders, which increases the network's computational load. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it builds the framework mainly on powerful encoders (the decoder accounts for less than 1% of the parameters). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Leveraging this presumption, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that a ViT with self-supervised pre-training has an enhanced ability to extract correlation information. Extensive experiments have been conducted to demonstrate the performance of the proposed method. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (a 3.4% improvement in the area under curve (AUC) score and a 5.1% improvement in average precision (AP)) and highly comparable performance against multi-modality methods with 59% fewer parameters.
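To make the core idea more concrete, the Python sketch below illustrates (under stated assumptions, not as the authors' implementation) how per-head self-attention maps from a pre-trained plain ViT could be pooled with a person's head mask as 2D spatial guidance and decoded into a gaze heatmap by a very small decoder. The backbone method get_last_selfattention, the mask-pooling scheme, and the tiny convolutional decoder are illustrative assumptions; the paper's 4D interaction encoder and spatial guidance module are more elaborate.

# Illustrative sketch of the ViTGaze idea (hypothetical module/method names):
# reuse ViT self-attention as human-scene interaction features, guided by the
# person's head location, and decode a gaze heatmap with a tiny decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GazeFollowSketch(nn.Module):
    def __init__(self, vit_backbone, num_heads=6, hidden=64):
        super().__init__()
        self.vit = vit_backbone  # pre-trained plain ViT (e.g. self-supervised, DINO-style)
        # Tiny decoder: the encoder dominates the parameter count, as in the paper.
        self.decoder = nn.Sequential(
            nn.Conv2d(num_heads, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),
        )

    def forward(self, image, head_mask):
        # image:     (B, 3, H, W) scene image containing the person of interest
        # head_mask: (B, 1, H, W) binary mask of the person's head (2D spatial guidance)
        # Assumption: the backbone exposes per-head self-attention over the N
        # patch tokens (CLS stripped), shaped (B, num_heads, N, N).
        attn = self.vit.get_last_selfattention(image)
        B, h, N, _ = attn.shape
        g = int(N ** 0.5)  # patch grid size, assuming a square token grid

        # Pool the head mask onto the patch grid and use it to average the
        # attention rows of head tokens: "where do the head tokens look?"
        guidance = F.adaptive_avg_pool2d(head_mask, (g, g)).flatten(2)      # (B, 1, N)
        guidance = guidance / (guidance.sum(dim=-1, keepdim=True) + 1e-6)
        interaction = torch.einsum('bqn,bhnm->bhm', guidance, attn)         # (B, heads, N)
        interaction = interaction.view(B, h, g, g)                          # per-head maps

        # Decode a gaze heatmap at patch resolution, then upsample to input size.
        heatmap = self.decoder(interaction)
        return F.interpolate(heatmap, size=image.shape[-2:],
                             mode='bilinear', align_corners=False)

In this sketch, the only trainable component beyond the (optionally fine-tuned) ViT is the two-layer convolutional decoder, which mirrors the paper's emphasis on keeping decoder parameters negligible relative to the encoder.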
