
OmniVec: Learning robust representations with cross modal sharing

Srivastava, Siddharth; Sharma, Gaurav
Abstract

The majority of research in learning-based methods has focused on designing and training networks for specific tasks. However, many learning-based tasks across modalities share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction, learning multiple tasks in multiple modalities with a unified architecture. The proposed network is composed of task-specific encoders, a common trunk in the middle, followed by task-specific prediction heads. We first pre-train it with self-supervised masked training, followed by sequential training on the different tasks. We train the network on all major modalities, e.g. visual, audio, text, and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, which allows us to achieve state-of-the-art results on most of the benchmarks. We also show generalization of the trained network to cross-modal tasks as well as unseen datasets and tasks.
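The encoder/trunk/head layout described in the abstract can be sketched as follows. This is a minimal illustrative mock-up, not the paper's actual architecture: the layer types, dimensions, and the use of plain NumPy matrices in place of trained networks are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Random weight matrix standing in for a trained layer (assumption:
    # real encoders/trunk/heads would be learned deep networks).
    return rng.standard_normal((in_dim, out_dim)) * 0.1

class OmniVecSketch:
    """Modality-specific encoders -> shared trunk -> task-specific heads."""

    def __init__(self, input_dims, num_classes, hidden=64):
        # One encoder per modality (e.g. "visual", "audio", "text", "3d").
        self.encoders = {m: linear(d, hidden) for m, d in input_dims.items()}
        # Common trunk shared across all modalities and tasks.
        self.trunk = linear(hidden, hidden)
        # One prediction head per task.
        self.heads = {t: linear(hidden, c) for t, c in num_classes.items()}

    def forward(self, x, modality, task):
        z = x @ self.encoders[modality]      # modality-specific encoding
        z = np.maximum(z @ self.trunk, 0.0)  # shared trunk (ReLU)
        return z @ self.heads[task]          # task-specific prediction

# Hypothetical modalities and tasks for illustration only.
model = OmniVecSketch(
    input_dims={"visual": 128, "audio": 40},
    num_classes={"image_cls": 10, "sound_cls": 5},
)
out = model.forward(np.ones((2, 128)), modality="visual", task="image_cls")
print(out.shape)  # (2, 10)
```

Because the trunk is shared, gradients from every task would update the same middle parameters, which is the mechanism by which cross-modal information sharing occurs during the sequential training the abstract describes.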
