
Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer

Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Bo Ren, Shu-Tao Xia
Abstract

Real-world recognition systems often encounter the challenge of unseen labels. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge via a pre-trained textual label embedding (e.g., GloVe). However, such methods only exploit single-modal knowledge from a language model, while ignoring the rich semantic information inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV) methods succeed in exploiting such information from image-text pairs in object detection and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits the multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is employed to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To further enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets. The source code is available at https://github.com/sunanhe/MKT.
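The abstract describes three ingredients: label embeddings from a VLP text encoder (refined by prompt tuning), a distillation term that keeps image embeddings consistent with the frozen VLP teacher, and a two-stream module fusing global and local image features before label matching. Below is a minimal PyTorch sketch of how such a scoring pipeline could fit together. The module names, dimensions, pooling choice, and the L1 form of the distillation loss are illustrative assumptions, not the authors' exact implementation; see https://github.com/sunanhe/MKT for the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamHead(nn.Module):
    """Hypothetical two-stream module: fuses a global image embedding
    with locally pooled patch features before matching label embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.local_proj = nn.Linear(dim, dim)   # local stream: patch tokens
        self.global_proj = nn.Linear(dim, dim)  # global stream: [CLS] token

    def forward(self, patch_tokens, cls_token):
        # patch_tokens: (B, N, D) local features; cls_token: (B, D) global feature
        local_feat = self.local_proj(patch_tokens).max(dim=1).values  # strongest regions
        global_feat = self.global_proj(cls_token)
        return F.normalize(local_feat + global_feat, dim=-1)

def multi_label_scores(image_feat, label_embeds, temperature=0.07):
    """Cosine similarity between the fused image feature and label embeddings
    produced by a (prompt-tuned) VLP text encoder; unseen labels are handled
    simply by embedding their names with the same text encoder."""
    label_embeds = F.normalize(label_embeds, dim=-1)
    return image_feat @ label_embeds.t() / temperature  # (B, num_labels)

def distill_loss(student_img_feat, teacher_img_feat):
    """Distillation term keeping the student's image embedding consistent
    with the frozen VLP teacher embedding (L1 form assumed here)."""
    return F.l1_loss(student_img_feat, F.normalize(teacher_img_feat, dim=-1))

# Toy usage with random tensors standing in for encoder outputs.
head = TwoStreamHead(dim=512)
patches, cls = torch.randn(2, 196, 512), torch.randn(2, 512)
labels = torch.randn(80, 512)  # text-encoder embeddings for 80 label names
scores = multi_label_scores(head(patches, cls), labels)
print(scores.shape)  # torch.Size([2, 80])
```

Because label scores come from similarity against text embeddings rather than a fixed classifier head, new label names can be added at inference time by embedding them with the same text encoder, which is what makes the setup open-vocabulary.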
