ML-Decoder: Scalable and Versatile Classification Head

In this paper, we introduce ML-Decoder, a new attention-based classification head. ML-Decoder predicts the existence of class labels via queries, and enables better utilization of spatial data compared to global average pooling. By redesigning the decoder architecture, and using a novel group-decoding scheme, ML-Decoder is highly efficient, and can scale well to thousands of classes. Compared to using a larger backbone, ML-Decoder consistently provides a better speed-accuracy trade-off. ML-Decoder is also versatile: it can be used as a drop-in replacement for various classification heads, and generalizes to unseen classes when operated with word queries. Novel query augmentations further improve its generalization ability. Using ML-Decoder, we achieve state-of-the-art results on several classification tasks: on MS-COCO multi-label, we reach 91.4% mAP; on NUS-WIDE zero-shot, we reach 31.1% ZSL mAP; and on ImageNet single-label, we reach a new top score of 80.7% with a vanilla ResNet50 backbone, without extra data or distillation. Public code is available at: https://github.com/Alibaba-MIIL/ML_Decoder
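
To make the contrast between query-based pooling and global average pooling concrete, the following is a minimal sketch in plain Python. It is not the released implementation: the abstract does not give the architectural details, so the sketch assumes single-head dot-product attention with no learned projections, and it assumes that group decoding means each group query produces one pooled vector that a small per-group linear layer expands into logits for a block of classes. All names (`query_attention_pool`, `group_decode`, `group_weights`) and all sizes are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_average_pool(tokens):
    """Baseline head input: average all spatial tokens into one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def query_attention_pool(query, tokens):
    """One query attends over the spatial tokens (assumed: scaled dot-product
    attention, single head, no learned key/value projections)."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, t) * scale for t in tokens])
    dim = len(query)
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]


def group_decode(queries, tokens, group_weights, num_classes):
    """Assumed group-decoding scheme: each group query yields one pooled
    vector, which a per-group linear layer maps to a block of class logits."""
    logits = []
    for query, weight_rows in zip(queries, group_weights):
        pooled = query_attention_pool(query, tokens)
        logits.extend(dot(row, pooled) for row in weight_rows)
    return logits[:num_classes]  # drop padding logits from the last group


# Hypothetical sizes: a 7x7 feature map, 10 classes decoded by 4 group queries.
rng = random.Random(0)
dim, num_tokens, num_classes, num_groups = 8, 49, 10, 4
group_size = math.ceil(num_classes / num_groups)

tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_groups)]
group_weights = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(group_size)]
    for _ in range(num_groups)
]

logits = group_decode(queries, tokens, group_weights, num_classes)

# A zero query gives uniform attention, which reduces to global average pooling.
uniform_pooled = query_attention_pool([0.0] * dim, tokens)
gap_pooled = global_average_pool(tokens)
```

Under these assumptions the number of attention queries is `num_groups` rather than `num_classes`, which is one way a head of this kind could keep its cost manageable as the class count grows. The zero-query case shows that global average pooling is the special case in which every spatial location receives equal weight.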