ATST: Audio Representation Learning with Teacher-Student Transformer

Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data and then transfers that knowledge to a specific problem with a limited amount of labeled data. SSL has achieved promising results in various domains. This work addresses the problem of segment-level general audio SSL and proposes a new transformer-based teacher-student SSL model, named ATST. A transformer encoder is developed on a recently emerged teacher-student baseline scheme, which largely improves the modeling capability of pre-training. In addition, a new strategy for positive pair creation is designed to fully leverage the capability of the transformer. Extensive experiments have been conducted, and the proposed model achieves new state-of-the-art results on almost all of the downstream tasks.