Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks

Xian Li, Nian Shao, Xiaofei Li*

Abstract

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, which generally include clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level downstream tasks. In order to tackle both clip-level and frame-level tasks, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses segment-wise data augmentations, and ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most of the clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, the performance can be further improved by combining the two models through knowledge distillation. Our code is available online.
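The paper itself gives the full training details; as a rough, generic illustration of a teacher-student scheme of this kind (not the authors' exact method), the teacher's parameters are typically maintained as an exponential moving average (EMA) of the student's. The function name `ema_update` and the momentum value below are illustrative assumptions:

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Generic EMA update used in many teacher-student SSL methods:
    each teacher parameter drifts slowly toward the student's."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]

# Toy example with a single "parameter" vector per network.
student = [np.array([1.0, 2.0])]
teacher = [np.array([0.0, 0.0])]

teacher = ema_update(teacher, student, momentum=0.9)
print(teacher[0])  # -> [0.1 0.2], i.e. teacher moved 10% toward the student
```

The teacher receives no gradients; only the student is trained, and the slowly-updated teacher provides stable targets for the student's predictions on differently augmented views.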

