Mixer is more than just a model

Recently, MLP structures have regained popularity, with MLP-Mixer standing out as a prominent example. In the field of computer vision, MLP-Mixer is noted for its ability to extract information from both channel and token perspectives, effectively acting as a fusion of channel and token information. Indeed, Mixer represents a paradigm for information extraction that amalgamates channel and token information. The essence of Mixer lies in its ability to blend information from diverse perspectives, epitomizing the true concept of "mixing" in the realm of neural network architectures. Beyond channel and token considerations, it is possible to create more tailored mixers from various perspectives to better suit specific task requirements. This study focuses on the domain of audio recognition, introducing a novel model named Audio Spectrogram Mixer with Roll-Time and Hermit FFT (ASM-RH) that incorporates insights from both time and frequency domains. Experimental results demonstrate that ASM-RH is particularly well suited for audio data and yields promising outcomes across multiple classification tasks. The models and optimal weight files will be published.