AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors

Facial Action Units (AUs) are a vital concept in the realm of affective computing, and AU detection has long been an active research topic. Existing methods suffer from overfitting due to the use of a large number of learnable parameters on scarce AU-annotated datasets, or rely heavily on substantial additional relevant data. Parameter-Efficient Transfer Learning (PETL) offers a promising paradigm for addressing these challenges, yet its existing methods are not designed around AU characteristics. We therefore investigate the PETL paradigm for AU detection, introducing AUFormer and proposing a novel Mixture-of-Knowledge Expert (MoKE) collaboration mechanism. An individual MoKE, specific to a particular AU and containing minimal learnable parameters, first integrates personalized multi-scale and correlation knowledge. The MoKE then collaborates with the other MoKEs in its expert group to obtain aggregated information, which is injected into a frozen Vision Transformer (ViT) to achieve parameter-efficient AU detection. Additionally, we design a Margin-truncated Difficulty-aware Weighted Asymmetric Loss (MDWA-Loss), which encourages the model to focus more on activated AUs, differentiates the difficulty of unactivated AUs, and discards potentially mislabeled samples. Extensive experiments from various perspectives, including within-domain, cross-domain, data-efficiency, and micro-expression-domain evaluations, demonstrate AUFormer's state-of-the-art performance and robust generalization without relying on additional relevant data. The code for AUFormer is available at https://github.com/yuankaishen2001/AUFormer.
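
As a rough illustration of the loss design summarized above, the sketch below implements a margin-truncated, difficulty-aware weighted asymmetric loss for multi-label AU detection. It is not the paper's exact formulation: the class name MDWALossSketch, the hyperparameters (gamma_pos, gamma_neg, margin, mislabel_thresh), the per-AU weighting, and the rule for discarding suspected mislabeled negatives are all assumptions for illustration, loosely following the asymmetric-loss family that the MDWA-Loss name suggests.

import torch
import torch.nn as nn

class MDWALossSketch(nn.Module):
    """Illustrative sketch (not the authors' exact MDWA-Loss)."""

    def __init__(self, gamma_pos=1.0, gamma_neg=4.0, margin=0.05,
                 mislabel_thresh=0.95, au_weights=None, eps=1e-8):
        super().__init__()
        self.gamma_pos = gamma_pos          # focusing exponent for activated AUs
        self.gamma_neg = gamma_neg          # larger exponent down-weights easy unactivated AUs
        self.margin = margin                # probability shift used to truncate negatives
        self.mislabel_thresh = mislabel_thresh  # assumed cutoff for suspected mislabeled negatives
        self.au_weights = au_weights        # optional per-AU weights (e.g. inverse frequency)
        self.eps = eps

    def forward(self, logits, targets):
        # logits, targets: (batch, num_aus); targets are float 0/1 AU labels
        p = torch.sigmoid(logits)

        # Activated AUs: focal-style term keeps attention on positive labels.
        loss_pos = targets * (1 - p).pow(self.gamma_pos) * torch.log(p.clamp(min=self.eps))

        # Unactivated AUs: shift probabilities by a margin and truncate at zero,
        # so negatives the model already rejects (p <= margin) contribute no loss,
        # while gamma_neg separates easy from hard remaining negatives.
        p_shift = (p - self.margin).clamp(min=0)
        loss_neg = (1 - targets) * p_shift.pow(self.gamma_neg) \
                   * torch.log((1 - p_shift).clamp(min=self.eps))

        # Crude stand-in for discarding potentially mislabeled samples: ignore
        # negatives predicted as activated with very high confidence (assumed rule).
        loss_neg = loss_neg * (p < self.mislabel_thresh).float()

        loss = -(loss_pos + loss_neg)
        if self.au_weights is not None:
            loss = loss * self.au_weights   # per-AU weighting (assumed design choice)
        return loss.mean()

# Example usage (hypothetical shapes): 12 AUs, float 0/1 labels.
# criterion = MDWALossSketch(au_weights=torch.ones(12))
# loss = criterion(model_logits, au_labels.float())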