SegFace: Face Segmentation of Long-Tail Classes

Face parsing refers to the semantic segmentation of human faces into key facial regions such as eyes, nose, hair, etc. It serves as a prerequisite for various advanced applications, including face editing, face swapping, and facial makeup, which often require segmentation masks for classes like eyeglasses, hats, earrings, and necklaces. These infrequently occurring classes are called long-tail classes, and they are overshadowed by the more frequently occurring classes known as head classes. Existing methods, primarily CNN-based, tend to be dominated by head classes during training, resulting in suboptimal representations for long-tail classes. Previous works have largely overlooked the poor segmentation performance on long-tail classes. To address this issue, we propose SegFace, a simple and efficient approach that uses a lightweight transformer-based model with learnable class-specific tokens. The transformer decoder leverages these class-specific tokens, allowing each token to focus on its corresponding class and thereby enabling independent modeling of each class. The proposed approach improves the performance of long-tail classes, thereby boosting overall performance. To the best of our knowledge, SegFace is the first work to employ transformer models for face parsing. Moreover, our approach can be adapted for low-compute edge devices, achieving 95.96 FPS. We conduct extensive experiments demonstrating that SegFace significantly outperforms previous state-of-the-art models, achieving a mean F1 score of 88.96 (+2.82) on the CelebAMask-HQ dataset and 93.03 (+0.65) on the LaPa dataset. Code: https://github.com/Kartik-3004/SegFace
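To make the class-specific-token idea concrete, the following is a minimal, dependency-free sketch (not the actual SegFace implementation; all names and the toy dimensions are illustrative). Each of K learnable class tokens cross-attends over flattened pixel features, and each refined token then produces its class's per-pixel logits via dot products with those features, so every class is modeled by its own token:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(tokens, feats):
    # Single-head cross-attention without learned projections (illustrative only):
    # each class token is refined as an attention-weighted sum of pixel features.
    d = len(feats[0])
    refined = []
    for t in tokens:
        scores = softmax([sum(ti * fi for ti, fi in zip(t, f)) / math.sqrt(d)
                          for f in feats])
        refined.append([sum(a * f[j] for a, f in zip(scores, feats))
                        for j in range(d)])
    return refined

def class_masks(tokens, feats):
    # Per-pixel logits for each class: dot product of its refined token
    # with every pixel feature. A real model would reshape these back to H x W.
    return [[sum(ti * fi for ti, fi in zip(t, f)) for f in feats]
            for t in tokens]

# Toy example: 3 class tokens, 4 "pixels", feature dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
feats = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]]
refined = cross_attend(tokens, feats)
masks = class_masks(refined, feats)   # 3 classes x 4 pixel logits
```

Because each token attends and predicts independently, a rare class such as earrings keeps its own representation rather than being averaged into the features of dominant head classes.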