Disentangled Variational Representation for Heterogeneous Face Recognition

Visible (VIS) to near-infrared (NIR) face matching is a challenging problem due to the significant discrepancy between the two domains and a lack of sufficient data for training cross-modal matching algorithms. Existing approaches attempt to tackle this problem by synthesizing visible faces from NIR faces, extracting domain-invariant features from these modalities, or projecting heterogeneous data onto a common latent space for cross-modal matching. In this paper, we take a different approach in which we make use of the Disentangled Variational Representation (DVR) for cross-modal matching. First, we model a face representation with intrinsic identity information and its within-person variations. By exploring the disentangled latent variable space, a variational lower bound is employed to optimize the approximate posterior for NIR and VIS representations. Second, aiming at a more compact and discriminative disentangled latent space, we impose a minimization of the identity-information discrepancy for the same subject and a relaxed correlation alignment constraint between the NIR and VIS modality variations. An alternating optimization scheme is proposed for the disentangled variational representation part and the heterogeneous face recognition network part. The mutual promotion between these two parts effectively reduces the NIR and VIS domain discrepancy and alleviates over-fitting. Extensive experiments on three challenging NIR-VIS heterogeneous face recognition databases demonstrate that the proposed method achieves significant improvements over state-of-the-art methods.
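
The abstract names three ingredients of the objective (a variational bound on the latent posterior, an identity term for the same subject, and a relaxed correlation alignment between NIR and VIS variations) without giving their formulas. The following is a minimal sketch of how such terms are commonly written, not the formulation used in the paper. It assumes a diagonal Gaussian posterior with a standard normal prior, a squared Euclidean identity distance, and a covariance gap that is ignored below a margin; the function names and the `margin` parameter are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector.

    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def squared_distance(a, b):
    """Squared Euclidean distance, used here as an assumed identity term
    between the NIR and VIS identity vectors of the same subject."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def covariance(rows):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
             for j in range(d)]
            for i in range(d)]


def relaxed_correlation_gap(var_nir, var_vis, margin=0.0):
    """Squared Frobenius gap between the NIR and VIS covariance matrices.

    Assumption: the "relaxation" is modeled as a hinge, so gaps smaller
    than `margin` are not penalized.
    """
    c_nir, c_vis = covariance(var_nir), covariance(var_vis)
    gap = sum((a - b) ** 2
              for row_n, row_v in zip(c_nir, c_vis)
              for a, b in zip(row_n, row_v))
    return max(0.0, gap - margin)
```

In a training loop of the kind the abstract describes, a weighted sum of these three terms would be minimized for the representation part in one step and the recognition network would be updated in the next; the weights and the update schedule are not specified in the abstract and would have to be chosen separately.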