Deep Learning Face Attributes in the Wild

Predicting face attributes in the wild is challenging due to complex face variations. We propose a novel deep learning framework for attribute prediction in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags but pre-trained differently. LNet is pre-trained on massive general object categories for face localization, while ANet is pre-trained on massive face identities for attribute prediction. This framework not only outperforms the state of the art by a large margin, but also reveals valuable facts about learning face representations. (1) It shows how the performance of face localization (LNet) and attribute prediction (ANet) can be improved by different pre-training strategies. (2) It reveals that although the filters of LNet are fine-tuned only with image-level attribute tags, their response maps over entire images give a strong indication of face locations. This fact enables training LNet for face localization with only image-level annotations, without the face bounding boxes or landmarks that are required by all attribute recognition works. (3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training with massive face identities, and that these concepts are significantly enriched after fine-tuning with attribute tags. Each attribute can be well explained by a sparse linear combination of these concepts.
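To make observation (2) concrete, the following is a minimal sketch of how a face box could be read off a response map that peaks over the face. It is an illustration under stated assumptions, not the localization procedure used by LNet: the helper `box_from_response_map`, the relative threshold `rel_threshold`, and the toy map are all hypothetical, and the map is assumed to be a single 2D grid of non-negative activations.

```python
from typing import List, Optional, Tuple

# (top, left, bottom, right), all indices inclusive
Box = Tuple[int, int, int, int]


def box_from_response_map(
    response: List[List[float]], rel_threshold: float = 0.5
) -> Optional[Box]:
    """Tightest box around cells with response >= rel_threshold * peak.

    Hypothetical helper for illustration only; the thresholding rule is an
    assumption and is not taken from the paper.
    """
    if not response or not response[0]:
        return None
    peak = max(max(row) for row in response)
    if peak <= 0:
        return None  # no positive activation anywhere in the map
    cutoff = rel_threshold * peak
    # Rows and columns that contain at least one high-response cell.
    rows = [r for r, row in enumerate(response) if any(v >= cutoff for v in row)]
    cols = [c for row in response for c, v in enumerate(row) if v >= cutoff]
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 5x6 response map with a high-activation blob near the centre.
toy_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.9, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.7, 0.2, 0.0],
    [0.0, 0.0, 0.6, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.0, 0.0],
]
print(box_from_response_map(toy_map))  # (1, 2, 3, 3)
```

The point of the sketch is that no bounding-box labels enter the computation: the box is derived entirely from where the filters respond, which is why image-level attribute tags can suffice as supervision for localization.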