MiVOLO: Multi-input Transformer for Age and Gender Estimation

Age and gender recognition in the wild is a highly challenging task: apart from the variability of conditions, pose complexities, and varying image quality, there are cases where the face is partially or completely occluded. We present MiVOLO (Multi Input VOLO), a straightforward approach for age and gender estimation using the latest vision transformer. Our method integrates both tasks into a unified dual input/output model, leveraging not only facial information but also person image data. This improves the generalization ability of our model and enables it to deliver satisfactory results even when the face is not visible in the image. To evaluate our proposed model, we conduct experiments on four popular benchmarks and achieve state-of-the-art performance, while demonstrating real-time processing capabilities. Additionally, we introduce a novel benchmark based on images from the Open Images Dataset. The ground truth annotations for this benchmark have been meticulously generated by human annotators, resulting in high accuracy answers due to the smart aggregation of votes. Furthermore, we compare our model's age recognition performance with human-level accuracy and demonstrate that it significantly outperforms humans across a majority of age ranges. Finally, we grant public access to our models, along with the code for validation and inference. In addition, we provide extra annotations for used datasets and introduce our new benchmark.