Learning from Label Relationships in Human Affect

Automated estimation of human affect and mental state faces a number of difficulties, including learning from labels with poor or no temporal resolution, learning from few datasets with little data (often due to confidentiality constraints), and (very) long, in-the-wild videos. For these reasons, deep learning methodologies tend to overfit, that is, they arrive at latent representations with poor generalisation performance on the final regression task. To overcome this, we introduce two complementary contributions. First, we introduce a novel relational loss for multilabel regression and ordinal problems that regularises learning and leads to better generalisation. The proposed loss uses label vector inter-relational information to learn better latent representations by aligning batch label distances to the distances in the latent feature space. Second, we utilise a two-stage attention architecture that estimates a target for each clip by using features from the neighbouring clips as temporal context. We evaluate the proposed methodology on both continuous affect and schizophrenia severity estimation problems, as there are methodological and contextual parallels between the two. Experimental results demonstrate that the proposed methodology outperforms all baselines. In the domain of schizophrenia, the proposed methodology outperforms the previous state of the art by a large margin, achieving a PCC of up to 78%, close to the performance of human experts (85%) and much higher than that of previous works (an uplift of up to 40%). In the case of affect recognition, we outperform previous vision-based methods in terms of CCC on both the OMG and the AMIGOS datasets. Specifically, on AMIGOS we outperform the previous SoTA CCC for arousal and valence by 9% and 13% respectively, and on OMG we outperform previous vision-based works by up to 5% for both arousal and valence.
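The core idea of the relational loss described above (aligning pairwise label distances with pairwise latent-feature distances within a batch) can be sketched as follows. This is a minimal illustrative sketch only, not the authors' implementation; the function name, the choice of Euclidean distance, and the squared-error penalty are assumptions for illustration.

```python
import numpy as np

def relational_loss(features: np.ndarray, labels: np.ndarray) -> float:
    """Hypothetical sketch of a relational regulariser.

    features: (batch, d_feat) latent representations.
    labels:   (batch, d_label) label vectors (e.g. arousal/valence).
    Penalises mismatch between the batch's pairwise distance
    structure in label space and in latent feature space.
    """
    # Pairwise Euclidean distance matrices via broadcasting: (batch, batch).
    f_dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    y_dist = np.linalg.norm(labels[:, None, :] - labels[None, :, :], axis=-1)
    # Mean squared difference between the two distance matrices.
    return float(np.mean((f_dist - y_dist) ** 2))
```

When the latent geometry mirrors the label geometry the penalty vanishes, so minimising it alongside the main regression loss pushes the encoder towards representations whose batch-level distance structure matches that of the labels.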