Adversarial Multimodal Domain Transfer for Video-Level Sentiment Analysis

Wang Yanan, Wu Jianming, Furumai Kazuaki, Wada Shinya, Kurihara Satoshi
Abstract

Video-level sentiment analysis is a challenging task that requires systems to obtain discriminative multimodal representations capable of capturing differences in sentiment across modalities. However, because the modalities follow diverse distributions and the unified multimodal labels are not always adaptable to unimodal learning, the distance between unimodal representations grows, preventing systems from learning discriminative multimodal representations. In this paper, to obtain more discriminative multimodal representations that further improve system performance, we propose a VAE-based adversarial multimodal domain transfer (VAE-AMDT) method and jointly train it with a multi-attention module to reduce the distance between unimodal representations. We first apply a variational autoencoder (VAE) to make the visual, linguistic, and acoustic representations follow a common distribution, and then introduce adversarial training to transfer all unimodal representations into a joint embedding space. We then fuse the modalities in this joint embedding space via the multi-attention module, which consists of self-attention, cross-attention, and triple-attention for highlighting important sentiment representations over time and across modalities. Our method improves the F1-score of the state of the art by 3.6% on the MOSI dataset and 2.9% on the MOSEI dataset, demonstrating its efficacy in obtaining discriminative multimodal representations for video-level sentiment analysis.
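The two core ideas in the abstract can be illustrated with a minimal NumPy sketch: each modality is encoded with a VAE-style reparameterized projection into a shared latent space, and the resulting latents are fused with scaled dot-product attention (the building block of self-/cross-attention). All dimensions, weights, and the specific feature sizes below are illustrative assumptions, not the paper's actual architecture, training objective, or adversarial discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_logvar):
    """VAE-style encoder: project a unimodal feature into a shared latent
    space using the reparameterization trick (z = mu + sigma * eps)."""
    mu = x @ w_mu
    logvar = x @ w_logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def attention(q, k, v):
    """Scaled dot-product attention, the building block behind the
    self-/cross-attention fusion described in the abstract."""
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # rows sum to 1
    return w @ v

# Hypothetical per-modality feature sizes for 8 time steps.
d_latent = 32
feats = {
    "visual": rng.standard_normal((8, 35)),
    "linguistic": rng.standard_normal((8, 300)),
    "acoustic": rng.standard_normal((8, 74)),
}

# Encode every modality into the common latent space.
latents = {}
for name, x in feats.items():
    d_in = x.shape[1]
    w_mu = rng.standard_normal((d_in, d_latent)) * 0.01
    w_logvar = rng.standard_normal((d_in, d_latent)) * 0.01
    latents[name] = encode(x, w_mu, w_logvar)

# Example cross-attention: linguistic queries attend over acoustic latents.
fused = attention(latents["linguistic"], latents["acoustic"], latents["acoustic"])
print(fused.shape)  # (8, 32)
```

In the paper's full method, an adversarial discriminator would additionally push the three latent distributions toward a single joint embedding; here the shared latent dimension only hints at that alignment.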
