Faithful Multimodal Explanation for Visual Question Answering

AI systems' ability to explain their reasoning is critical to their utility and trustworthiness. Deep neural networks have enabled significant progress on many challenging problems such as visual question answering (VQA). However, most of them are opaque black boxes with limited explanatory capability. This paper presents a novel approach to developing a high-performing VQA system that can elucidate its answers with integrated textual and visual explanations that faithfully reflect important aspects of its underlying reasoning while capturing the style of comprehensible human explanations. Extensive experimental evaluation demonstrates the advantages of this approach compared to competing methods using both automatic evaluation metrics and human evaluation.