ENTER: Event Based Interpretable Reasoning for VideoQA

In this paper, we present ENTER, an interpretable Video Question Answering (VideoQA) system based on event graphs. Event graphs convert videos into graphical representations, where video events form the nodes and event-event relationships (temporal/causal/hierarchical) form the edges. This structured representation offers several benefits: 1) Interpretable VideoQA via generated code that parses the event graph; 2) Incorporation of contextual visual information into the reasoning process (code generation) via event graphs; 3) Robust VideoQA via Hierarchical Iterative Update of the event graphs. Existing interpretable VideoQA systems are often top-down, disregarding low-level visual information when generating the reasoning plan, and are consequently brittle. Bottom-up approaches, in contrast, produce responses directly from visual data but lack interpretability. Experimental results on NExT-QA, IntentQA, and EgoSchema demonstrate that our method not only outperforms existing top-down approaches while obtaining competitive performance against bottom-up approaches, but, more importantly, offers superior interpretability and explainability in the reasoning process.
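To make the event-graph representation concrete, the sketch below shows one minimal way such a graph could be encoded and queried by generated code. The Event/EventGraph classes, their fields, and the causes_of helper are illustrative assumptions for exposition only, not the schema or generated programs used by ENTER.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: the node/edge fields and the query helper below
# are assumptions, not ENTER's actual event-graph schema or generated code.

@dataclass
class Event:
    event_id: str
    description: str   # e.g., "child drops the cup"
    start: float       # start time in seconds
    end: float         # end time in seconds

@dataclass
class EventGraph:
    events: dict = field(default_factory=dict)   # event_id -> Event
    edges: list = field(default_factory=list)    # (src_id, dst_id, relation)

    def add_event(self, event: Event) -> None:
        self.events[event.event_id] = event

    def add_relation(self, src_id: str, dst_id: str, relation: str) -> None:
        # relation is one of "temporal", "causal", "hierarchical"
        self.edges.append((src_id, dst_id, relation))

    def causes_of(self, event_id: str) -> list:
        # The kind of traversal a generated program might perform:
        # return every event linked to `event_id` by a causal edge.
        return [self.events[src] for src, dst, rel in self.edges
                if dst == event_id and rel == "causal"]

# Toy usage: answer "why did the cup shatter?" by a causal-edge lookup.
g = EventGraph()
g.add_event(Event("e1", "child drops the cup", 3.0, 4.0))
g.add_event(Event("e2", "cup shatters on the floor", 4.0, 5.0))
g.add_relation("e1", "e2", "causal")
print([e.description for e in g.causes_of("e2")])  # ['child drops the cup']
```

Because the answer is produced by an explicit traversal over named events and relations, each prediction can be traced back to the graph elements it used, which is the sense in which the reasoning is interpretable.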