UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Although the two tasks target different events, we observe a significant connection between them: for instance, most descriptions in MR involve multiple actions from TAD. In this paper, we investigate the potential synergy between TAD and MR. First, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and employs two novel query-dependent decoders to produce a uniform output of classification scores and temporal segments. Second, we explore the efficacy of two task fusion learning approaches, pre-training and co-training, to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task fusion learning scheme enables the two tasks to help each other and to outperform their separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.
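
To make the unified interface described above concrete, the following is a minimal PyTorch sketch (not the paper's actual implementation; all module names, shapes, and the fusion scheme are illustrative assumptions). It shows how a TAD action label and an MR sentence, once mapped into a common embedding space by a text encoder, can be decoded identically into a classification score and a temporal segment per frame.

```python
import torch
import torch.nn as nn

class QueryDependentDecoder(nn.Module):
    """Hypothetical sketch of a query-conditioned decoder head.

    Conditions frame-level video features on a query embedding and
    predicts, for every frame, a classification score plus a temporal
    segment expressed as (start, end) offsets.
    """
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)   # fuse video feature with query
        self.cls_head = nn.Linear(dim, 1)     # foreground score per frame
        self.reg_head = nn.Linear(dim, 2)     # (start, end) offsets per frame

    def forward(self, video_feats: torch.Tensor, query_emb: torch.Tensor):
        # video_feats: (T, dim) frame-level features
        # query_emb:   (dim,) embedding of an action label (TAD)
        #              or a free-form sentence (MR)
        q = query_emb.unsqueeze(0).expand_as(video_feats)
        h = torch.relu(self.fuse(torch.cat([video_feats, q], dim=-1)))
        scores = self.cls_head(h).sigmoid().squeeze(-1)  # (T,)
        segments = self.reg_head(h)                      # (T, 2)
        return scores, segments

# Both tasks share one interface: whether the query is the label
# "open door" (TAD) or the sentence "the person opens the door and
# walks out" (MR), it is first embedded into the common space and
# then decoded by the same head.
decoder = QueryDependentDecoder(dim=256)
video_feats = torch.randn(128, 256)   # 128 frames of video features
query_emb = torch.randn(256)          # embedded action label or sentence
scores, segments = decoder(video_feats, query_emb)
```

Under this framing, task fusion learning (pre-training on one task, or co-training by mixing TAD and MR samples in the same batches) requires no architectural changes, since both tasks already produce the same form of output.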