2 months ago

Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

Kim, Minkuk ; Kim, Hyeon Bae ; Moon, Jinyoung ; Choi, Jinwoo ; Kim, Seong Tae
Abstract

There has been significant attention to the research on dense video captioning, which aims to automatically localize and caption all events within untrimmed video. Several studies introduce methods by designing dense video captioning as a multitasking problem of event localization and event captioning to consider inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework inspired by the cognitive information processing of humans. Our model utilizes external memory to incorporate prior knowledge. The memory retrieval method is proposed with cross-modal video-to-text matching. To effectively incorporate retrieved text features, the versatile encoder and the decoder with visual and textual cross-attention modules are designed. Comparative experiments have been conducted to show the effectiveness of the proposed method on ActivityNet Captions and YouCook2 datasets. Experimental results show promising performance of our model without extensive pretraining from a large video dataset.
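
To make the idea of cross-modal video-to-text memory retrieval concrete, here is a minimal sketch of what such a lookup could look like. The abstract does not specify the similarity measure, the embedding model, or how many entries are retrieved, so cosine similarity, the top-k selection, the toy three-dimensional vectors, and the names `cosine`, `retrieve`, and `memory` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(video_feature, memory, k=2):
    """Return the k memory sentences whose text embeddings best match the video feature.

    `memory` is a hypothetical list of (sentence, text_embedding) pairs that
    are assumed to live in the same embedding space as `video_feature`.
    """
    scored = [(cosine(video_feature, emb), text) for text, emb in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy memory bank with made-up embeddings (illustrative values only).
memory = [
    ("a person slices vegetables", [0.9, 0.1, 0.0]),
    ("a dog runs on the beach", [0.0, 0.2, 0.9]),
    ("a chef stirs a pot", [0.7, 0.6, 0.1]),
]

# Hypothetical visual feature for one video segment.
query = [1.0, 0.2, 0.0]
top = retrieve(query, memory, k=2)
print(top)
```

In a full system along the lines the abstract describes, the retrieved text features would then be passed to the encoder and decoder alongside the visual features; that step is not shown here.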
