8 months ago

Video Captioning

Multimodal Representation

Minkuk Kim Hyeon Bae Kim Jinyoung Moon Jinwoo Choi Seong Tae Kim

Abstract

Dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has received significant research attention. Several studies formulate dense video captioning as a multitask problem of event localization and event captioning in order to consider inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework inspired by the cognitive information processing of humans. Our model utilizes external memory to incorporate prior knowledge. The memory retrieval method is based on cross-modal video-to-text matching. To effectively incorporate the retrieved text features, a versatile encoder and a decoder with visual and textual cross-attention modules are designed. Comparative experiments on the ActivityNet Captions and YouCook2 datasets show the effectiveness of the proposed method. Experimental results show promising performance of our model without extensive pretraining on a large video dataset.
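The idea of retrieving text from an external memory by video-to-text matching can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes that video and text features already live in a shared embedding space and that matching is done by cosine similarity, neither of which the abstract specifies. The function names, the toy memory bank, and the feature values are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_from_memory(
    video_feature: Sequence[float],
    memory: List[Tuple[str, Sequence[float]]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (sentence, score) pairs most similar to the video feature.

    `memory` is a hypothetical external memory: a list of
    (sentence, text_feature) pairs with precomputed text features.
    """
    scored = [
        (text, cosine_similarity(video_feature, text_feature))
        for text, text_feature in memory
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy memory bank with made-up 3-dimensional features (assumption for illustration).
memory_bank = [
    ("a person chops vegetables", [0.9, 0.1, 0.0]),
    ("a dog runs on the beach", [0.0, 0.2, 0.9]),
    ("someone stirs a pot on the stove", [0.8, 0.3, 0.1]),
]

# Made-up feature for a video segment that resembles a cooking scene.
query_video_feature = [1.0, 0.2, 0.0]

retrieved = retrieve_from_memory(query_video_feature, memory_bank, top_k=2)
for sentence, score in retrieved:
    print(f"{score:.3f}  {sentence}")
```

In a full model of the kind the abstract describes, the text features of the retrieved sentences would then be passed, together with the visual features, to the encoder and to the decoder's cross-attention modules; that part is omitted here.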
