HyperAI

Video Narration Captioning

Video Narration Captioning is a computer vision sub-task that aims to predict a narration caption for each shot in a multi-shot video. The task adds Automatic Speech Recognition (ASR) text as an additional input and uses the same model architecture as single-shot video captioning; only the prediction target differs, being the narration caption. Narration captions provide background knowledge and reflect the commentator's perspective, which makes them valuable for understanding video content.
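
To make the input/output shape of the task concrete, the following is a minimal sketch of how per-shot examples might be organized: each shot carries its ASR text as an extra input, and the narration caption is the target. The names `Shot`, `build_text_input`, and `make_examples`, the field names, and the prompt format are all assumptions made for illustration; they do not come from any specific dataset or model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shot:
    """One shot of a multi-shot video (all field names are hypothetical)."""
    video_id: str
    start_sec: float
    end_sec: float
    asr_text: str                             # speech transcript overlapping this shot
    narration_caption: Optional[str] = None   # prediction target; None at inference


def build_text_input(shot: Shot) -> str:
    """Compose the text side of the model input from the ASR transcript."""
    # Assumed prompt format; a real system would define its own template.
    asr = shot.asr_text.strip() or "[no speech]"
    return f"ASR: {asr}\nNarration caption:"


def make_examples(shots: List[Shot]) -> List[dict]:
    """Turn shots into (input, target) pairs, one pair per shot."""
    return [
        {
            "clip": (s.video_id, s.start_sec, s.end_sec),  # visual input reference
            "text_input": build_text_input(s),              # additional ASR input
            "target": s.narration_caption,                  # narration caption
        }
        for s in shots
    ]
```

Under these assumptions, a captioning model built for single-shot video would consume `clip` and `text_input` unchanged and be trained to generate `target`, which mirrors the description above: the architecture stays the same and only the inputs and prediction target differ.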