
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, Jinwen Luo, Weibo Gu, Zexuan Li, Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Ying Shan
Abstract

Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack the essential temporally structured, detailed, and in-depth video comprehension capabilities that are the cornerstone of effective video search and recommendation, as well as of emerging video applications. Understanding real-world shorts is genuinely challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing focused on emotional expression and viewpoint delivery. This demands advanced reasoning to effectively integrate multimodal information, including visual, audio, and text signals.

In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning.

Quantitative evaluations on our introduced benchmark, ShortVid-Bench, together with qualitative comparisons, demonstrate the model's strong performance in real-world video comprehension, and it supports zero-shot use or fine-tuning with a few samples for diverse downstream applications. Real-world production deployment of the model has yielded tangible, measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency: stress tests indicate an inference time of just 10 seconds for a one-minute video on an H20 GPU.
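To make the notion of "timestamped video captioning" concrete, the sketch below shows one plausible data shape for such structured output: a list of per-segment captions rendered as a readable timeline. This is a hypothetical illustration only; the `TimedCaption` type and `to_timeline` helper are assumptions for this example, not part of the ARC-Hunyuan-Video API.

```python
from dataclasses import dataclass

@dataclass
class TimedCaption:
    start: float  # segment start time, in seconds
    end: float    # segment end time, in seconds
    text: str     # caption describing this segment

def to_timeline(captions: list[TimedCaption]) -> str:
    """Render timestamped captions as a human-readable [mm:ss-mm:ss] timeline."""
    def fmt(t: float) -> str:
        m, s = divmod(int(t), 60)
        return f"{m:02d}:{s:02d}"
    return "\n".join(
        f"[{fmt(c.start)}-{fmt(c.end)}] {c.text}"
        for c in sorted(captions, key=lambda c: c.start)
    )

# Hypothetical model output for a one-minute cooking short
captions = [
    TimedCaption(0.0, 12.5, "Host introduces the recipe"),
    TimedCaption(12.5, 40.0, "Ingredients are chopped and mixed"),
]
print(to_timeline(captions))
```

A temporal-grounding query ("when are the ingredients mixed?") would then amount to returning the matching segment's `start`/`end` pair rather than free-form text.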