4 days ago

Kwai Keye-VL Technical Report

Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo, Jing Wang, Lejian Ren, Muhao Wei, Qianqian Wang, Qigen Hu, Shiyao Wang, Tao Yu, Xinchen Luo, Yan Li, Yiming Liang, Yuhang Hu, Zeyi Lu, Zhuoran Yang, Zixing Zhang


Abstract

While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the KC-MMBench, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.