
CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang
Abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344 × 1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
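
Since the abstract notes that the image-understanding models are open-sourced, a minimal inference sketch may help readers get started. The checkpoint name and the `build_conversation_input_ids` helper below are assumptions based on the conventions of the THUDM/CogVLM2 repository, not guaranteed by the paper; consult https://github.com/THUDM/CogVLM2 for the authoritative usage.

```python
# Minimal sketch (assumed, not from the paper): querying a CogVLM2 image model
# through Hugging Face Transformers with trust_remote_code enabled.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/cogvlm2-llama3-chat-19B"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

image = Image.open("example.jpg").convert("RGB")
query = "Describe this image."

# build_conversation_input_ids is an assumed repo-specific helper exposed by the
# model's remote code; it packs the prompt and image into model-ready tensors.
features = model.build_conversation_input_ids(tokenizer, query=query, images=[image])
inputs = {
    "input_ids": features["input_ids"].unsqueeze(0).to(model.device),
    "token_type_ids": features["token_type_ids"].unsqueeze(0).to(model.device),
    "attention_mask": features["attention_mask"].unsqueeze(0).to(model.device),
    "images": [[features["images"][0].to(model.device, dtype=torch.bfloat16)]],
}

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=256)
    generated = generated[:, inputs["input_ids"].shape[1]:]  # strip the prompt
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
```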