CogVLM2: Visual Language Models for Image and Video Understanding

Beginning with VisualGLM and CogVLM, we have been continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architectures, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video, and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both the pre-training and post-training stages, supporting input resolutions up to 1344 × 1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench, and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.