SAIL-VL2 Technical Report

We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
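The third contribution above concerns sparse Mixture-of-Experts layers, in which each token is routed to a small subset of expert feed-forward networks so that only a fraction of the model's parameters is active per token. The following is a minimal PyTorch sketch of that general idea, not SAIL-VL2's actual implementation; the class name, dimensions, and top-k routing scheme are all illustrative assumptions.

```python
# Illustrative sketch of a top-k routed sparse MoE feed-forward layer.
# All names and hyperparameters are assumptions for exposition; the report
# does not specify SAIL-VL2's exact MoE design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    def __init__(self, d_model=1024, d_hidden=4096, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        scores = self.router(x)                         # (B, S, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # route each token to k experts
        weights = F.softmax(weights, dim=-1)            # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # tokens assigned to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Top-k routing of this kind is the common pattern in sparse MoE language models; the naive per-expert loop here trades the batched dispatch of production implementations for readability.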