SAIL-VL2 Technical Report

We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
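The third contribution above concerns sparse Mixture-of-Experts layers, in which each token is routed to a small subset of expert feed-forward networks so that only a fraction of the model's parameters is active per token. The following is a minimal PyTorch sketch of that general idea, not SAIL-VL2's actual implementation; the class name, dimensions, and top-k routing scheme are all illustrative assumptions.

```python
# Illustrative sketch of a top-k routed sparse MoE feed-forward layer.
# All names and hyperparameters are assumptions for exposition; the report
# does not specify SAIL-VL2's exact MoE design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    def __init__(self, d_model=1024, d_hidden=4096, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        scores = self.router(x)                         # (B, S, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # route each token to k experts
        weights = F.softmax(weights, dim=-1)            # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # tokens assigned to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Top-k routing of this kind is the common pattern in sparse MoE language models; the naive per-expert loop here trades the batched dispatch of production implementations for readability.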