17 days ago

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin
Abstract

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
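
To make the claim about dynamic resolution and Window Attention more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical patch size of 14 pixels, non-overlapping square windows of 8 x 8 patches, and rounding each image side up to a whole number of patches; none of these values are stated in the abstract. It shows how the number of visual tokens could vary with the input size, and how restricting attention to windows lowers the count of query-key pairs compared with full attention.

```python
import math
from typing import Optional, Tuple


def patch_grid(height: int, width: int, patch: int = 14) -> Tuple[int, int]:
    """Return (rows, cols) of patches for an image kept at its own resolution.

    Assumption: each side is rounded up to a whole number of patches.
    The patch size of 14 is illustrative, not taken from the abstract.
    """
    return math.ceil(height / patch), math.ceil(width / patch)


def attention_pairs(rows: int, cols: int, window: Optional[int] = None) -> int:
    """Count query-key pairs for one attention layer.

    window=None models full attention over all patches.
    Otherwise patches attend only within non-overlapping
    window x window blocks (edge blocks may be smaller).
    """
    if window is None:
        n = rows * cols
        return n * n
    total = 0
    for r0 in range(0, rows, window):
        for c0 in range(0, cols, window):
            h = min(window, rows - r0)
            w = min(window, cols - c0)
            total += (h * w) ** 2
    return total


if __name__ == "__main__":
    for size in [(224, 224), (1080, 1920)]:
        rows, cols = patch_grid(*size)
        full = attention_pairs(rows, cols)
        windowed = attention_pairs(rows, cols, window=8)
        print(size, "tokens:", rows * cols, "full:", full, "windowed:", windowed)
```

Under these assumptions the token count grows with the image area instead of being fixed by a resize step, while the windowed cost grows roughly linearly in the number of patches rather than quadratically.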
