3 months ago

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, and 2 more

Abstract

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which introduces additional overhead and usually harms general capabilities. To enhance spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework for cultivating human-like visuospatial abilities in VLMs, from spatial perception to reasoning. We first enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. We then present VST-R, a curated dataset of 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. We also find that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
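The progressive pipeline described in the abstract (supervised fine-tuning first, reinforcement learning second) can be pictured as an ordered list of stages applied to a model. The following is only a minimal sketch of that ordering, not the authors' implementation: `ModelState`, `supervised_fine_tuning`, `reinforcement_learning`, `run_pipeline`, and `DATASET_SIZES` are hypothetical names introduced for illustration, the stage bodies are placeholders, and the abstract does not say which dataset feeds which stage, so no such mapping is assumed here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class ModelState:
    """Hypothetical stand-in for a VLM checkpoint; it only records which stages ran."""
    history: List[str] = field(default_factory=list)


def supervised_fine_tuning(state: ModelState) -> ModelState:
    # Placeholder: a real stage would minimize a supervised loss on labeled samples.
    return ModelState(history=state.history + ["sft"])


def reinforcement_learning(state: ModelState) -> ModelState:
    # Placeholder: a real stage would optimize a reward on spatial reasoning tasks.
    return ModelState(history=state.history + ["rl"])


Stage = Callable[[ModelState], ModelState]

# Order follows the abstract: SFT builds foundational knowledge, RL refines reasoning.
PROGRESSIVE_PIPELINE: Sequence[Stage] = (supervised_fine_tuning, reinforcement_learning)

# Dataset sizes as reported in the abstract (assignment to stages is not assumed).
DATASET_SIZES = {"VST-P": 4_100_000, "VST-R": 135_000}


def run_pipeline(state: ModelState, stages: Sequence[Stage] = PROGRESSIVE_PIPELINE) -> ModelState:
    """Apply each stage in order, feeding the output of one stage into the next."""
    for stage in stages:
        state = stage(state)
    return state


if __name__ == "__main__":
    final_state = run_pipeline(ModelState())
    print(final_state.history)
```

Keeping the stages as plain callables in a sequence makes the ordering explicit and easy to swap for experiments, for example running the supervised stage alone as a baseline.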

Visual Spatial Tuning
