
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou
Abstract

Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves an average improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0% gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy, while maintaining competitive performance on general vision-language tasks.
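
For intuition, the sketch below shows one way a segment-weighted DPO objective could be implemented: the usual DPO log-ratio margin is decomposed over segment types (here, descriptive grounding vs. logical reasoning) with a separate temperature per segment. This is a minimal illustration under assumed conventions, not the paper's exact formulation; the function name `fdpo_loss`, the two-way segment taxonomy, and the per-segment betas are all hypothetical.

```python
import torch
import torch.nn.functional as F

def fdpo_loss(policy_logps_w, policy_logps_l,
              ref_logps_w, ref_logps_l,
              segment_ids_w, segment_ids_l,
              beta_per_segment=(0.1, 0.3)):
    """Hypothetical sketch of a segment-weighted DPO objective.

    *_logps:       per-token log-probs, shape (batch, seq_len),
                   for chosen (w) and rejected (l) responses.
    segment_ids:   per-token segment labels, same shape;
                   0 = descriptive grounding, 1 = logical reasoning.
    beta_per_segment: assumed per-segment temperatures.
    """
    def segment_log_ratio(pol, ref, seg_ids):
        # Sum the policy/reference log-ratio within each segment type,
        # scaled by that segment's beta, then combine across segments.
        ratio = pol - ref
        total = torch.zeros(pol.shape[0], device=pol.device)
        for seg, beta in enumerate(beta_per_segment):
            mask = (seg_ids == seg).float()
            total = total + beta * (ratio * mask).sum(dim=-1)
        return total

    # Standard DPO margin, but built from segment-weighted log-ratios.
    margin = (segment_log_ratio(policy_logps_w, ref_logps_w, segment_ids_w)
              - segment_log_ratio(policy_logps_l, ref_logps_l, segment_ids_l))
    return -F.logsigmoid(margin).mean()
```

The design intent captured here is that assigning a distinct temperature to each segment type lets the preference signal weight descriptive-grounding tokens and logical-reasoning tokens differently, rather than applying a single sequence-level contrast as in standard DPO.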