4 months ago

Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai

Abstract

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
