2 months ago

Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, and 2 more

Abstract

Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
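The two-phase idea in the abstract (positive-only advantages first, negative signals later) can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' implementation: the function names, the fixed `switch_step` threshold, and the choice to zero out negative advantages in the early phase are hypothetical, and the abstract does not state how the adaptive mechanism decides when to switch. The advantage computation uses the group-standardized form commonly associated with GRPO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def curriculum_advantages(advantages, step, switch_step):
    """Hypothetical curriculum (assumption, not from the paper).

    Before `switch_step`, negative advantages are zeroed so that only
    better-than-expected samples drive the update (imitation-like phase).
    From `switch_step` onward, the full signed advantages are used
    (discrimination phase).
    """
    if step < switch_step:
        return [a if a > 0 else 0.0 for a in advantages]
    return list(advantages)


# Toy group of four sampled responses with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)

early = curriculum_advantages(adv, step=10, switch_step=100)   # positive-only
late = curriculum_advantages(adv, step=200, switch_step=100)   # signed
```

In the early phase the two incorrect samples contribute no gradient signal, while in the late phase they push probability mass away from the incorrect responses. A fixed step threshold is only the simplest possible schedule; an adaptive criterion would replace the `step < switch_step` test.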

Reinforcement Learning

LLM

Reasoning

Method/Architecture

From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
