2 months ago

TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Abstract

Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, which involves a self-guided rollout algorithm that views sequence generation as a tree-structured search process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and the fallback strategy. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks and the GPU-hour savings of 22% up to 43% from the sampling design for the trained models, while also showing up to a 40% reduction in trajectory-level and 35% in token-level sampling compute for existing models. While offering a free lunch in inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project home page is at https://m-a-p.ai/TreePO.
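
To make the prefix-amortization idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a fixed branching factor at every segment boundary, whereas the abstract describes dynamic, uncertainty-driven branching with early stopping. The names `Node`, `toy_decode_segment`, `grow`, and `compare_costs` are hypothetical, and `toy_decode_segment` is a stand-in for a real model call. The sketch counts how many tokens must be decoded when trajectories share segment prefixes in a tree, versus decoding each trajectory independently from scratch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One fixed-length segment in the rollout tree (hypothetical structure)."""
    tokens: list
    children: list = field(default_factory=list)


def toy_decode_segment(depth, branch_idx, seg_len):
    # Stand-in for a model call: returns seg_len placeholder tokens.
    return [f"d{depth}b{branch_idx}t{k}" for k in range(seg_len)]


def grow(node, depth, max_depth, branch, seg_len):
    # Assumption: every node spawns `branch` children until max_depth.
    if depth == max_depth:
        return
    for i in range(branch):
        child = Node(toy_decode_segment(depth, i, seg_len))
        node.children.append(child)
        grow(child, depth + 1, max_depth, branch, seg_len)


def decoded_tokens(node):
    # Each segment is decoded once, no matter how many leaves reuse it.
    return len(node.tokens) + sum(decoded_tokens(c) for c in node.children)


def trajectories(node, prefix=()):
    # Yield the full token sequence of every root-to-leaf path.
    path = prefix + tuple(node.tokens)
    if not node.children:
        yield path
    else:
        for child in node.children:
            yield from trajectories(child, path)


def compare_costs(branch, max_depth, seg_len):
    root = Node(tokens=[])
    grow(root, 0, max_depth, branch, seg_len)
    paths = list(trajectories(root))
    tree_cost = decoded_tokens(root)
    independent_cost = sum(len(p) for p in paths)
    return tree_cost, independent_cost, len(paths)


if __name__ == "__main__":
    tree, independent, n = compare_costs(branch=2, max_depth=4, seg_len=8)
    print(f"{n} trajectories: tree={tree} tokens, independent={independent} tokens")
```

The token counts from this toy setup depend entirely on the chosen branching factor, depth, and segment length; they illustrate the mechanism only and are not comparable to the savings reported in the abstract.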
