Paper Roundup | Latest Advances in Large-Scale Reinforcement Learning: Microsoft, Google, Stanford, Renmin University, Xiaohongshu, and Others Release Major Achievements in Credit Allocation, Complex Reasoning, and Agent Reinforcement Learning

2 months ago

Looking at the current development of reinforcement learning, whether it's improving credit allocation capabilities in long-chain inference, enhancing the model's autonomous exploration in complex environments, or building intelligent agent systems with long-term planning and feedback learning capabilities, their core goals all point in the same direction—Breaking through the limitations of sparse rewards and static supervision,It empowers the model to continuously learn and evolve through interaction.

Reinforcement learning is essentially a method that allows an intelligent agent to continuously optimize its behavioral strategies through a closed loop of "perception-decision-execution-feedback". Unlike traditional supervised learning, which relies on a fixed data distribution, reinforcement learning emphasizes the model's ability to learn through trial and error in environmental interactions, enabling it to gradually form a decision-making mechanism that maximizes long-term benefits in dynamic tasks.In short, reinforcement learning is driving artificial intelligence from "being able to answer questions" to "being able to act autonomously," completing a significant leap from "passive generation" to "active intelligence."

This week,HyperAI has selected 6 of the latest research papers in the field of large-model reinforcement learning for you.The team behind it includes top universities such as Stanford University and Renmin University of China, as well as tech giants like Microsoft, Google, Kuaishou, and Xiaohongshu. Their related papers provide highly inspiring new solutions for building next-generation large-scale models with strong reasoning and self-learning capabilities. Let's learn together! ⬇️

In addition, to allow more users to understand the latest developments in the field of artificial intelligence in academia,HyperAI's official website now features a "Latest Papers" section, allowing users to stay up-to-date with cutting-edge AI research.

Latest AI Papers:https://go.hyper.ai/hzChC

This week's paper recommendation

1 ECHO

Paper title:

ECHO: Terminal Agents Learn World Models for Free

Terminal agent interactions generate massive amounts of environmental feedback, but conventional reinforcement learning only uses sparse rewards to update action labels, severely wasting observation data. This research proposes the ECHO method, which, while preserving action loss, additionally calculates cross-entropy prediction loss for environmental feedback labels. This mechanism does not increase forward propagation overhead, enabling the policy to synchronously predict the terminal's responses to instructions during training, essentially learning the world model for free.

Experimental results show that the method doubles the first-response accuracy on the terminal control benchmark, significantly enhances the ability to predict unseen terminal dynamics, greatly reduces reliance on expert demonstrations, and can even achieve self-evolution without external verification.

Paper and detailed interpretation:https://go.hyper.ai/qma4O

ECHO mechanism: Transforms terminal feedback into supervisory signals for agent reinforcement learning.

2 DelTA

Paper title:

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Reinforcement learning based on verifiable rewards often faces the challenge of overly coarse credit allocation granularity. Regular updates are easily dominated by high-frequency shared patterns such as typesetting, failing to effectively identify the key inference markers that truly generate high returns. To address this issue, this research proposes DelTA, which reweights the self-normalized objective function by calculating unique coefficients. This mechanism accurately amplifies the gradient directions of markers unique to both positive and negative reward sides, strongly suppressing shared, weakly discriminative directions, and significantly improving the contrast of gradient updates. In mathematical inference and code generation evaluations, this method comprehensively outperforms the best baselines of its scale and demonstrates excellent generalization ability across different architectures.

Paper and detailed interpretation:https://go.hyper.ai/IdI42

3 GoLongRL

Paper title:

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Long-context reinforcement learning is often limited by homogeneous retrieval training data, and conventional algorithms are prone to distorted advantage estimation due to scale and difficulty differences when dealing with mixed rewards across multiple tasks. This research proposes the capability-oriented GoLongRL scheme, pioneering an open-source dataset covering nine core capabilities and customized rewards. Addressing optimization challenges, a TMN-Reweight mechanism is designed, utilizing task-level normalization to align different reward scales and combining difficulty-adaptive weights to focus on high-value, difficult samples. Evaluations show that this scheme comprehensively outperforms existing leading models on multiple long-text benchmarks and effectively avoids the decline in general reasoning and memory capabilities.

Paper and detailed interpretation:https://go.hyper.ai/omy5E

**Overview of the Four-Stage Capability-Oriented RLVR Dataset Construction Process**

The authors constructed a dataset containing 22,965 samples, covering nine capability-oriented tasks, with context lengths ranging from 0.1K to 256K tokens.

4 AntiSD

Paper title:

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Conventional self-distillation in mathematical reasoning tasks easily leads models to "take shortcuts," over-relying on known answers and suppressing the thought process that truly drives multi-step searches. To address this issue, this research proposes the Anti-Self-Distillation (AntiSD) method. Instead of passively narrowing the gap between teacher and student models, it maximizes the JS divergence to reverse the gradient signal, specifically rewarding exploratory thought markers, and supplements this with an entropy-based gating mechanism to maintain training stability. In tests on multiple large models with varying parameter scales, this method requires only one-fifth to one-half of the baseline training steps to achieve the target, while improving the final accuracy by up to 11.5 percentage points on multiple mathematical reasoning benchmarks.

Paper and detailed interpretation:https://go.hyper.ai/Vax3f

5 RubricEM

Paper title:

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Long-term, in-depth research tasks often lack objective rewards, and conventional reinforcement learning provides coarse feedback that fails to accumulate effective experience. This research proposes the RubricEM framework, innovatively using a "rating scale" as its core interface. The model breaks down long trajectories into planning, retrieval, review, and response stages based on a self-built scale, thereby achieving fine-grained credit allocation. Simultaneously, the framework asynchronously trains meta-policies, extracting historical interactions into reusable reflective memories. In multiple long-term research evaluations, this 8B model surpasses numerous open-source solutions and approaches top-tier closed-source systems, achieving efficient long-context learning and excellent cross-task generalization with minimal training steps.

Paper and detailed interpretation:https://go.hyper.ai/xSVTh

Detailed reinforcement learning training process of RubricEM

Dataset composition and source: The research team constructed a supervised fine-tuning dataset containing approximately 11,000 samples. The data source is agent trajectories generated by the Gemini teacher model and adapted for Qwen3.

6 Poly-EPO

Paper title:

Poly-EPO: Training Exploratory Reasoning Models

Post-training of large-scale reinforcement learning models often leads to a collapse in generative diversity, hindering the exploration of new inference paths and the expansion of computation during testing. To address collaborative exploration and utilization, this study proposes the Poly-EPO algorithm based on ensemble reinforcement learning. This method breaks away from the traditional approach of evaluating individual responses in isolation, multiplying the average reward of a set of responses by the diversity score of the inference policy as a joint optimization objective, thus natively embedding signals encouraging diverse exploration into the advantage function. In mathematical reasoning evaluations, this algorithm successfully avoids policy homogenization, achieving a pass@k coverage improvement of up to 20%, and demonstrating stronger expansion potential under majority voting mechanisms.

Paper and detailed interpretation:https://go.hyper.ai/j9Z3C

Poly-EPO facilitates broader branching during inference generation.

The above is all the content of this week’s paper recommendation. For more cutting-edge AI research papers, please visit the “Latest Papers” section of hyper.ai’s official website.

See you next week!

Paper Roundup | Latest Advances in Large-Scale Reinforcement Learning: Microsoft, Google, Stanford, Renmin University, Xiaohongshu, and Others Release Major Achievements in Credit Allocation, Complex Reasoning, and Agent Reinforcement Learning

2 months ago

Information

Artificial Intelligence

Reinforcement Learning

Latest AI Papers:https://go.hyper.ai/hzChC

This week's paper recommendation

1 ECHO

Paper title:

ECHO: Terminal Agents Learn World Models for Free

Paper and detailed interpretation:https://go.hyper.ai/qma4O

2 DelTA

Paper title:

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Paper and detailed interpretation:https://go.hyper.ai/IdI42

3 GoLongRL

Paper title:

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Paper and detailed interpretation:https://go.hyper.ai/omy5E

The authors constructed a dataset containing 22,965 samples, covering nine capability-oriented tasks, with context lengths ranging from 0.1K to 256K tokens.

4 AntiSD

Paper title:

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Paper and detailed interpretation:https://go.hyper.ai/Vax3f

5 RubricEM

Paper title:

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Paper and detailed interpretation:https://go.hyper.ai/xSVTh

6 Poly-EPO

Paper title:

Poly-EPO: Training Exploratory Reasoning Models

Paper and detailed interpretation:https://go.hyper.ai/j9Z3C

The above is all the content of this week’s paper recommendation. For more cutting-edge AI research papers, please visit the “Latest Papers” section of hyper.ai’s official website.

See you next week!

Command Palette

Paper Roundup | Latest Advances in Large-Scale Reinforcement Learning: Microsoft, Google, Stanford, Renmin University, Xiaohongshu, and Others Release Major Achievements in Credit Allocation, Complex Reasoning, and Agent Reinforcement Learning

This week's paper recommendation

1 ECHO

2 DelTA

3 GoLongRL

4 AntiSD

5 RubricEM

6 Poly-EPO

Command Palette

Paper Roundup | Latest Advances in Large-Scale Reinforcement Learning: Microsoft, Google, Stanford, Renmin University, Xiaohongshu, and Others Release Major Achievements in Credit Allocation, Complex Reasoning, and Agent Reinforcement Learning

This week's paper recommendation

1 ECHO

2 DelTA

3 GoLongRL

4 AntiSD

5 RubricEM

6 Poly-EPO

Related News

Paper Weekly Report | Microsoft MAI-Thinking Explores self-evolution of Pure RL, Achieving an AIME Accuracy of 97%; VLM³ Achieves 3D Task Generalization Using Plain Text Coordinates Without Architectural Modifications… A Quick Overview of the week's cutting-edge AI Papers

Paper Weekly Report | ProgramBench Enables AI to Write Software From Scratch, With 9 Major Models Failing En Masse; ExoActor Demonstrates Strong Scene Generalization Ability Without Additional real-world Data… A Quick Overview of the week's cutting-edge AI Papers

Paper Weekly Report | DeepMind D4RT Unified Dynamic 4D Reconstruction, Inference Speed Surges 300x; Shattering the Illusion of AGI's Universality, Columbia University and Others Propose SAI Theory to Reshape the Goals of AI Evolution… A Quick Look at the Week's Cutting-Edge AI Papers

With a Speedup of 252 Times, Stanford, UCLA, and Other Institutions Have Used LSTM to Bring second-order Nonlinear Optical Simulations Into the Millisecond era.

ICML 26 Outstanding Papers: Tsinghua JustGRPO Overcomes the dLLM Inference Bottleneck; Say Goodbye to Simple Instruction Tests: Agents Last Exam Comprehensively Evaluates the long-range Professional Capabilities of Intelligent agents.

4-step Image output/4K quality/6x Speedup, PiD Uses Pixel Diffusion to Unify Decoding and super-resolution Output; SA-3DAO: a Dataset Containing 1000 Pairs of Real Images Paired With Handcrafted 3D Meshes by artists.

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

Free CPU Online Tutorial | Hermes Agent: Learn Long-Term Memory? The Memory Enhancement Plugin TencentDB Agent Memory Can Store Facts, Preferences, Task States, etc., separately.

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Command Palette

Paper Roundup | Latest Advances in Large-Scale Reinforcement Learning: Microsoft, Google, Stanford, Renmin University, Xiaohongshu, and Others Release Major Achievements in Credit Allocation, Complex Reasoning, and Agent Reinforcement Learning

This week's paper recommendation

1 ECHO

2 DelTA

3 GoLongRL

4 AntiSD

5 RubricEM

6 Poly-EPO

Related News

Paper Weekly Report | Microsoft MAI-Thinking Explores self-evolution of Pure RL, Achieving an AIME Accuracy of 97%; VLM³ Achieves 3D Task Generalization Using Plain Text Coordinates Without Architectural Modifications… A Quick Overview of the week's cutting-edge AI Papers

Paper Weekly Report | ProgramBench Enables AI to Write Software From Scratch, With 9 Major Models Failing En Masse; ExoActor Demonstrates Strong Scene Generalization Ability Without Additional real-world Data… A Quick Overview of the week's cutting-edge AI Papers

Paper Weekly Report | DeepMind D4RT Unified Dynamic 4D Reconstruction, Inference Speed Surges 300x; Shattering the Illusion of AGI's Universality, Columbia University and Others Propose SAI Theory to Reshape the Goals of AI Evolution… A Quick Look at the Week's Cutting-Edge AI Papers

With a Speedup of 252 Times, Stanford, UCLA, and Other Institutions Have Used LSTM to Bring second-order Nonlinear Optical Simulations Into the Millisecond era.

ICML 26 Outstanding Papers: Tsinghua JustGRPO Overcomes the dLLM Inference Bottleneck; Say Goodbye to Simple Instruction Tests: Agents Last Exam Comprehensively Evaluates the long-range Professional Capabilities of Intelligent agents.

4-step Image output/4K quality/6x Speedup, PiD Uses Pixel Diffusion to Unify Decoding and super-resolution Output; SA-3DAO: a Dataset Containing 1000 Pairs of Real Images Paired With Handcrafted 3D Meshes by artists.

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

Free CPU Online Tutorial | Hermes Agent: Learn Long-Term Memory? The Memory Enhancement Plugin TencentDB Agent Memory Can Store Facts, Preferences, Task States, etc., separately.

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Related News

Paper Weekly Report | Microsoft MAI-Thinking Explores self-evolution of Pure RL, Achieving an AIME Accuracy of 97%; VLM³ Achieves 3D Task Generalization Using Plain Text Coordinates Without Architectural Modifications… A Quick Overview of the week's cutting-edge AI Papers

Paper Weekly Report | ProgramBench Enables AI to Write Software From Scratch, With 9 Major Models Failing En Masse; ExoActor Demonstrates Strong Scene Generalization Ability Without Additional real-world Data… A Quick Overview of the week's cutting-edge AI Papers

Paper Weekly Report | DeepMind D4RT Unified Dynamic 4D Reconstruction, Inference Speed Surges 300x; Shattering the Illusion of AGI's Universality, Columbia University and Others Propose SAI Theory to Reshape the Goals of AI Evolution… A Quick Look at the Week's Cutting-Edge AI Papers

With a Speedup of 252 Times, Stanford, UCLA, and Other Institutions Have Used LSTM to Bring second-order Nonlinear Optical Simulations Into the Millisecond era.

ICML 26 Outstanding Papers: Tsinghua JustGRPO Overcomes the dLLM Inference Bottleneck; Say Goodbye to Simple Instruction Tests: Agents Last Exam Comprehensively Evaluates the long-range Professional Capabilities of Intelligent agents.

4-step Image output/4K quality/6x Speedup, PiD Uses Pixel Diffusion to Unify Decoding and super-resolution Output; SA-3DAO: a Dataset Containing 1000 Pairs of Real Images Paired With Handcrafted 3D Meshes by artists.

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

Free CPU Online Tutorial | Hermes Agent: Learn Long-Term Memory? The Memory Enhancement Plugin TencentDB Agent Memory Can Store Facts, Preferences, Task States, etc., separately.

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Related News

Paper Weekly Report | Microsoft MAI-Thinking Explores self-evolution of Pure RL, Achieving an AIME Accuracy of 97%; VLM³ Achieves 3D Task Generalization Using Plain Text Coordinates Without Architectural Modifications… A Quick Overview of the week's cutting-edge AI Papers

Paper Weekly Report | ProgramBench Enables AI to Write Software From Scratch, With 9 Major Models Failing En Masse; ExoActor Demonstrates Strong Scene Generalization Ability Without Additional real-world Data… A Quick Overview of the week's cutting-edge AI Papers

Paper Weekly Report | DeepMind D4RT Unified Dynamic 4D Reconstruction, Inference Speed Surges 300x; Shattering the Illusion of AGI's Universality, Columbia University and Others Propose SAI Theory to Reshape the Goals of AI Evolution… A Quick Look at the Week's Cutting-Edge AI Papers

With a Speedup of 252 Times, Stanford, UCLA, and Other Institutions Have Used LSTM to Bring second-order Nonlinear Optical Simulations Into the Millisecond era.

ICML 26 Outstanding Papers: Tsinghua JustGRPO Overcomes the dLLM Inference Bottleneck; Say Goodbye to Simple Instruction Tests: Agents Last Exam Comprehensively Evaluates the long-range Professional Capabilities of Intelligent agents.

4-step Image output/4K quality/6x Speedup, PiD Uses Pixel Diffusion to Unify Decoding and super-resolution Output; SA-3DAO: a Dataset Containing 1000 Pairs of Real Images Paired With Handcrafted 3D Meshes by artists.

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

Free CPU Online Tutorial | Hermes Agent: Learn Long-Term Memory? The Memory Enhancement Plugin TencentDB Agent Memory Can Store Facts, Preferences, Task States, etc., separately.

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.