HyperAI


OpenClaw-RL: Train Any Agent Simply Through Conversation

Yinjie Wang Xuyang Chen Xiaolong Jin Mengdi Wang Ling Yang

Abstract

Every agent interaction generates a next-state signal, namely the user's reply, a tool's output, or the evolving state of a terminal or graphical user interface (GUI) that follows each action; yet no existing agent reinforcement learning (RL) system exploits these signals as a source of live, online learning. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a single policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, software engineering (SWE) tasks, and tool-call traces are not separate training problems. They are interactions that can all train the same policy within the same learning loop. Next-state signals encode two forms of information: evaluative signals, which indicate the quality of the executed action and are extracted as scalar rewards via a process reward model (PRM) judge; and directive signals, which specify how the action should have differed and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enriched teacher context, and provide token-level directional supervision, conveying richer information than any scalar reward. Thanks to an asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy simultaneously, with no coordination overhead between these components.
Applied to personal agents, OpenClaw-RL lets an agent improve simply by being used, recovering conversational signals from new user requests, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable reinforcement learning in terminal, GUI, SWE, and tool-call environments, where we further demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL

One-sentence Summary

The authors from OpenClaw propose OpenClaw-RL, a unified framework that transforms universal next-state signals into live online learning sources via binary reinforcement learning and Hindsight-Guided On-Policy Distillation. This approach enables continuous policy improvement for both personal and general agents across diverse scenarios like terminal, GUI, and software engineering tasks without interrupting service.

Key Contributions

  • OpenClaw-RL addresses the problem of discarded next-state signals in AI agents by treating user replies, tool results, and error traces as implicit, free-form evaluations rather than mere context for future actions.
  • The framework introduces a unified asynchronous architecture that recovers both scalar process rewards via a PRM judge and token-level directional supervision through Hindsight-Guided On-Policy Distillation (OPD) from live interaction data.
  • Experiments demonstrate that combining these methods significantly improves performance for personal agents in conversational settings and general agents across terminal, GUI, SWE, and tool-call environments by providing dense credit assignment for long-horizon tasks.

Introduction

Deployed AI agents continuously generate valuable next-state signals, such as user replies or test results, yet current systems discard this data by treating it only as context for future actions rather than a source of live learning. Existing reinforcement learning approaches typically rely on offline batch data, scalar outcome rewards that lack step-level granularity, or pre-curated feedback pairs, which prevents continuous optimization during real-world deployment. The authors introduce OpenClaw-RL, a unified asynchronous infrastructure that recovers these implicit signals to enable online training for both personal and general agents. Their framework leverages Process Reward Models to extract dense step-wise rewards from live interactions and employs Hindsight-Guided On-Policy Distillation to convert textual error traces into directional token-level supervision without requiring external annotators.

Dataset

  • Dataset Composition and Sources: The authors curate a multi-scenario dataset to support terminal, GUI, software engineering (SWE), and tool-call agents by combining four distinct sources: SETA RL data, OSWorld-Verified, SWE-Bench-Verified, and DAPO RL data.

  • Key Details for Each Subset:

    • Terminal Agents: Trained on SETA RL data to leverage efficient text-based interfaces.
    • GUI Agents: Trained on OSWorld-Verified data to handle visual interfaces and pointer interactions, with evaluation restricted to the training set after excluding Chrome and multi-app tasks.
    • SWE Agents: Trained on SWE-Bench-Verified data to utilize rich executable feedback like tests and diffs.
    • Tool-Call Agents: Trained on DAPO RL data to enhance reasoning and factual accuracy, with evaluation performed on the AIME 2024 mathematics competition dataset.
  • Model Usage and Training Strategy: The authors apply specific model configurations to each subset, using Qwen3-8B for terminal tasks, Qwen3VL-8B-Thinking for GUI tasks, Qwen3-32B for SWE tasks, and Qwen3-4B-SFT for tool-call tasks. Performance for terminal and SWE agents is measured by averaging rollout-task accuracy over a window of reinforcement learning steps.

  • Processing and Metadata Construction: For GUI agent evaluation, the authors construct a strict step-level feedback prompt that combines text instructions, historical actions, and base64-encoded images of the environment state before and after an action. This prompt instructs the evaluator to assign a binary score of +1 or -1 based on whether the action is relevant, executable, and results in concrete progress toward the objective.
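The prompt construction above can be sketched as follows. This is a minimal illustration, not the authors' exact schema: the field names and helper function are hypothetical, and only the rubric wording (relevant, executable, concrete progress, binary +1/-1 score) is taken from the description.

```python
import base64


def build_step_feedback_prompt(instruction, action_history, before_png, after_png):
    """Assemble a step-level feedback prompt for a GUI action evaluator.

    `before_png` / `after_png` are raw PNG bytes of the environment state
    before and after the action; they are base64-encoded for transport.
    Field names here are illustrative, not the paper's exact schema.
    """
    return {
        "instruction": instruction,
        "action_history": list(action_history),
        "screenshot_before": base64.b64encode(before_png).decode("ascii"),
        "screenshot_after": base64.b64encode(after_png).decode("ascii"),
        "rubric": (
            "Score +1 only if the action is relevant to the instruction, "
            "executable in the current state, and results in concrete "
            "progress toward the objective; otherwise score -1."
        ),
    }


prompt = build_step_feedback_prompt(
    "Open the settings panel",
    ["click(menu)"],
    b"<png-bytes-before>",
    b"<png-bytes-after>",
)
```

Encoding the screenshots as base64 keeps the prompt a plain JSON-serializable payload, which matches how multimodal judge APIs typically accept images.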

Method

The OpenClaw-RL framework is built on the observation that next-state signals, such as user replies or tool outputs, encode both evaluative and directive information about an agent's actions. The system unifies the training of personal and general agents through a fully asynchronous pipeline that decouples policy serving, environment hosting, reward judging, and policy training. Refer to the framework diagram for the overall architecture. The infrastructure connects Personal Agents and General Agents to Environment Servers, which handle confidential API keys and large-scale environments respectively. These servers feed data into an RL Server comprising a Training Engine (Megatron), a Policy Server (SGLang), and a PRM Server. This decoupled design ensures that the model can serve live requests while the PRM judges interactions and the trainer updates weights without blocking dependencies.

The learning process leverages next-state signals through two complementary methods. As shown in the figure below, the framework supports Binary RL for evaluative signals and On-Policy Distillation for directive signals. In the Binary RL approach, a PRM judge evaluates the quality of an action $a_t$ given the next state $s_{t+1}$, producing a scalar reward $r \in \{+1, -1, 0\}$. This reward serves as the advantage $A_t$ in a standard PPO-style clipped surrogate objective.
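A minimal per-token sketch of this objective, assuming the PRM's scalar reward is plugged in directly as the advantage (the clipping threshold of 0.2 is the common PPO default, not a value stated in the paper):

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate objective for a single token.

    In the Binary RL setup, the PRM judge's scalar reward
    r in {+1, -1, 0} is used directly as the advantage A_t.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Take the pessimistic minimum of the unclipped and clipped terms,
    # as in standard PPO, to bound the size of the policy update.
    return min(ratio * advantage, clipped * advantage)
```

With a reward of 0 (the PRM abstains), the surrogate vanishes and the token contributes no gradient, so only interactions the judge can confidently score drive the update.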

For more granular improvement, Hindsight-Guided On-Policy Distillation (OPD) extracts textual hints from the next state to construct an enhanced teacher context. The advantage is calculated as the per-token log-probability gap between the teacher model, which conditions on the hint, and the student model:

$$A_t = \log \pi_{\text{teacher}}(a_t \mid s_{\text{enhanced}}) - \log \pi_{\theta}(a_t \mid s_t)$$

This provides directional guidance at the token level, indicating which tokens should be upweighted or downweighted. For general agents, the system further integrates step-wise rewards with outcome rewards, utilizing step-wise standardization to handle long-horizon trajectories. The authors combine both methods by weighting their respective advantages, allowing the policy to benefit from broad coverage via scalar rewards and high-resolution corrections via distillation.
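The OPD advantage and the weighted combination can be sketched as below. The mixing weights `w_opd` and `w_rl` are hypothetical: the text only states that the two advantages are combined by weighting, without giving values.

```python
def opd_advantages(teacher_logps, student_logps):
    """Per-token OPD advantage: hint-conditioned teacher minus student.

    A positive value means the teacher, which sees the hindsight hint,
    assigns the token a higher log-probability than the student policy,
    so that token is upweighted; a negative value downweights it.
    """
    return [t - s for t, s in zip(teacher_logps, student_logps)]


def combined_advantages(opd_adv, scalar_reward, w_opd=0.5, w_rl=0.5):
    """Illustrative weighting of OPD and Binary-RL advantages per token.

    `scalar_reward` is the PRM judge's reward in {+1, -1, 0}, broadcast
    over all tokens of the action; the weights are assumed, not given.
    """
    return [w_opd * a + w_rl * scalar_reward for a in opd_adv]
```

Because the OPD term is signed per token while the scalar reward is constant across the action, the combination preserves token-level direction while anchoring the overall update to the judge's verdict.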

Experiment

  • The personal agent track validates that conversational next-state signals enable continuous personalization: the combined optimization method outperforms either binary reinforcement learning or on-policy distillation alone, helping agents adopt natural writing styles and provide friendlier feedback after minimal interactions.
  • The general agent track demonstrates that the unified infrastructure supports scalable reinforcement learning across terminal, GUI, software engineering, and tool-call scenarios through large-scale environment parallelization.
  • Experiments confirm that integrating process reward models with outcome rewards yields stronger optimization for long-horizon tasks compared to outcome-only approaches, despite the additional resource requirements.
