
OpenClaw-RL: Training Any Agent Through Dialogue Alone

Yinjie Wang Xuyang Chen Xiaolong Jin Mengdi Wang Ling Yang

Abstract

Every agent interaction produces a "next-state signal" — the immediate user reply, the tool output, or the terminal or GUI state change — yet existing agentic reinforcement learning (RL) systems do not recover and exploit these signals as a live, online source of learning. Based on the simple observation that next-state signals are universal and that a single policy can learn from all of them simultaneously, we propose OpenClaw-RL. Personal conversations, terminal executions, GUI operations, software engineering (SWE) tasks, and tool-call traces are not separate learning problems but interactions that can train the same policy in the same loop. Next-state signals encode two kinds of information. The first is evaluative: how well an action performed, extracted as a scalar reward via a Process Reward Model (PRM) judge. The second is directive: how the action should have differed, recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state to construct an enhanced teacher context, providing supervision via token-level directional advantages that are more informative than any scalar reward. With an asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy, with zero coordination overhead among the three. Applied to personal agents, OpenClaw-RL lets an agent improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, and further demonstrates the usefulness of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL

One-sentence Summary

The authors from OpenClaw propose OpenClaw-RL, a unified framework that transforms universal next-state signals into live online learning sources via binary reinforcement learning and Hindsight-Guided On-Policy Distillation. This approach enables continuous policy improvement for both personal and general agents across diverse scenarios like terminal, GUI, and software engineering tasks without interrupting service.

Key Contributions

  • OpenClaw-RL addresses the problem of discarded next-state signals in AI agents by treating user replies, tool results, and error traces as implicit, free-form evaluations rather than mere context for future actions.
  • The framework introduces a unified asynchronous architecture that recovers both scalar process rewards via a PRM judge and token-level directional supervision through Hindsight-Guided On-Policy Distillation (OPD) from live interaction data.
  • Experiments demonstrate that combining these methods significantly improves performance for personal agents in conversational settings and general agents across terminal, GUI, SWE, and tool-call environments by providing dense credit assignment for long-horizon tasks.

Introduction

Deployed AI agents continuously generate valuable next-state signals, such as user replies or test results, yet current systems discard this data by treating it only as context for future actions rather than a source of live learning. Existing reinforcement learning approaches typically rely on offline batch data, scalar outcome rewards that lack step-level granularity, or pre-curated feedback pairs, which prevents continuous optimization during real-world deployment. The authors introduce OpenClaw-RL, a unified asynchronous infrastructure that recovers these implicit signals to enable online training for both personal and general agents. Their framework leverages Process Reward Models to extract dense step-wise rewards from live interactions and employs Hindsight-Guided On-Policy Distillation to convert textual error traces into directional token-level supervision without requiring external annotators.

Dataset

  • Dataset Composition and Sources: The authors curate a multi-scenario dataset to support terminal, GUI, software engineering (SWE), and tool-call agents by combining four distinct sources: SETA RL data, OSWorld-Verified, SWE-Bench-Verified, and DAPO RL data.

  • Key Details for Each Subset:

    • Terminal Agents: Trained on SETA RL data to leverage efficient text-based interfaces.
    • GUI Agents: Trained on OSWorld-Verified data to handle visual interfaces and pointer interactions, with evaluation restricted to the training set after excluding Chrome and multi-app tasks.
    • SWE Agents: Trained on SWE-Bench-Verified data to utilize rich executable feedback like tests and diffs.
    • Tool-Call Agents: Trained on DAPO RL data to enhance reasoning and factual accuracy, with evaluation performed on the AIME 2024 mathematics competition dataset.
  • Model Usage and Training Strategy: The authors apply specific model configurations to each subset, using Qwen3-8B for terminal tasks, Qwen3VL-8B-Thinking for GUI tasks, Qwen3-32B for SWE tasks, and Qwen3-4B-SFT for tool-call tasks. Performance for terminal and SWE agents is measured by averaging rollout-task accuracy over a window of reinforcement learning steps.

  • Processing and Metadata Construction: For GUI agent evaluation, the authors construct a strict step-level feedback prompt that combines text instructions, historical actions, and base64-encoded images of the environment state before and after an action. This prompt instructs the evaluator to assign a binary score of +1 or -1 based on whether the action is relevant, executable, and results in concrete progress toward the objective.
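The step-level feedback prompt described above can be sketched as follows. This is a minimal illustration, not the authors' code: the message schema, field names, and helper functions (`build_step_feedback_prompt`, `parse_binary_score`) are assumptions; only the overall structure (instruction + action history + base64-encoded before/after screenshots, with a strict +1/-1 verdict) follows the description.

```python
import base64


def build_step_feedback_prompt(instruction, history, before_png: bytes, after_png: bytes):
    """Assemble a step-level GUI feedback prompt (hypothetical message schema).

    Combines the task instruction, prior actions, and base64-encoded
    screenshots of the environment before and after the action, mirroring
    the strict step-level evaluation setup described in the text.
    """
    def b64(img: bytes) -> str:
        return base64.b64encode(img).decode("ascii")

    return [
        {"role": "system",
         "content": ("Judge the latest action. Reply +1 if it is relevant, "
                     "executable, and makes concrete progress toward the "
                     "objective; otherwise reply -1.")},
        {"role": "user", "content": [
            {"type": "text", "text": f"Instruction: {instruction}"},
            {"type": "text", "text": "History: " + "; ".join(history)},
            {"type": "image_b64", "data": b64(before_png)},  # state before the action
            {"type": "image_b64", "data": b64(after_png)},   # state after the action
        ]},
    ]


def parse_binary_score(reply: str) -> int:
    """Map the judge's free-form reply onto the binary reward, defaulting to -1."""
    return +1 if "+1" in reply else -1
```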

Method

The OpenClaw-RL framework is built on the observation that next-state signals, such as user replies or tool outputs, encode both evaluative and directive information about an agent's actions. The system unifies the training of personal and general agents through a fully asynchronous pipeline that decouples policy serving, environment hosting, reward judging, and policy training. Refer to the framework diagram for the overall architecture. The infrastructure connects Personal Agents and General Agents to Environment Servers, which handle confidential API keys and large-scale environments respectively. These servers feed data into an RL Server comprising a Training Engine (Megatron), a Policy Server (SGLang), and a PRM Server. This decoupled design ensures that the model can serve live requests while the PRM judges interactions and the trainer updates weights without blocking dependencies.
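The decoupling described above can be illustrated with a toy producer-consumer pipeline. This is a sketch of the design idea only, not the authors' infrastructure (which uses SGLang, Megatron, and a separate PRM server): the three components run concurrently and exchange data solely through queues, so none blocks on another. All names and the in-memory queue protocol are illustrative.

```python
import queue
import threading

# Each stage talks to the next only through a queue, mirroring the
# decoupled policy-serving / PRM-judging / training loop.
interactions = queue.Queue()   # policy server -> PRM judge
experience = queue.Queue()     # PRM judge -> trainer


def policy_server(requests):
    """Serve live requests; inference is stubbed out for the sketch."""
    for req in requests:
        action = f"action_for({req})"
        interactions.put((req, action))
    interactions.put(None)  # end-of-stream sentinel


def prm_judge():
    """Score each interaction; the PRM call is stubbed as a keyword check."""
    while (item := interactions.get()) is not None:
        req, action = item
        reward = +1 if "ok" in req else -1
        experience.put((req, action, reward))
    experience.put(None)


def trainer(updates):
    """Consume judged experience; a real trainer would update policy weights."""
    while (item := experience.get()) is not None:
        updates.append(item)


updates = []
threads = [
    threading.Thread(target=policy_server, args=(["ok_task", "bad_task"],)),
    threading.Thread(target=prm_judge),
    threading.Thread(target=trainer, args=(updates,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The key property is that the policy server never waits for judging or training to finish, which is what lets the deployed model keep serving requests while learning proceeds in the background.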

The learning process leverages next-state signals through two complementary methods. As shown in the figure below, the framework supports Binary RL for evaluative signals and On-Policy Distillation for directive signals. In the Binary RL approach, a PRM judge evaluates the quality of an action $a_t$ given the next state $s_{t+1}$, producing a scalar reward $r \in \{+1, -1, 0\}$. This reward serves as the advantage $A_t$ in a standard PPO-style clipped surrogate objective.
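A minimal sketch of this objective, assuming the standard PPO clipped surrogate with the PRM's scalar reward plugged in directly as the advantage (the paper's exact loss and hyperparameters such as the clip range may differ):

```python
import math


def clipped_surrogate_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped surrogate loss over a batch of tokens/actions.

    Here each advantage is the PRM's scalar reward r in {+1, -1, 0},
    broadcast to the action it judged. Returns the mean negative
    surrogate (a loss to minimize).
    """
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                 # pi_theta / pi_old
        unclipped = ratio * adv
        clipped = max(min(ratio, 1 + eps), 1 - eps) * adv  # clip the ratio
        losses.append(-min(unclipped, clipped))            # pessimistic bound
    return sum(losses) / len(losses)
```

With `logp_new == logp_old` the ratio is 1 and the loss reduces to the negated advantage; a large ratio gets clipped at `1 + eps`, bounding the update size.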

For more granular improvement, Hindsight-Guided On-Policy Distillation (OPD) extracts textual hints from the next state to construct an enhanced teacher context. The advantage is calculated as the per-token log-probability gap between the teacher model, which conditions on the hint, and the student model:

$$A_t = \log \pi_{\text{teacher}}(a_t \mid s_{\text{enhanced}}) - \log \pi_{\theta}(a_t \mid s_t)$$

This provides directional guidance at the token level, indicating which tokens should be upweighted or downweighted. For general agents, the system further integrates step-wise rewards with outcome rewards, utilizing step-wise standardization to handle long-horizon trajectories. The authors combine both methods by weighting their respective advantages, allowing the policy to benefit from broad coverage via scalar rewards and high-resolution corrections via distillation.
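The OPD advantage and the weighted combination of the two signals can be sketched as below. The advantage formula follows the equation above; the combination function is an assumption about how the weighting might look (the weights `w_opd` and `w_rl` and the broadcasting of the scalar reward over tokens are illustrative, not taken from the paper):

```python
def opd_token_advantages(teacher_logps, student_logps):
    """Per-token directional advantage for Hindsight-Guided OPD:
    A_t = log pi_teacher(a_t | s_enhanced) - log pi_theta(a_t | s_t).
    Positive values upweight a token; negative values downweight it.
    """
    return [t - s for t, s in zip(teacher_logps, student_logps)]


def combined_advantages(opd_adv, scalar_reward, w_opd=1.0, w_rl=1.0):
    """Hypothetical mixing of both signals: the PRM's scalar reward is
    broadcast over the tokens and added to the token-level OPD advantage.
    """
    return [w_opd * a + w_rl * scalar_reward for a in opd_adv]
```

A token the hint-conditioned teacher finds much more likely than the student (e.g. teacher log-prob -0.1 vs. student -1.0) gets a positive advantage and is reinforced; the reverse gets suppressed.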

Experiment

  • The personal agent track validates that conversational next-state signals enable continuous personalization, with a combined optimization method outperforming binary reinforcement learning and on-policy distillation to help agents adopt natural writing styles and provide friendlier feedback after minimal interactions.
  • The general agent track demonstrates that the unified infrastructure supports scalable reinforcement learning across terminal, GUI, software engineering, and tool-call scenarios through large-scale environment parallelization.
  • Experiments confirm that integrating process reward models with outcome rewards yields stronger optimization for long-horizon tasks compared to outcome-only approaches, despite the additional resource requirements.
