
AutoWebWorld: Synthesizing Infinitely Verifiable Web Environments with Finite State Machines

Abstract

The performance of autonomous web GUI agents depends heavily on the quality and quantity of their training data. A fundamental challenge remains, however: collecting interaction trajectories from real-world websites is expensive, and verifying them is difficult. Because the internal mechanics of state transitions are invisible, evaluating step-level correctness requires external verifiers that are costly and inconsistent. To address this, we propose AutoWebWorld, a novel framework that synthesizes controllable and verifiable web environments by modeling them as finite state machines (FSMs), and then uses a coding agent to translate each FSM into an interactive website. Unlike real websites, where state transitions are implicit, AutoWebWorld defines every state, action, and transition rule explicitly. This enables programmatic verification: the correctness of an action is checked against the predefined rules, and task success is confirmed by reaching a goal state on the FSM graph. AutoWebWorld powers a fully automated search-and-verify pipeline that generates more than 11,663 verified trajectories from 29 diverse web environments at a cost of only $0.04 per trajectory. Training on this synthetic data markedly improves real-world performance: our 7B web GUI agent outperforms all baselines on WebVoyager within 15 steps. Moreover, we observe a clear scaling trend in which performance on WebVoyager and Online-Mind2Web improves consistently as the volume of synthetic data grows.

One-sentence Summary

Researchers from multiple institutions propose AutoWebWorld, a framework that generates verifiable synthetic web environments via FSMs to train GUI agents, enabling cost-efficient, scalable data production that boosts real-world performance without human verification.

Key Contributions

  • AutoWebWorld introduces a state-driven framework that models web environments as Finite State Machines (FSMs) to enable programmatic verification of agent actions and task success, eliminating reliance on costly and inconsistent human or LLM verifiers.
  • The system automatically generates 29 synthetic websites and 11,663 verified interaction trajectories at $0.04 per trajectory by executing breadth-first search over FSM graphs and validating outcomes via executable front-end renderings.
  • Training agents on this synthetic data yields state-of-the-art performance on WebVoyager within 15 steps and demonstrates a clear scaling law, where increasing synthetic data volume consistently improves real-world benchmark results on WebVoyager and Online-Mind2Web.

Introduction

The authors leverage Finite State Machines to synthesize verifiable web environments for training GUI agents, addressing the high cost and inconsistency of verifying real-world interaction trajectories. Prior methods rely on external verifiers—human annotators or LLMs—to judge correctness from opaque UI feedback, creating a costly and unreliable bottleneck. AutoWebWorld eliminates this by encoding explicit states, actions, and transitions into synthetic websites, enabling programmatic verification and scalable trajectory generation at $0.04 per trajectory. Their approach yields 11,663 verified trajectories across 29 environments, and training on this data improves real-world agent performance with a clear scaling law, demonstrating synthetic data’s potential to drive generalization in foundational models.

Dataset

The authors use AutoWebWorld, a synthesized GUI trajectory dataset built from 29 programmatically generated websites, to train and evaluate GUI agents. Here’s how the data is composed, processed, and used:

  • Dataset Composition and Sources:
    Each website is defined by three core files:

    • fsm.json: Encodes semantic transitions, grounded GUI actions, and success criteria (via terminal pages).
    • bfs.json: Contains BFS-generated trajectories with step-by-step semantic states and GUI procedures.
    • data.js: Structured backend data (e.g., provider lists, appointments, bills) used to render dynamic UI elements and generate visual-grounded queries.
  • Key Subset Details:

    • BFS-driven Queries: Generated directly from bfs.json, using interaction templates over verified trajectories.
    • Visual-grounded Queries: Built from data.js and rendered item images; target items are referenced via visual descriptions (not names), preserving interaction templates.
    • Screenshot QA Queries: Sampled from feature-based QA templates in data.js, then filtered via VLM to ensure the queried feature is visible in the screenshot.
    • All subsets use five standardized interaction modes: search, scroll, slider, sort, checkbox — differing only in grounding signal (name vs. visual description).
  • Training Data Usage:

    • 11,663 verified trajectories are synthesized across 29 websites.
    • To reduce redundancy, one trajectory per task is sampled from parallel BFS paths → 1,215 distinct trajectories (12,585 steps total).
    • Some trajectories are converted into grounding supervision: individual steps are extracted and rewritten as UI localization examples.
    • Final training set combines trajectory steps and grounding examples → ~16k total training steps for GRPO.
    • Mixture ratios are not explicitly stated, but grounding and trajectory data are unified into a single training corpus.
  • Processing and Filtering:

    • Trajectories undergo strict execution-based filtering via Playwright: each atomic GUI action is replayed; any failure (e.g., missing element, broken button) discards the entire trajectory.
    • Only reproducibly executable trajectories are retained, paired with grounded actions and state sequences.
    • Queries are generated post-filtering, with metadata (interaction mode, template parameters, thresholds) stored in lightweight manifests for reproducible analysis.
    • No cropping is applied; instead, visual grounding relies on feature-conditioned images and screenshot visibility checks.
    • Final dataset enables intrinsic verification (no human judges), with deterministic transitions and goal attainment defined by the FSM.

This pipeline yields high-quality, reproducible, long-horizon trajectories (avg. 21.94 steps) at low cost ($0.04 per trajectory), suitable for training and benchmarking GUI agents on real-world tasks.
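The deduplication step above (one trajectory per task sampled from parallel BFS paths) can be sketched as follows. This is a minimal illustration, assuming trajectories are dicts with a `task` id and a `steps` list; the released data schema is not specified in this summary.

```python
import random
from collections import defaultdict

def sample_one_per_task(trajectories, seed=0):
    """Collapse parallel BFS paths into one trajectory per task.

    `trajectories` is assumed to be a list of dicts with a 'task' id and a
    'steps' list -- an illustrative layout, not the released schema.
    """
    by_task = defaultdict(list)
    for traj in trajectories:
        by_task[traj["task"]].append(traj)
    rng = random.Random(seed)  # fixed seed so the sampled subset is reproducible
    return [rng.choice(group) for group in by_task.values()]
```

Applied to the full set of 11,663 trajectories, a pass like this would yield one representative per task, matching the paper's reduction to 1,215 distinct trajectories.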

Method

The authors leverage a transition-driven, multi-stage pipeline to generate synthetic web environments with intrinsic verification, enabling scalable trajectory synthesis and reproducible benchmarking. The core architecture, as depicted in the framework diagram, consists of four sequential phases: FSM generation, web environment synthesis, trajectory enumeration via BFS, and execution-based filtering.

In the first phase, a multi-agent system generates a Finite State Machine (FSM) specification from a given web theme. The FSM proposer drafts an initial structure including page definitions, signature variables, and transition rules. This candidate FSM is then validated by an automated validator that checks for structural soundness—such as reachability of terminal states and deterministic effect application—and returns revision suggestions if constraints are violated. An improver agent iteratively refines the FSM until validation passes. This loop ensures the FSM encodes a deterministic state-transition system where each state s = (p, σ) comprises a page identifier p and a structured signature σ capturing task-relevant variables. Transitions are governed by explicit preconditions and effect rules, ensuring that the next state s_{t+1} = T(s_t, a_t) is uniquely determined by the current state and action.
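The deterministic transition rule s_{t+1} = T(s_t, a_t) described above can be sketched in a few lines. The field and parameter names here (page, signature, precondition, effect) are illustrative assumptions, not the paper's exact schema; the point is that a transition applies only when its precondition holds, and its effect on the signature is a pure function.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class State:
    page: str         # page identifier p
    signature: tuple  # task-relevant variables sigma; hashable, so states can be deduplicated

@dataclass
class Transition:
    source_page: str
    target_page: str
    precondition: Callable[[State], bool]  # guard: may this action fire here?
    effect: Callable[[tuple], tuple]       # deterministic update of the signature

def step(state: State, t: Transition) -> Optional[State]:
    """Apply a transition: s_{t+1} = T(s_t, a_t), or None if not applicable."""
    if state.page != t.source_page or not t.precondition(state):
        return None  # precondition violated; the action is rejected
    return State(t.target_page, t.effect(state.signature))
```

Because `State` is frozen and its signature is a tuple, states hash and compare by value, which is what makes the BFS deduplication in the third phase straightforward.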

The second phase translates the validated FSM into an executable simulated web environment. A coding agent, guided by the FSM and a reference website for stylistic anchoring, follows a four-stage pipeline: (1) generating project guidelines and scaffolding, (2) synthesizing Vue components for each page with iterative self-review, (3) building the project, and (4) triggering a self-repair loop if build failures occur. Crucially, the generated DOM strictly implements the selectors defined in the FSM, creating a deterministic bridge between semantic actions and their GUI realizations.

In the third phase, the authors perform breadth-first search (BFS) over the FSM’s state graph to enumerate all possible trajectories from the initial state s_0 to goal states. Each node corresponds to a semantic state (p, σ), and edges represent executable actions. BFS uses signature hashing for deduplication and expands nodes only if preconditions are satisfied. Goal states are defined via predicates G(s) over signature variables, such as reaching a terminal page or satisfying a constraint like “cart contains at least one item.” This ensures trajectories are correct by construction at the semantic level.
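The enumeration step can be sketched as a standard BFS with hash-based deduplication. This is a simplified sketch, assuming states are hashable and that `actions(state)` yields only precondition-satisfying (action, next_state) pairs; the paper's actual interface and handling of parallel paths may differ.

```python
from collections import deque

def enumerate_trajectories(s0, actions, is_goal, max_depth=25):
    """BFS over the FSM state graph from s0 to goal states.

    `actions(state)` yields (action_name, next_state) pairs whose
    preconditions hold; hashable states give free deduplication.
    Returns the list of action sequences that reach a goal.
    """
    visited = {s0}
    queue = deque([(s0, [])])  # (state, action path so far)
    trajectories = []
    while queue:
        state, path = queue.popleft()
        if is_goal(state):
            trajectories.append(path)
            continue  # goal states are terminal; do not expand further
        if len(path) >= max_depth:
            continue  # bound the horizon
        for name, nxt in actions(state):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [name]))
    return trajectories
```

Note that global deduplication keeps one shortest route to each state; the paper reports that parallel BFS paths to the same task are later down-sampled to one trajectory per task.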

The final phase involves grounding and filtering. Each BFS-derived action sequence is expanded into a sequence of atomic GUI operations (e.g., click, type) via its pre-defined gui_procedure, which specifies selectors and normalized coordinates. These sequences are replayed on the synthesized website using Playwright. Only trajectories that execute all steps successfully and reach the intended goal state are retained. This end-to-end process, illustrated in the figure, ensures that collected trajectories are both semantically valid and executable, enabling large-scale, verifiable data generation without human annotation.
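The execution-based filter above can be sketched as a replay loop over a Playwright-style page object. The two-op vocabulary (click/type) and the step dict layout are simplifying assumptions; in practice a `gui_procedure` would carry the FSM-defined selectors and coordinates, and the page would come from Playwright's `sync_playwright()` session.

```python
def replay(page, gui_procedure):
    """Replay atomic GUI actions on a (Playwright-style) page object.

    Returns True only if every step executes; any failure (missing
    element, broken button) discards the whole trajectory, mirroring
    the paper's all-or-nothing filtering.
    """
    try:
        for step in gui_procedure:
            if step["op"] == "click":
                page.click(step["selector"])
            elif step["op"] == "type":
                page.fill(step["selector"], step["text"])
            else:
                raise ValueError(f"unknown op: {step['op']}")
        return True
    except Exception:
        return False  # one broken step invalidates the trajectory
```

With the real library, `page.click(selector)` and `page.fill(selector, text)` are the corresponding Playwright calls; passing a stub page object makes the filter logic testable without launching a browser.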

Experiment

  • AutoWebWorld-synthesized trajectories enhance GUI agent performance in real-world navigation and grounding tasks, validated on WebVoyager and ScreenSpot benchmarks.
  • Training on verified synthetic data yields strong generalization, with Ours-7B outperforming open-source baselines and Ours-3B surpassing larger models despite using only 16K steps, demonstrating high data efficiency.
  • Grounding supervision from AutoWebWorld consistently improves performance on ScreenSpot-V2 and ScreenSpot-Pro, particularly in text and icon localization tasks.
  • Scaling synthetic data volume shows clear performance gains on real-world benchmarks, with success rates rising steadily as sample size increases, indicating sustained scalability.
  • Grounding data is critical for stable and effective GRPO training; its absence leads to early reward spikes but long-term performance degradation.
  • AutoWebWorld’s synthesized websites serve as challenging, reproducible benchmarks, with agents performing worse on them than on real sites, confirming their non-trivial difficulty.
  • Cost analysis reveals that per-step reasoning dominates expenses, highlighting optimization potential through reduced redundant planning and improved action constraints.
  • Training details confirm efficient distributed setup using 8 A800 GPUs, BF16 precision, FlashAttention-2, and DeepSpeed ZeRO-3 for memory optimization.

Across these evaluations, training on AutoWebWorld-synthesized trajectories yields significant gains over baselines on real-world navigation and grounding tasks. On WebVoyager, the 7B model outperforms other open-source baselines, and even the 3B model surpasses several larger models despite training on only ~16K steps, underscoring the data efficiency of verified synthetic trajectories. Performance scales predictably with synthetic data volume, and grounding supervision proves critical for stable GRPO training: without it, rewards spike early but degrade over the long run.

On grounding benchmarks, both the 3B and 7B models improve consistently on ScreenSpot-V2 and ScreenSpot-Pro, particularly in text and icon localization; the larger gains on ScreenSpot-Pro suggest the synthetic data captures complex, real-world grounding demands. The synthesized websites themselves are non-trivial: agents perform worse on them than on real sites, while the environments remain fully controllable and reproducible at a much lower cost per trajectory. Finally, the cost analysis shows that per-step reasoning dominates expenses, pointing to optimization through reduced redundant planning and tighter action constraints rather than environment execution.

