Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Zehai He Wenyi Hong Zhen Yang Ziyang Pan Mingdao Liu Xiaotao Gu Jie Tang

Abstract

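Recent advances in large language models (LLMs) have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development. Vision2Web spans tasks ranging from static UI-to-code generation, through interactive multi-page frontend reproduction, to long-horizon full-stack website development. The benchmark is built from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple vision-language models (VLMs) instantiated under different coding agent frameworks. The results reveal substantial performance gaps at every task level, and even state-of-the-art models still struggle with full-stack development tasks.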

One-sentence Summary

Researchers from Tsinghua University and another institute introduce Vision2Web, a hierarchical benchmark for visual website development that evaluates LLMs across static UI-to-code and full-stack tasks. Using a novel workflow-based agent verification paradigm, the study reveals significant performance gaps in current models for complex, end-to-end web creation.

Key Contributions

  • The paper introduces Vision2Web, a hierarchical benchmark for visual website development that spans static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack development using 193 real-world tasks and 918 prototype images.
  • A workflow-based agent verification paradigm is presented to ensure reproducible evaluation by structuring tests as directed dependency graphs with explicitly defined nodes that constrain agent execution while maintaining flexibility.
  • The work implements two complementary verification components: a GUI agent verifier for functional correctness and a VLM-based judge for visual fidelity. Experiments using these components reveal substantial performance gaps in state-of-the-art models on full-stack development tasks.

Introduction

Developing and evaluating autonomous agents for visual website creation is critical as these systems move from simple code generation to full end-to-end software development. Prior evaluation methods struggle because traditional unit tests cannot handle diverse implementations, while existing agent-based evaluators often behave unpredictably due to loosely specified objectives. Furthermore, visual testing relies on brittle rule-based scripts or pixel-level comparisons that fail to capture human perceptual judgments. To address these gaps, the authors introduce Vision2Web, a hierarchical benchmark that employs a workflow-based agent verification paradigm. This approach constrains agent execution through structured test workflows and explicit verification nodes, enabling reproducible and implementation-agnostic assessment of both functional correctness and visual fidelity within a unified framework.

Dataset

  • Dataset Composition and Sources

    • The authors construct Vision2Web from real-world websites sourced exclusively from the C4 validation set to prevent data leakage.
    • The benchmark spans four major categories (Content, Transaction, SaaS Platforms, Public Services) and 16 subcategories to ensure diversity.
    • It includes a multimedia resource library containing images, icons, videos, and fonts to simulate realistic development environments.
  • Key Details for Each Subset

    • The dataset comprises 193 tasks divided into three hierarchical levels of increasing complexity:
      • Static Webpage (100 tasks): Focuses on visual fidelity across desktop, tablet, and mobile resolutions using prototype images.
      • Interactive Frontend (66 tasks): Requires generating multi-page frontends with coherent navigation flows based on multiple prototypes and text descriptions.
      • Full-Stack Website (27 tasks): Simulates realistic engineering scenarios with requirement documents, complex state management, and backend integration.
    • The collection includes 918 prototype images and 1,255 test cases, totaling 21,516 input files.
  • Data Processing and Filtering Pipeline

    • A three-stage filtering pipeline refines the initial web corpus:
      • Structural Assessment: Analyzes DOM properties like tag distribution and tree depth to exclude simple or malformed pages, reducing candidates to 63,515.
      • Content Screening: Uses VLM-based scoring to retain only 7,391 pages with functional richness and visual coherence.
      • Manual Review: Human annotators verify page consistency, implementation difficulty, and category balance to finalize the task set.
    • Test case annotation employs an expert-in-the-loop strategy where PhD researchers draft high-level workflows and Claude Code refines them into executable sequences.
  • Usage in Model Evaluation

    • The authors utilize the dataset to evaluate multimodal coding agents via a workflow-based agent verification paradigm.
    • Evaluation relies on a GUI agent verifier to execute test workflows and a VLM-based judge to quantitatively assess visual fidelity against prototypes.
    • The benchmark measures both functional correctness and visual fidelity without relying on external orchestration layers, ensuring agents depend solely on their own reasoning and coding capabilities.
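
The Structural Assessment stage of the filtering pipeline can be sketched as a minimal check on tag distribution and tree depth. This is an illustration only: the class `DomStats`, the function `passes_structural_filter`, and both thresholds are hypothetical, since the source does not state which DOM properties or cutoffs the authors actually used.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not deepen the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomStats(HTMLParser):
    """Collects tag counts and maximum nesting depth while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def passes_structural_filter(html, min_depth=4, min_distinct_tags=5):
    """Keep a page only if it is deep and varied enough.

    Both thresholds are assumed values for illustration.
    """
    stats = DomStats()
    stats.feed(html)
    return (stats.max_depth >= min_depth
            and len(stats.tag_counts) >= min_distinct_tags)


simple_page = "<html><body><p>hi</p></body></html>"
rich_page = ("<html><body><div><ul><li><a href='#'>x</a>"
             "<img src='a.png'></li></ul></div></body></html>")
```

With these assumed thresholds, `simple_page` is rejected as too shallow while `rich_page` is kept. A real pipeline would then pass the surviving pages to the VLM-based content screening and manual review stages.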

Method

The proposed framework for automated website evaluation is structured into three sequential phases: Hierarchical Task Formulation, Coding Agent generation, and Workflow-based Agent Verification. This pipeline ensures a systematic approach to generating and validating full-stack web applications.

The process begins with Hierarchical Task Formulation, which decomposes the development objective into three levels of increasing complexity. Level 1 targets a Static Webpage, focusing on responsive HTML/CSS output across multiple devices. Level 2 advances to an Interactive Frontend, adding inter-page logic so that multiple pages work together as a navigable frontend. Finally, Level 3 addresses the Full-Stack Website, integrating requirements, databases, and assets to produce a complete system.

Following the task definition, the Coding Agent module utilizes multimodal resources to synthesize the website. This central component processes the specifications from the formulation phase to generate the actual code and assets required for the target system.

The final stage employs Workflow-based Agent Verification to assess the generated output. This stage formalizes end-to-end testing as a directed dependency graph where nodes represent self-contained verification sub-procedures and edges encode sequential dependencies. To balance evaluation stability and coverage efficiency, the system constructs test workflows by decoupling dependent test nodes to prevent error propagation and integrating related test nodes within the same application context.
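
A test workflow of this kind can be sketched as a dependency graph executed in topological order. The sketch below is an assumption-laden illustration: the function `run_workflow`, the node names, and the policy of marking a node as skipped when a prerequisite did not pass are all hypothetical, and the source does not specify how the authors handle failed prerequisites.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, checks):
    """Run verification nodes in dependency order.

    dependencies: maps node name -> set of prerequisite node names.
    checks: maps node name -> zero-argument callable returning True on pass.
    Returns a dict of node name -> "pass" | "fail" | "skipped".
    """
    results = {}
    for node in TopologicalSorter(dependencies).static_order():
        prerequisites = dependencies.get(node, set())
        if any(results[p] != "pass" for p in prerequisites):
            # Assumed policy: do not run a node whose prerequisite did not pass.
            results[node] = "skipped"
        else:
            results[node] = "pass" if checks[node]() else "fail"
    return results


# Hypothetical workflow: three chained nodes plus one independent node.
example_dependencies = {
    "login": set(),
    "add_to_cart": {"login"},
    "checkout": {"add_to_cart"},
    "view_home": set(),
}
example_checks = {
    "login": lambda: True,
    "add_to_cart": lambda: False,   # simulated failure
    "checkout": lambda: True,
    "view_home": lambda: True,
}
example_results = run_workflow(example_dependencies, example_checks)
```

In this example the simulated failure in `add_to_cart` stops `checkout` from running, while the independent `view_home` node is unaffected, which is the kind of error isolation the decoupling of dependent nodes is meant to provide.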

Verification nodes are categorized into two complementary types. Functional Verification Nodes assess interaction fidelity and are formalized as a 3-tuple $n_i = \langle O_i, A_i, V_i \rangle$, where $O_i$ specifies the testing objective, $A_i$ defines guided actions, and $V_i$ encodes validation criteria. A GUI Agent Verifier executes these nodes, maintaining a context $\mathcal{C}_i = \{ \mathcal{H}_{<i}, O_i, A_i, V_i \}$ that includes historical objectives and actions to ensure reproducible state transitions. The Functional Score (FS) is computed as the proportion of passed functional verification nodes.
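
The functional node tuple, the verifier context, and the Functional Score can be written down as a small data model. This is a minimal sketch, not the authors' implementation: the names `FunctionalNode`, `build_context`, and `functional_score`, as well as the field types, are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalNode:
    objective: str    # O_i: what the node is testing
    actions: tuple    # A_i: guided actions for the GUI agent
    validation: str   # V_i: criteria that decide pass or fail


def build_context(history, node):
    """Assemble C_i from the prior nodes' objectives and actions plus node i."""
    return {
        "history": [(n.objective, n.actions) for n in history],
        "objective": node.objective,
        "actions": node.actions,
        "validation": node.validation,
    }


def functional_score(passed_flags):
    """FS = number of passed functional nodes / total functional nodes."""
    if not passed_flags:
        raise ValueError("functional_score needs at least one node result")
    return sum(1 for passed in passed_flags if passed) / len(passed_flags)


# Hypothetical nodes for illustration.
login = FunctionalNode("Log in", ("type username", "click submit"),
                       "dashboard is visible")
search = FunctionalNode("Search products", ("type query", "press Enter"),
                        "result list is not empty")
context = build_context([login], search)
score = functional_score([True, False, True, True])  # 3 of 4 nodes passed
```

Here `score` evaluates to 0.75, and `context["history"]` carries only the objectives and actions of earlier nodes, matching the description of the context above.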

Visual Verification Nodes assess visual fidelity by comparing rendered pages against reference prototypes. Each node is formalized as $n_i = \langle P_i \rangle$, where $P_i$ denotes the target prototype. A dedicated VLM Judge is invoked to perform component-level comparisons, assigning fidelity scores based on predefined visual rubrics. The Visual Score (VS) is calculated as the average of all block-level scores across the prototypes. This dual-verifier approach allows for a granular and systematic assessment of both the functional logic and the visual consistency of the generated website.
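
The Visual Score aggregation can be illustrated with a short sketch. It assumes that block-level scores lie in the range 0 to 1 and that all blocks are pooled before averaging (rather than averaging per prototype first); the source states neither detail explicitly, and the function name `visual_score` and the prototype keys are hypothetical.

```python
def visual_score(block_scores_by_prototype):
    """Average every block-level score, pooled across all prototypes.

    block_scores_by_prototype: maps a prototype name to the list of
    block-level fidelity scores the judge assigned for that prototype.
    """
    all_scores = [score
                  for scores in block_scores_by_prototype.values()
                  for score in scores]
    if not all_scores:
        raise ValueError("visual_score needs at least one block-level score")
    return sum(all_scores) / len(all_scores)


# Hypothetical judge output for two prototypes of the same page.
example_scores = {
    "home_desktop": [0.5, 1.0],
    "home_mobile": [0.0, 0.5, 0.5],
}
example_vs = visual_score(example_scores)  # 2.5 / 5 blocks
```

Under the pooled reading, `example_vs` is 0.5. Averaging per prototype first would weight the two-block desktop prototype more heavily and give a different value, so the choice of aggregation matters when prototypes differ in block count.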

Experiment

  • Vision2Web evaluates eight state-of-the-art multimodal models across two coding agent frameworks to assess their capabilities in visual website development, revealing that performance consistently degrades as task complexity increases from static pages to full-stack applications.
  • Agents struggle significantly with smaller device form factors and visually dense prototypes, indicating limited capacity for complex visual reasoning and responsive layout adaptation.
  • Claude-Opus-4.5 demonstrates superior performance across frameworks and task levels compared to other models, while several agents fail entirely on complex full-stack tasks involving multi-page integration.
  • Systematic weaknesses are observed in state-dependent operations such as state management and CRUD operations, whereas navigation and authentication tasks are handled more reliably.
  • Failure analysis identifies distinct gaps in fine-grained visual alignment, cross-module consistency, and long-horizon system planning, which compound as development scope expands.
  • The study validates the reliability of its evaluation pipeline, showing high agreement between the automated GUI agent verifier and human annotations, as well as strong rank consistency between the VLM-based judge and human preferences.
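
One common way to quantify rank consistency between an automated judge and human preferences is Spearman's rank correlation. The sketch below is offered only as an illustration of that idea: the source does not say which statistic the authors used, the function names are hypothetical, the sample scores are made up, and the simple formula applies only when there are no tied values.

```python
def ranks(values):
    """Return 1-based ranks of values, smallest value getting rank 1."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho for two equal-length sequences without ties."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    if len(set(xs)) != len(xs) or len(set(ys)) != len(ys):
        raise ValueError("this simple formula does not support ties")
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up scores for four generated websites.
judge_scores = [0.9, 0.7, 0.4, 0.2]
human_scores = [4, 3, 2, 1]
rho_same_order = spearman_rho(judge_scores, human_scores)
rho_reversed = spearman_rho(judge_scores, list(reversed(human_scores)))
```

A value near 1 means the judge orders the websites the same way humans do, and a value near -1 means the orderings are opposite.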
