HyperAI

Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Zehai He Wenyi Hong Zhen Yang Ziyang Pan Mingdao Liu Xiaotao Gu Jie Tang

Abstract

Recent advances in large language models (LLMs) have brought qualitative leaps in the capabilities of coding agents, yet systematic evaluation of complex end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development that spans UI-to-code generation from static user interfaces, reproduction of interactive multi-page frontends, and long-horizon full-stack website development. The benchmark is built from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, comprehensive, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple vision-language models deployed within different coding-agent frameworks, and the results reveal substantial performance gaps across all task levels, with state-of-the-art models still struggling with full-stack development.

One-sentence Summary

Researchers from Tsinghua University and another institute introduce Vision2Web, a hierarchical benchmark for visual website development that evaluates LLMs across static UI-to-code and full-stack tasks. Using a novel workflow-based agent verification paradigm, the study reveals significant performance gaps in current models for complex, end-to-end web creation.

Key Contributions

  • The paper introduces Vision2Web, a hierarchical benchmark for visual website development that spans static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack development using 193 real-world tasks and 918 prototype images.
  • A workflow-based agent verification paradigm is presented to ensure reproducible evaluation by structuring tests as directed dependency graphs with explicitly defined nodes that constrain agent execution while maintaining flexibility.
  • The work implements two complementary verification components, a GUI agent verifier for functional correctness and a VLM-based judge for visual fidelity, which experiments show reveal substantial performance gaps in state-of-the-art models on full-stack development tasks.

Introduction

Developing and evaluating autonomous agents for visual website creation is critical as these systems move from simple code generation to full end-to-end software development. Prior evaluation methods struggle because traditional unit tests cannot handle diverse implementations, while existing agent-based evaluators often behave unpredictably due to loosely specified objectives. Furthermore, visual testing relies on brittle rule-based scripts or pixel-level comparisons that fail to capture human perceptual judgments. To address these gaps, the authors introduce Vision2Web, a hierarchical benchmark that employs a workflow-based agent verification paradigm. This approach constrains agent execution through structured test workflows and explicit verification nodes, enabling reproducible and implementation-agnostic assessment of both functional correctness and visual fidelity within a unified framework.

Dataset

  • Dataset Composition and Sources

    • The authors construct Vision2Web from real-world websites sourced exclusively from the C4 validation set to prevent data leakage.
    • The benchmark spans four major categories (Content, Transaction, SaaS Platforms, Public Services) and 16 subcategories to ensure diversity.
    • It includes a multimedia resource library containing images, icons, videos, and fonts to simulate realistic development environments.
  • Key Details for Each Subset

    • The dataset comprises 193 tasks divided into three hierarchical levels of increasing complexity:
      • Static Webpage (100 tasks): Focuses on visual fidelity across desktop, tablet, and mobile resolutions using prototype images.
      • Interactive Frontend (66 tasks): Requires generating multi-page frontends with coherent navigation flows based on multiple prototypes and text descriptions.
      • Full-Stack Website (27 tasks): Simulates realistic engineering scenarios with requirement documents, complex state management, and backend integration.
    • The collection includes 918 prototype images and 1,255 test cases, totaling 21,516 input files.
  • Data Processing and Filtering Pipeline

    • A three-stage filtering pipeline refines the initial web corpus:
      • Structural Assessment: Analyzes DOM properties like tag distribution and tree depth to exclude simple or malformed pages, reducing candidates to 63,515.
      • Content Screening: Uses VLM-based scoring to retain only 7,391 pages with functional richness and visual coherence.
      • Manual Review: Human annotators verify page consistency, implementation difficulty, and category balance to finalize the task set.
    • Test case annotation employs an expert-in-the-loop strategy where PhD researchers draft high-level workflows and Claude Code refines them into executable sequences.
  • Usage in Model Evaluation

    • The authors utilize the dataset to evaluate multimodal coding agents via a workflow-based agent verification paradigm.
    • Evaluation relies on a GUI agent verifier to execute test workflows and a VLM-based judge to quantitatively assess visual fidelity against prototypes.
    • The benchmark measures both functional correctness and visual fidelity without relying on external orchestration layers, ensuring agents depend solely on their own reasoning and coding capabilities.
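The structural-assessment stage described above can be sketched as a simple DOM filter. The statistics (tag distribution, maximum tree depth) follow the paper's description, but the parsing approach and the numeric thresholds here are illustrative assumptions, not the authors' implementation:

```python
from html.parser import HTMLParser
from collections import Counter

class DOMStats(HTMLParser):
    """Collect tag distribution and maximum tree depth from raw HTML."""
    VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag not in self.VOID_TAGS:  # void elements never nest
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS:
            self.depth = max(0, self.depth - 1)

def passes_structural_filter(html, min_tags=5, min_depth=3):
    """Keep pages whose DOM is rich enough; thresholds are hypothetical."""
    stats = DOMStats()
    stats.feed(html)
    return sum(stats.tag_counts.values()) >= min_tags and stats.max_depth >= min_depth
```

A real pipeline would combine such structural scores with the later VLM-based content screening rather than relying on hard cutoffs alone.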

Method

The proposed framework for automated website evaluation is structured into three sequential phases: Hierarchical Task Formulation, Coding Agent generation, and Workflow-based Agent Verification. This pipeline ensures a systematic approach to generating and validating full-stack web applications.

The process begins with Hierarchical Task Formulation, which decomposes the development objective into three distinct levels of complexity. Level 1 targets the creation of a Static Webpage, focusing on responsive HTML/CSS output across multiple devices. Level 2 advances to an Interactive Frontend, incorporating inter-page logic to produce an interactive frontend output. Finally, Level 3 addresses the Full-Stack Website, integrating requirements, databases, and assets to generate a complete system output.

Following the task definition, the Coding Agent module utilizes multimodal resources to synthesize the website. This central component processes the specifications from the formulation phase to generate the actual code and assets required for the target system.

The final stage employs Workflow-based Agent Verification to assess the generated output. This stage formalizes end-to-end testing as a directed dependency graph where nodes represent self-contained verification sub-procedures and edges encode sequential dependencies. To balance evaluation stability and coverage efficiency, the system constructs test workflows by decoupling dependent test nodes to prevent error propagation and integrating related test nodes within the same application context.
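The dependency-graph execution described above can be sketched as follows. The node structure and execution order mirror the text (a node runs only after all its dependencies pass, so a failure does not propagate into unrelated branches), but the interfaces are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class TestNode:
    node_id: str
    run: callable                       # verification sub-procedure; True on pass
    deps: list = field(default_factory=list)  # ids of prerequisite nodes

def execute_workflow(nodes):
    """Execute nodes in dependency order; a node with a failed dependency
    is marked failed without being attempted."""
    results = {}
    by_id = {n.node_id: n for n in nodes}

    def resolve(node_id):
        if node_id in results:
            return results[node_id]
        node = by_id[node_id]
        if all(resolve(dep) for dep in node.deps):
            results[node_id] = bool(node.run())
        else:
            results[node_id] = False
        return results[node_id]

    for node in nodes:
        resolve(node.node_id)
    return results
```

Decoupling independent branches this way means, for example, that a broken checkout flow does not mask whether navigation or authentication tests would have passed.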

Verification nodes are categorized into two complementary types. Functional Verification Nodes assess interaction fidelity and are formalized as a 3-tuple $n_i = \langle O_i, A_i, V_i \rangle$, where $O_i$ specifies the testing objective, $A_i$ defines guided actions, and $V_i$ encodes validation criteria. A GUI Agent Verifier executes these nodes, maintaining a context $\mathcal{C}_i = \{ \mathcal{H}_{<i}, O_i, A_i, V_i \}$ that includes historical objectives and actions to ensure reproducible state transitions. The Functional Score (FS) is computed as the proportion of passed functional verification nodes.
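A minimal sketch of the functional node tuple and the Functional Score follows; the field names are illustrative, and in the paper a GUI agent actually executes $A_i$ and checks $V_i$ against the running application:

```python
from dataclasses import dataclass

@dataclass
class FunctionalNode:
    objective: str   # O_i: what the test verifies
    actions: list    # A_i: guided GUI actions the verifier performs
    criteria: str    # V_i: validation criteria for the resulting state

def functional_score(results):
    """FS = proportion of functional verification nodes that passed."""
    return sum(bool(r) for r in results) / len(results) if results else 0.0
```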

Visual Verification Nodes assess visual fidelity by comparing rendered pages against reference prototypes. Each node is formalized as $n_i = \langle P_i \rangle$, where $P_i$ denotes the target prototype. A dedicated VLM Judge is invoked to perform component-level comparisons, assigning fidelity scores based on predefined visual rubrics. The Visual Score (VS) is calculated as the average of all block-level scores across the prototypes. This dual-verifier approach allows for a granular and systematic assessment of both the functional logic and the visual consistency of the generated website.
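The Visual Score aggregation can be sketched directly from its definition. The block granularity and score range here are assumptions; in the paper the per-block fidelity scores come from the VLM judge's rubric-based comparisons:

```python
def visual_score(prototype_block_scores):
    """VS = average of all block-level fidelity scores across all prototypes.

    `prototype_block_scores` is a list with one entry per prototype, each
    entry being the list of block-level scores the VLM judge assigned.
    """
    all_scores = [s for blocks in prototype_block_scores for s in blocks]
    return sum(all_scores) / len(all_scores) if all_scores else 0.0
```

Averaging over blocks rather than whole pages keeps a single badly rendered component from being washed out by an otherwise faithful page.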

Experiment

  • Vision2Web evaluates eight state-of-the-art multimodal models across two coding agent frameworks to assess their capabilities in visual website development, revealing that performance consistently degrades as task complexity increases from static pages to full-stack applications.
  • Agents struggle significantly with smaller device form factors and visually dense prototypes, indicating limited capacity for complex visual reasoning and responsive layout adaptation.
  • Claude-Opus-4.5 demonstrates superior performance across frameworks and task levels compared to other models, while several agents fail entirely on complex full-stack tasks involving multi-page integration.
  • Systematic weaknesses are observed in state-dependent operations such as state management and CRUD operations, whereas navigation and authentication tasks are handled more reliably.
  • Failure analysis identifies distinct gaps in fine-grained visual alignment, cross-module consistency, and long-horizon system planning, which compound as development scope expands.
  • The study validates the reliability of its evaluation pipeline, showing high agreement between the automated GUI agent verifier and human annotations, as well as strong rank consistency between the VLM-based judge and human preferences.

