
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue
Abstract

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.
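To make the three-stage pipeline concrete, the sketch below outlines how grounding, planning, and generation could be chained. This is a minimal illustrative mock-up, not the authors' implementation: all class and function names (UIComponent, LayoutNode, grounding_agent, planning_agent, generation_agent) are hypothetical, the grounding stage returns fixed placeholder detections instead of calling a real vision-language model, and the generation stage emits nested divs rather than performing prompt-based synthesis.

```python
# Hypothetical sketch of a ScreenCoder-style three-stage pipeline.
# The real agents, prompts, and model interfaces live in the authors' repository.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class UIComponent:
    """A UI element detected by the grounding agent."""
    label: str                      # e.g. "header", "button", "hero-image"
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) in screenshot pixels


@dataclass
class LayoutNode:
    """A node in the hierarchical layout produced by the planning agent."""
    component: UIComponent
    children: List["LayoutNode"] = field(default_factory=list)


def grounding_agent(screenshot_path: str) -> List[UIComponent]:
    """Stage 1: detect and label UI components with a vision-language model.
    Here we return a fixed example instead of querying a real VLM."""
    return [
        UIComponent("header", (0, 0, 1280, 80)),
        UIComponent("button", (1100, 20, 120, 40)),
        UIComponent("hero-image", (0, 80, 1280, 400)),
    ]


def planning_agent(components: List[UIComponent]) -> LayoutNode:
    """Stage 2: arrange components into a hierarchy using simple front-end
    priors (a page container wrapping regions in top-to-bottom order)."""
    root = LayoutNode(UIComponent("page", (0, 0, 1280, 720)))
    for comp in sorted(components, key=lambda c: c.bbox[1]):
        root.children.append(LayoutNode(comp))
    return root


def generation_agent(layout: LayoutNode) -> str:
    """Stage 3: synthesize HTML from the layout tree. A real system would
    prompt an LLM adaptively; this stub emits minimal nested divs."""
    def render(node: LayoutNode, depth: int = 0) -> str:
        indent = "  " * depth
        inner = "\n".join(render(child, depth + 1) for child in node.children)
        body = f"\n{inner}\n{indent}" if node.children else ""
        return f'{indent}<div class="{node.component.label}">{body}</div>'
    return render(layout)


if __name__ == "__main__":
    components = grounding_agent("screenshot.png")  # hypothetical input path
    layout = planning_agent(components)
    print(generation_agent(layout))
```

Splitting the task this way keeps each stage inspectable: detections, the layout tree, and the generated markup can each be checked or corrected independently, which is the interpretability advantage the abstract claims over end-to-end black-box generation.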