Smol2Operator: Training Lightweight GUI Agents with Open-Source Tools and Reproducible Workflows
This work introduces Smol2Operator, a lightweight vision-language model that is taken from zero GUI grounding to agentic GUI coding through a two-phase training process. The approach demonstrates how a small model such as SmolVLM2-2.2B-Instruct can be systematically trained to understand and interact with graphical user interfaces using only supervised fine-tuning.

The first phase instills basic perception by training the model to link natural-language instructions with precise GUI actions, using the smolagents/aguvis-stage-1 dataset, which pairs screenshots with low-level action commands. A key innovation is a unified action space that standardizes the diverse function formats found across multiple source datasets. This includes normalizing coordinates to the [0, 1] range so that actions remain valid across different image resolutions, which is critical for generalization (a minimal sketch of this preprocessing step follows this summary). Extensive ablation studies were run to find the best training configuration: a 1152 px image resolution with normalized coordinates performs best, raising ScreenSpot-v2 accuracy from 0% (baseline) to 41.27% after two epochs and demonstrating that the model has learned visual grounding (a sketch of the point-in-box grounding metric commonly used for ScreenSpot-style evaluation also appears below).

The second phase builds on this foundation by introducing agentic reasoning. Using the smolagents/aguvis-stage-2 dataset, which contains multi-turn dialogues involving planning and decision-making (an illustrative sample layout is sketched below), the model is fine-tuned to generate thoughtful, context-aware actions. This phase improves ScreenSpot-v2 accuracy to 61.71%, showing that reasoning enhances both action precision and task-level understanding.

Importantly, the training strategy scales down as well: applied to a much smaller model (nanoVLM-460M), it reaches roughly 58% on ScreenSpot-v2, surpassing prior results for models of that size and highlighting the effectiveness of high-quality, structured data over raw model capacity.

All components are open-sourced, including the full training recipe, the preprocessing tools, the datasets (smolagents/aguvis-stage-1 and smolagents/aguvis-stage-2), and the final trained model (smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI). A demo space is also available for interactive testing.

The work underscores that GUI grounding is driven primarily by data quality and task design: by combining standardized action representations with reasoning-rich training data, even small models can become capable GUI agents. Future directions include reinforcement learning and preference-based methods to enable real-time adaptation and more complex behavior. The open release aims to empower researchers to build on this foundation and accelerate progress in agentic GUI systems.
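To make the coordinate-normalization step concrete, here is a minimal sketch of how pixel coordinates inside an action string could be rescaled to the [0, 1] range. The function name, the `click(x=..., y=...)` action format, and the regular expression are illustrative assumptions, not the exact conventions of the released preprocessing tools.

```python
import re

def normalize_action_coords(action: str, width: int, height: int) -> str:
    """Rescale pixel x/y arguments in an action call to the [0, 1] range.

    Illustrative helper only; the released preprocessing tools may use
    different action names, argument formats, and rounding.
    """
    def _rescale(match: re.Match) -> str:
        x_px, y_px = float(match.group(1)), float(match.group(2))
        return f"x={x_px / width:.4f}, y={y_px / height:.4f}"

    # Match patterns like "x=576, y=288" inside an action string.
    return re.sub(r"x=(\d+(?:\.\d+)?),\s*y=(\d+(?:\.\d+)?)", _rescale, action)

# A click at pixel (576, 288) on a 1152x1152 screenshot becomes resolution-independent:
print(normalize_action_coords("click(x=576, y=288)", width=1152, height=1152))
# -> click(x=0.5000, y=0.2500)
```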
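The multi-turn, reasoning-rich structure of the second training phase can be pictured with a sample like the one below. The field names and the Thought/Action layout are assumptions for illustration; consult the smolagents/aguvis-stage-2 dataset card for the exact schema.

```python
# Hypothetical shape of a stage-2 style training sample: a conversation that
# pairs a screenshot and an instruction with a short reasoning step and a
# normalized action call. Not the exact schema of smolagents/aguvis-stage-2.
sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "screenshot_0.png"},
                {"type": "text", "text": "Open the settings menu."},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Thought: The gear icon in the top-right corner opens the settings menu.\n"
                        "Action: click(x=0.9531, y=0.0417)"
                    ),
                },
            ],
        },
    ]
}
```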
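For the reported ScreenSpot-v2 numbers, grounding accuracy is commonly computed as the fraction of predictions whose click point lands inside the target element's bounding box. The sketch below assumes that convention and that the model's predictions come back in normalized coordinates; it is not the official evaluation code.

```python
def is_hit(pred_x_norm: float, pred_y_norm: float,
           bbox_px: tuple[float, float, float, float],
           width: int, height: int) -> bool:
    """Return True if a predicted normalized click point falls inside the
    ground-truth element bounding box (x1, y1, x2, y2) given in pixels.
    Assumes the common point-in-box grounding metric."""
    x_px = pred_x_norm * width
    y_px = pred_y_norm * height
    x1, y1, x2, y2 = bbox_px
    return x1 <= x_px <= x2 and y1 <= y_px <= y2

# Example: a prediction at (0.50, 0.25) on a 1152x1152 screenshot,
# with the target element spanning pixels (540, 260) to (620, 320).
print(is_hit(0.50, 0.25, (540, 260, 620, 320), 1152, 1152))  # True
```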