Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation

In recent years, Multimodal Large Language Models (MLLMs) have been extensively utilized for multimodal reasoning tasks, including Graphical User Interface (GUI) automation. Unlike general offline multimodal tasks, GUI automation is executed in online interactive environments, necessitating step-by-step decision-making based on the real-time status of the environment. This task has a lower tolerance for decision-making errors at each step, as mistakes may accumulate, disrupt the process, and potentially lead to irreversible outcomes such as deletions or payments. To address these issues, we introduce a pre-operative critic mechanism that provides effective feedback prior to actual execution by reasoning about the potential outcome and correctness of actions. Specifically, we propose a Suggestion-aware Gradient Relative Policy Optimization (S-GRPO) strategy to construct our pre-operative critic model, GUI-Critic-R1, incorporating a novel suggestion reward to enhance the reliability of the model's feedback. Furthermore, we develop a reasoning-bootstrapping-based data collection pipeline to create GUI-Critic-Train and GUI-Critic-Test, filling existing gaps in GUI critic data. Static experiments on GUI-Critic-Test across both mobile and web domains reveal that GUI-Critic-R1 offers significant advantages in critic accuracy compared to current MLLMs. Dynamic evaluation on a GUI automation benchmark further highlights the effectiveness and superiority of our model, as evidenced by improved success rates and operational efficiency.
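
To make the pre-operative idea concrete, the following is a minimal sketch of a control loop in which a critic inspects each proposed action before it is executed. It is an illustration under stated assumptions, not the GUI-Critic-R1 implementation: the names `Critique`, `run_with_critic`, `observe`, `propose`, `critic`, and `execute`, the string encoding of states and actions, and the retry policy are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Critique:
    """Hypothetical critic output: a verdict plus a corrective suggestion."""
    correct: bool
    suggestion: str


def run_with_critic(
    observe: Callable[[], str],
    propose: Callable[[str, Optional[str]], Optional[str]],
    critic: Callable[[str, str], Critique],
    execute: Callable[[str], None],
    max_steps: int = 10,
    max_retries: int = 2,
) -> List[str]:
    """Run an agent loop where every action is critiqued BEFORE execution.

    Assumption: if the critic still rejects the action after `max_retries`
    revisions, the last proposal is executed anyway. A stricter policy
    could abort instead.
    """
    executed: List[str] = []
    for _ in range(max_steps):
        state = observe()
        action = propose(state, None)  # first proposal, no hint yet
        retries = 0
        while action is not None and retries < max_retries:
            verdict = critic(state, action)
            if verdict.correct:
                break
            # Feed the critic's suggestion back to the agent and re-propose.
            action = propose(state, verdict.suggestion)
            retries += 1
        if action is None:  # the agent signals that the task is finished
            break
        execute(action)
        executed.append(action)
    return executed
```

In this sketch the critic never touches the environment; it only sees the current state and the candidate action, which is what allows a risky step such as a deletion or payment to be caught before it takes effect.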