
SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

Abstract

Agentic reinforcement learning (RL) holds great promise for developing autonomous agents on complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is a passive, post-hoc process: a verifier (e.g., a rule-based scoring script, a reward or critic model, or an LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine whether the agent succeeded. Processing such verbose context, full of irrelevant and noisy history, strains the verification protocol and leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent with a dual mission: not only to complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification over a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier, which determines its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that the SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains of up to 26.08% and 16.66% to 8B and 30B models, respectively. The synergy between solution finding and evidence seeking cultivates efficient, self-verifying agents with performance competitive with DeepSeek V3.1 and Qwen3-235B-A22B.

One-sentence Summary

The authors from Tencent Youtu Lab and Peking University propose SmartSnap, a paradigm shift in agentic reinforcement learning in which LLM-driven agents proactively curate minimal, decisive snapshot evidence, guided by the principles of Completeness, Conciseness, and Creativity, enabling efficient and reliable self-verification, delivering performance gains of up to 26.08% for 8B models on Android tasks, and achieving competitive results against larger models such as Qwen3-235B-A22B.

Key Contributions

  • We introduce SmartSnap, a paradigm shift from passive, post-hoc verification to proactive, in-situ self-verification, where the agent autonomously curates minimal, decisive evidence snapshots to prove task completion, significantly improving scalability and reliability in GUI-based agentic reinforcement learning.

  • Guided by the 3C Principles—Completeness, Conciseness, and Creativity—the Self-Verifying Agent learns to generate high-quality evidence sets that are both sufficient and compact, enabling efficient evaluation by a general LLM-as-a-Judge verifier without relying on verbose or noisy full trajectories.

  • Experiments on the AndroidLab benchmark show that SmartSnap boosts task success rates by up to 26.08% and 16.66% for 8B and 30B models respectively, reaching performance competitive with strong baselines such as DeepSeek V3.1 and Qwen3-235B-A22B, while reducing verification costs through intrinsic reward shaping and structured feedback.

Introduction

The authors address a critical challenge in agentic reinforcement learning: the scalability bottleneck caused by inefficient and unreliable post-hoc verification of task completion in complex GUI environments. Prior approaches rely on passive verification, where a verifier—such as a rule-based system, reward model, or LLM-as-a-Judge—analyzes full interaction trajectories, leading to high computational costs and reduced reliability due to noisy, irrelevant context. To overcome this, the authors introduce SmartSnap, a paradigm shift toward proactive, in-situ self-verification by the agent itself. The core contribution is the Self-Verifying Agent, which operates under the 3C Principles—Completeness, Conciseness, and Creativity—to autonomously collect minimal, decisive snapshot evidence during task execution. This curated evidence is then used by a general LLM-as-a-Judge for fast, reliable validation. Experiments on mobile tasks show significant performance gains—up to 26.08% and 16.66% for 8B and 30B models—demonstrating scalable, efficient training with improved success rates and competitive performance against large models like DeepSeek V3.1 and Qwen3-235B-A22B.

Dataset

  • The dataset is derived from AndroidLab, a reproducible benchmark with 138 tasks across nine Android apps, each running on predefined Android Virtual Devices (AVDs).
  • Only the predefined tasks and their initial environments are used, excluding AndroidLab’s rule-based evaluation system to enable the Self-Verifying paradigm.
  • The observation space uses compressed XML tree representations instead of screenshots, emphasizing planning and decision-making over visual perception.
  • The action space includes the native AndroidLab actions Tap, Swipe, Type, Long Press, Home, and Back, plus a custom submit tool for evidence snapshots; a sketch of this augmented action space follows the list.
  • For training, the authors replicate the 726 tasks from the Android Instruct dataset to ensure comparability across experiments.
  • The supervised fine-tuning (SFT) dataset is built using a rollout framework based on the VeRL agent loop, with two large language models—DeepSeek V3.1 and Qwen3-235B-A22B—generating trajectories across all 726 tasks.
  • Trajectories are collected with both task completion and evidence submission, ensuring diverse and high-quality data.
  • Original Android Instruct responses are updated with outputs from the two LLM agents, resulting in approximately 550K trainable QA pairs from 30K trajectories.
  • A random subset of 100K QA pairs is selected as the cold-start dataset for initial training.
  • Task distribution statistics are detailed in Table 1, which provides breakdowns across apps and task types.
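
As a rough illustration of the augmented action space described above, the sketch below models the native AndroidLab actions plus the custom submit tool as Python types. The class names, fields, and argument shapes are assumptions for illustration, not the authors' actual tool schema.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical sketch of the augmented action space: native AndroidLab
# execution actions plus the custom `submit` curation tool. Names and
# fields are illustrative assumptions, not the paper's exact schema.

@dataclass
class Tap:
    element_id: int          # index of the target node in the compressed XML tree

@dataclass
class Swipe:
    element_id: int
    direction: str           # "up" | "down" | "left" | "right"

@dataclass
class Type:
    element_id: int
    text: str

@dataclass
class LongPress:
    element_id: int

@dataclass
class Home:
    pass

@dataclass
class Back:
    pass

@dataclass
class Submit:
    """Terminal curation action: ends the episode and hands over evidence."""
    message: str                                                # final free-text message to the verifier
    evidence_indices: List[int] = field(default_factory=list)   # indices of (action, observation) pairs

Action = Union[Tap, Swipe, Type, LongPress, Home, Back, Submit]
```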

Method

The authors leverage a self-verifying agent framework, formalized as an augmented Markov Decision Process, to address the challenge of reliable task completion in interactive environments. The agent operates within an environment modeled as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}', P)$, where the augmented action space $\mathcal{A}' = \mathcal{A}_{\text{exec}} \cup \mathcal{A}_{\text{curate}}$ is the union of execution actions $\mathcal{A}_{\text{exec}}$ (e.g., click, type) and curation actions $\mathcal{A}_{\text{curate}}$ (e.g., submit). The agent's policy $\pi_{\theta}$ maps the history of observations, actions, and the task instruction $I$ to a probability distribution over $\mathcal{A}'$. The agent's trajectory terminates when it selects a terminal action from $\mathcal{A}_{\text{curate}}$, such as submit('text', E), where text is a final message and E is the curated evidence set. This evidence is not a self-declared summary but a set of integer indices pointing to critical interaction pairs from the agent's history.
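
To make the augmented decision process concrete, here is a minimal rollout sketch in Python. The environment and policy interfaces, the dict-based action representation, and the submit argument names are assumptions for illustration rather than SmartSnap's actual code.

```python
# Hypothetical rollout loop for a self-verifying agent. The env/policy
# interfaces and action format are illustrative assumptions.

def rollout(env, policy, instruction, max_turns=30):
    history = []                      # list of (action, observation) pairs
    obs = env.reset(instruction)      # initial compressed XML observation
    for _ in range(max_turns):
        # The policy conditions on the instruction and the interaction history.
        action = policy.act(instruction, history, obs)
        if action["name"] == "submit":
            # Terminal curation action: a final message plus integer indices
            # into `history` identifying the decisive (action, observation) pairs.
            message = action["args"]["message"]
            evidence_indices = action["args"]["evidence_indices"]
            exhibits = [history[i] for i in evidence_indices]
            return message, exhibits
        # Execution action (Tap, Swipe, Type, ...): step the environment.
        next_obs = env.step(action)
        history.append((action, next_obs))
        obs = next_obs
    return None, []                   # budget exhausted without submission
```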

The core of the SmartSnap framework lies in the grounding of evidence in tool interactions. The authors define a single piece of evidence, an exhibit, as an atomic tuple $(a_t, s_{t+1})$, representing a direct cause-and-effect event between a tool call $a_t$ and its immediate observation $s_{t+1}$. This definition ensures the evidence is an objective, unalterable fact, contrasting with the untrustworthy nature of self-declared summaries. The agent's learning objective is to select a minimal yet decisive evidence set $E$ from its full interaction history. This curated evidence is then programmatically formatted using the OpenAI chat template, resulting in a string representation that is fed to the verifier.
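
The sketch below shows one plausible way to serialize the curated exhibits into a chat-style prompt for the LLM-as-a-Judge verifier. The paper formats exhibits with the OpenAI chat template directly, so the roles, field names, and wording here are simplifying assumptions.

```python
def format_exhibits(instruction, final_message, exhibits):
    """Render curated (tool_call, observation) exhibits as chat messages for
    the LLM-as-a-Judge verifier. Roles and wording are illustrative assumptions."""
    evidence_lines = []
    for i, (action, observation) in enumerate(exhibits):
        # Each exhibit is an atomic cause-and-effect pair: a tool call and
        # the observation returned immediately afterwards.
        evidence_lines.append(f"Exhibit {i}:")
        evidence_lines.append(f"  tool call   : {action}")
        evidence_lines.append(f"  observation : {observation}")
    return [
        {"role": "system",
         "content": ("You are a strict verifier. Declare success only upon "
                     "unequivocal proof grounded in the submitted evidence.")},
        {"role": "user",
         "content": (f"Task instruction: {instruction}\n"
                     f"Agent's final message: {final_message}\n\n"
                     + "\n".join(evidence_lines))},
    ]
```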

To guide the agent's evidence curation behavior, the authors formalize the 3C Principles: Completeness, Conciseness, and Creativity. These principles are embedded as meta-instructions in the agent's system prompt. Completeness requires the agent to include all pivotal tool interactions to maximize the True Positive rate of the verifier, ensuring a correctly executed task is not penalized. Conciseness aims to minimize the False Positive rate by reducing redundant information, which is crucial for the verifier's efficiency and robustness, mitigating risks like context rot. Creativity encourages the agent to take additional, evidence-oriented actions if necessary, transforming it from a passive historian into a proactive investigator. This principle allows the agent to create evidence by inspecting the environment for indirect consequences of its actions, thereby achieving a more robust and comprehensive proof of task completion.
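
To illustrate how the 3C Principles could be embedded as meta-instructions, here is a hypothetical system-prompt fragment; the wording is an assumption, not the authors' actual prompt.

```python
# Hypothetical system-prompt fragment embedding the 3C Principles as
# meta-instructions. The wording is illustrative, not the paper's prompt.
SELF_VERIFYING_SYSTEM_PROMPT = """\
You must both complete the task and prove its completion.
When you finish, call submit(message, evidence_indices), where the indices
point to (tool call, observation) pairs from your own interaction history.

Follow the 3C Principles when curating evidence:
1. Completeness: include every pivotal tool interaction needed to prove the
   task was executed correctly, so a correct run is never judged a failure.
2. Conciseness: keep the evidence set minimal; drop redundant or irrelevant
   interactions so the verifier sees only decisive information.
3. Creativity: if the existing history is not conclusive, take additional
   evidence-oriented actions (e.g., reopen a screen to show the final state)
   before submitting, to create a robust proof.
"""
```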

The evidence verification and reward shaping process is designed to provide dense, structured feedback. A multi-component reward function $R_{\text{total}}$ is constructed from four components: $R_{\text{format}}$ penalizes formatting errors; $R_{\text{validity}}$ rewards relevant evidence; $R_{\text{complete}}$ is a binary reward based on a rigorous "success only upon unequivocal proof" rule, which includes checks for zero assumptions and traceable reasoning; and $R_{\text{concise}}$ penalizes the size of the evidence set. This composite reward function guides the agent's end-to-end learning. The authors adopt Group Relative Policy Optimization (GRPO) as the learning algorithm, which eliminates the need for a separate critic network by computing the advantage function relative to a group of trajectories sampled from the same policy. This approach reduces training cost and memory overhead.
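
The following is a minimal sketch of how such a composite reward and a GRPO-style group-relative advantage could be computed. The component weights, functional forms, and aggregation are assumptions, since the paper's exact formulas are not reproduced here.

```python
import numpy as np

def total_reward(fmt_ok, valid_frac, complete, n_evidence,
                 w_format=0.1, w_validity=0.2, w_complete=1.0, w_concise=0.1):
    """Composite reward sketch. Weights and forms are assumptions:
    R_format penalizes malformed submissions, R_validity rewards relevant
    exhibits, R_complete is the binary 'unequivocal proof' verdict, and
    R_concise penalizes large evidence sets."""
    r_format = 0.0 if fmt_ok else -w_format
    r_validity = w_validity * valid_frac          # fraction of exhibits judged relevant
    r_complete = w_complete * float(complete)     # 1.0 only on unequivocal proof
    r_concise = -w_concise * max(0, n_evidence - 1)
    return r_format + r_validity + r_complete + r_concise

def grpo_advantages(group_rewards):
    """Group-relative advantages: normalize each trajectory's reward against
    the mean and std of its sampling group, removing the need for a critic."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: advantages for a group of 4 rollouts of the same task.
rewards = [total_reward(True, 1.0, True, 2),
           total_reward(True, 0.5, False, 4),
           total_reward(False, 0.0, False, 0),
           total_reward(True, 1.0, True, 1)]
print(grpo_advantages(rewards))
```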

Experiment

  • SmartSnap introduces a self-verifying agent paradigm that integrates task execution and evidence curation, significantly improving performance on AndroidLab across model families and scales without relying on rule-based verifiers or task-specific reward models.
  • On AndroidLab with XML mode, SmartSnap (Qwen3-8B-Instruct) and SmartSnap (Qwen3-32B-Instruct) achieve on-par success rates with DeepSeek-V3.1 and Qwen3-235B-A22B, respectively, demonstrating strong generalization across model sizes.
  • All models under investigation show performance gains exceeding 16% over vanilla prompting and fine-tuning, with consistent improvements across most app categories except Maps.me, where a knowledge gap limits progress.
  • Self-verification enables agents to curate minimal, high-quality evidence sets (averaging 1.5 evidence items per task), leading to more efficient, compact interactions and reduced reliance on trial-and-error.
  • Reinforcement learning under SmartSnap consistently increases training rewards and validation accuracy, with decreasing interaction turns and response lengths, indicating improved task efficiency and policy refinement.
  • Case studies show agents learn to correct misinterpretations of UI elements, discover optimal paths (e.g., using floating buttons or search), and submit decisive evidence, demonstrating enhanced problem decomposition and self-verification capabilities.
  • Performance stagnation on complex domains like Calendar, Maps.me, and Zoom highlights the need for domain-specific knowledge injection via continual pre-training and larger, balanced training datasets.

Across these evaluations, SmartSnap consistently improves the success rate of LLM-driven agents on AndroidLab over vanilla prompting and supervised fine-tuning, across model families, scales, and most app categories. Reinforcement learning under the self-verifying paradigm yields the largest gains, with agents completing tasks more efficiently and submitting concise, relevant evidence; the best results are obtained with the ReAct mode and Qwen3-32B-Instruct, and the trained models reach performance comparable to larger baselines such as DeepSeek-V3.1, GPT-4, and Qwen3-235B-A22B.

An analysis of the task distribution across app categories in the AndroidLab benchmark shows that the training set is heavily dominated by Settings tasks (31.26%), whereas the validation set is more balanced, with notable representation from Calendar, Bluecoins, and Contacts. This imbalance suggests that model performance may be particularly influenced by the overrepresentation of certain app domains in the training data.

