HyperAIHyperAI

Command Palette

Search for a command to run...

SmartSnap: 자가 검증 에이전트를 위한 사전적 증거 탐색

초록

에이전트 기반 강화학습(RL)은 복잡한 GUI 작업 환경에서 자율 에이전트 개발에 큰 잠재력을 지니고 있으나, 작업 완료 여부를 검증하는 과정에서의 확장성 한계로 인해 여전히 심각한 제약을 받고 있다. 기존의 작업 검증 방식은 수동적이고 사후적(prospective)인 처리 방식으로, 검증기(예: 규칙 기반 점수 산정 스크립트, 보상 모델 또는 비판 모델, LLM-기반 심사자)가 에이전트의 전체 상호작용 트래잭터리를 분석하여 성공 여부를 판단한다. 이 과정은 관련 없거나 노이즈가 많은 과거 기록을 포함한 방대한 컨텍스트를 처리해야 하므로 검증 프로토콜에 부담을 주며, 결과적으로 높은 비용과 낮은 신뢰성 문제를 야기한다. 이러한 병목 현상을 극복하기 위해, 본 연구는 수동적·사후적 검증에서 벗어나 에이전트 자체가 능동적이고 실시간(self-verification)으로 검증을 수행하는 패러다임 전환을 제안한다. 이를 위해 우리는 '자기 검증 에이전트(Self-Verifying Agent)'라는 새로운 유형의 에이전트를 제안한다. 이 에이전트는 단순한 작업 완수를 넘어서, 정제된 스크린샷 증거를 통해 성과를 입증하는 이중 임무를 수행하도록 설계되었다. 제안한 3C 원칙(완전성, 간결성, 창의성)을 기반으로, 에이전트는 온라인 환경에 대한 접근성을 활용해 최소한이면서 결정적인 스크린샷 집합을 기반으로 자기 검증을 수행한다. 이러한 증거는 일반적인 LLM-기반 심사자(General LLM-as-a-Judge)가 유효성과 관련성을 판단할 수 있는 유일한 자료로 제공된다. 다양한 모델 패밀리와 규모에 걸친 모바일 작업에 대한 실험 결과, 본 연구의 SmartSnap 패러다임은 LLM 기반 에이전트의 확장 가능한 훈련을 가능하게 하며, 8B 및 30B 모델에 각각 최대 26.08%, 16.66%의 성능 향상을 가져왔다. 문제 해결 과정과 증거 탐색의 상호 작용은 효율적이고 자기 검증 능력을 갖춘 에이전트의 개발을 촉진하며, DeepSeek V3.1 및 Qwen3-235B-A22B와 경쟁 가능한 성능을 달성한다.

One-sentence Summary

The authors from Tencent Youtu Lab and Peking University propose SmartSnap, a paradigm shift in agentic reinforcement learning that enables LLM-driven agents to proactively curate minimal, decisive snapshot evidences—guided by Completeness, Conciseness, and Creativity principles—enabling efficient, reliable self-verification and achieving up to 26.08% performance gains over 8B models on Android tasks, with competitive results against larger models like Qwen3-235B-A22B.

Key Contributions

  • We introduce SmartSnap, a paradigm shift from passive, post-hoc verification to proactive, in-situ self-verification, where the agent autonomously curates minimal, decisive evidence snapshots to prove task completion, significantly improving scalability and reliability in GUI-based agentic reinforcement learning.

  • Guided by the 3C Principles—Completeness, Conciseness, and Creativity—the Self-Verifying Agent learns to generate high-quality evidence sets that are both sufficient and compact, enabling efficient evaluation by a general LLM-as-a-Judge verifier without relying on verbose or noisy full trajectories.

  • Experiments on the AndroidLab benchmark show that SmartSnap boosts task success rates by up to 26.08% and 16.66% for 8B and 30B models respectively, outperforming strong baselines like DeepSeek V3.1 and Qwen3-235B-A22B, while reducing verification costs through intrinsic reward shaping and structured feedback.

Introduction

The authors address a critical challenge in agentic reinforcement learning: the scalability bottleneck caused by inefficient and unreliable post-hoc verification of task completion in complex GUI environments. Prior approaches rely on passive verification, where a verifier—such as a rule-based system, reward model, or LLM-as-a-Judge—analyzes full interaction trajectories, leading to high computational costs and reduced reliability due to noisy, irrelevant context. To overcome this, the authors introduce SmartSnap, a paradigm shift toward proactive, in-situ self-verification by the agent itself. The core contribution is the Self-Verifying Agent, which operates under the 3C Principles—Completeness, Conciseness, and Creativity—to autonomously collect minimal, decisive snapshot evidence during task execution. This curated evidence is then used by a general LLM-as-a-Judge for fast, reliable validation. Experiments on mobile tasks show significant performance gains—up to 26.08% and 16.66% for 8B and 30B models—demonstrating scalable, efficient training with improved success rates and competitive performance against large models like DeepSeek V3.1 and Qwen3-235B-A22B.

Dataset

  • The dataset is derived from AndroidLab, a reproducible benchmark with 138 tasks across nine Android apps, each running on predefined Android Virtual Devices (AVDs).
  • Only the predefined tasks and their initial environments are used, excluding AndroidLab’s rule-based evaluation system to enable the Self-Verifying paradigm.
  • The observation space uses compressed XML tree representations instead of screenshots, emphasizing planning and decision-making over visual perception.
  • The action space includes native AndroidLab actions: Tap, Swipe, Type, Long Press, Home, Back, plus a custom submit tool for evidence snapshots.
  • For training, the authors replicate the 726 tasks from the Android Instruct dataset to ensure comparability across experiments.
  • The supervised fine-tuning (SFT) dataset is built using a rollout framework based on the VeRL agent loop, with two large language models—DeepSeek V3.1 and Qwen3-235B-A22B—generating trajectories across all 726 tasks.
  • Trajectories are collected with both task completion and evidence submission, ensuring diverse and high-quality data.
  • Original Android Instruct responses are updated with outputs from the two LLM agents, resulting in approximately 550K trainable QA pairs from 30K trajectories.
  • A random subset of 100K QA pairs is selected as the cold-start dataset for initial training.
  • Task distribution statistics are detailed in Table 1, which provides breakdowns across apps and task types.

Method

The authors leverage a self-verifying agent framework, formalized as an augmented Markov Decision Process, to address the challenge of reliable task completion in interactive environments. The agent operates within an environment modeled as a tuple M=(S,A,P)\mathcal{M} = (\mathcal{S}, \mathcal{A}', P)M=(S,A,P), where the augmented action space A\mathcal{A}'A is the union of execution actions Aexec\mathcal{A}_{\text{exec}}Aexec (e.g., click, type) and curation actions Acurate\mathcal{A}_{\text{curate}}Acurate (e.g., submit). The agent's policy πθ\pi_{\theta}πθ maps the history of observations, actions, and the task instruction III to a probability distribution over A\mathcal{A}'A. The agent's trajectory terminates when it selects a terminal action from Acurate\mathcal{A}_{\text{curate}}Acurate, such as submit('text', E), where text is a final message and E is the curated evidence set. This evidence is not a self-declared summary but a set of integer indices pointing to critical interaction pairs from the agent's history.

The core of the SmartSnap framework lies in the grounding of evidence in tool interactions. The authors define a single piece of evidence, an exhibit, as an atomic tuple (at,st+1)(a_t, s_{t+1})(at,st+1), representing a direct cause-and-effect event between a tool call ata_tat and its immediate observation st+1s_{t+1}st+1. This definition ensures the evidence is an objective, unalterable fact, contrasting with the untrustworthy nature of self-declared summaries. The agent's learning objective is to select a minimal yet decisive evidence set EEE from its full interaction history. This curated evidence is then programmatically formatted using the OpenAI chat template, resulting in a string representation that is fed to the verifier.

To guide the agent's evidence curation behavior, the authors formalize the 3C Principles: Completeness, Conciseness, and Creativity. These principles are embedded as meta-instructions in the agent's system prompt. Completeness requires the agent to include all pivotal tool interactions to maximize the True Positive rate of the verifier, ensuring a correctly executed task is not penalized. Conciseness aims to minimize the False Positive rate by reducing redundant information, which is crucial for the verifier's efficiency and robustness, mitigating risks like context rot. Creativity encourages the agent to take additional, evidence-oriented actions if necessary, transforming it from a passive historian into a proactive investigator. This principle allows the agent to create evidence by inspecting the environment for indirect consequences of its actions, thereby achieving a more robust and comprehensive proof of task completion.

The evidence verification and reward shaping process is designed to provide dense, structured feedback. A multi-component reward function RtotalR_{\text{total}}Rtotal is constructed from four components: RformatR_{\text{format}}Rformat penalizes formatting errors; RvalidityR_{\text{validity}}Rvalidity rewards relevant evidence; RcompleteR_{\text{complete}}Rcomplete is a binary reward based on a rigorous "success only upon unequivocal proof" rule, which includes checks for zero assumptions and traceable reasoning; and RconciseR_{\text{concise}}Rconcise penalizes the size of the evidence set. This composite reward function guides the agent's end-to-end learning. The authors adopt Group Relative Policy Optimization (GRPO) as the learning algorithm, which eliminates the need for a separate critic network by computing the advantage function relative to a group of trajectories sampled from the same policy. This approach reduces training cost and memory overhead.

Experiment

  • SmartSnap introduces a self-verifying agent paradigm that integrates task execution and evidence curation, significantly improving performance on AndroidLab across model families and scales without relying on rule-based verifiers or task-specific reward models.
  • On AndroidLab with XML mode, SmartSnap (Qwen3-8B-Instruct) and SmartSnap (Qwen3-32B-Instruct) achieve on-par success rates with DeepSeek-V3.1 and Qwen3-235B-A22B, respectively, demonstrating strong generalization across model sizes.
  • All models under investigation show performance gains exceeding 16% over vanilla prompting and fine-tuning, with consistent improvements across most app categories except Maps.me, where a knowledge gap limits progress.
  • Self-verification enables agents to curate minimal, high-quality evidence sets—averaging 1.5 evidences per task—leading to more efficient, compact interactions and reduced reliance on trial-and-error.
  • Reinforcement learning under SmartSnap consistently increases training rewards and validation accuracy, with decreasing interaction turns and response lengths, indicating improved task efficiency and policy refinement.
  • Case studies show agents learn to correct misinterpretations of UI elements, discover optimal paths (e.g., using floating buttons or search), and submit decisive evidence, demonstrating enhanced problem decomposition and self-verification capabilities.
  • Performance stagnation on complex domains like Calendar, Maps.me, and Zoom highlights the need for domain-specific knowledge injection via continual pre-training and larger, balanced training datasets.

The authors use SmartSnap to enhance the performance of LLM-driven agents on AndroidLab, achieving significant improvements in success rate across various app categories and model scales. Results show that the proposed self-verifying paradigm, particularly through reinforcement learning, enables agents to complete tasks more efficiently and generate concise, relevant evidence, leading to higher success rates compared to vanilla prompting and fine-tuning methods.

The authors use SmartSnap to enhance the performance of LLM-driven agents on AndroidLab, achieving significant improvements in success rate and evidence curation through reinforcement learning. Results show that the proposed method consistently outperforms vanilla prompting and fine-tuning across various model families and scales, with the best results obtained using the ReAct mode and Qwen3-32B-Instruct.

The authors analyze the distribution of tasks across app categories in the AndroidLab benchmark, showing that the training set is heavily dominated by tasks from the Settings app (31.26%), while the Validation set includes a more balanced distribution with notable representation from Calendar, Bluecoins, and Contacts. This imbalance suggests that model performance may be particularly influenced by the overrepresentation of certain app domains in the training data.

The authors use SmartSnap to enhance the performance of LLM-driven agents on AndroidLab, achieving significant improvements in success rate across various model families and scales. Results show that both fine-tuning and reinforcement learning with SmartSnap lead to substantial gains, with models like Qwen3-32B-Instruct and Qwen3-235B-A22B reaching performance comparable to larger models such as DeepSeek-V3.1 and GPT-4.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp