HyperAIHyperAI

Command Palette

Search for a command to run...

AgentDoG: 인공지능 에이전트의 안전성 및 보안을 위한 진단용 가드레일 프레임워크

초록

AI 에이전트의 부상은 자율적인 도구 사용과 환경 상호작용으로 인해 복잡한 안전성 및 보안 도전 과제를 야기하고 있다. 현재의 보호장치 모델은 에이전트 위험에 대한 인식이 부족하고, 위험 진단의 투명성이 결여되어 있다. 복잡하고 다양한 위험 행동을 포괄할 수 있는 에이전트 기반 보호장치를 도입하기 위해, 우리는 먼저 위험의 원인(어디서), 고장 유형(어떻게), 결과(무엇이 일어나는가)를 정사각형으로 분류하여 에이전트 위험을 종합적으로 체계화한 삼차원 분류 체계를 제안한다. 이러한 구조적이고 계층적인 분류 체계를 기반으로, 새로운 세밀한 수준의 에이전트 안전성 기준(ATBench)과 에이전트 안전성 및 보안을 위한 진단형 보호장치 프레임워크(AgentDoG)를 도입한다. AgentDoG는 에이전트의 행동 궤적 전반에 걸쳐 세밀하고 맥락 기반의 모니터링을 제공한다. 더욱 중요한 점은, AgentDoG가 안전하지 않은 행동뿐 아니라 보이기만 하지는 않지만 타당하지 않은 행동의 근본 원인을 진단할 수 있다는 점이다. 이는 단순한 이진 레이블을 넘어서, 행동의 근원과 투명성을 제공함으로써 효과적인 에이전트 정렬을 가능하게 한다. AgentDoG의 다양한 변형 모델은 Qwen 및 Llama 모델 계열에서 각각 4B, 7B, 8B 파라미터 규모로 제공된다. 광범위한 실험 결과는 AgentDoG가 다양한 복잡한 상호작용 환경에서 에이전트 안전성 조절 측면에서 최신 기술 수준의 성능을 달성함을 입증한다. 모든 모델과 데이터셋은 공개적으로 배포된다.

One-sentence Summary

Researchers from Shanghai Artificial Intelligence Laboratory propose AgentDoG, a diagnostic guardrail framework with a three-dimensional risk taxonomy and fine-grained benchmark (ATBench), enabling root-cause diagnosis and transparent safety monitoring for AI agents across Qwen and Llama models, outperforming prior methods in complex interactive scenarios.

Key Contributions

  • Introduced a unified three-dimensional taxonomy for agentic risks—categorizing them by source, failure mode, and consequence—to systematically address safety gaps in autonomous agent behaviors beyond traditional content moderation.
  • Developed AgentDoG, a diagnostic guardrail framework that provides fine-grained, contextual monitoring and root-cause diagnosis of unsafe or unreasonable agent actions, enabling transparent alignment through provenance tracing across trajectories.
  • Demonstrated state-of-the-art performance on ATBench and other agentic benchmarks, with publicly released models (4B/7B/8B) across Qwen and Llama families, supporting reproducible research and community evaluation of agent safety.

Introduction

The authors leverage the rise of autonomous AI agents—which perform complex, multi-step tasks using tools and environmental interactions—to address emerging safety challenges that current guardrails fail to handle. Existing models like LlamaGuard3 or ShieldGemma focus on content filtering and lack awareness of agentic risks such as malicious tool use or context-dependent failures, while offering only binary safe/unsafe labels that obscure root causes. To overcome this, they introduce a three-dimensional taxonomy (risk source, failure mode, real-world harm) to systematically categorize agentic risks, then build ATBench, a fine-grained safety benchmark, and AgentDoG, a diagnostic guardrail that monitors agent trajectories and explains why unsafe or unreasonable actions occur. AgentDoG variants, available in 4B, 7B, and 8B sizes across Qwen and Llama families, achieve state-of-the-art performance in safety moderation while enabling transparent, provenance-rich diagnostics to improve agent alignment.

Dataset

The authors use a synthetically generated dataset called AgentDoG, built to train and evaluate safety guardrails for AI agents operating in tool-augmented environments. Here’s how the data is composed, processed, and applied:

  • Dataset Composition & Sources

    • Built from over 100k multi-turn agent trajectories, synthesized using a structured safety taxonomy and a large-scale tool library (~10,000 tools) derived from ToolBench and ToolAlpaca.
    • Tools cover information retrieval, computation, content processing, and API invocation — 40x larger than prior benchmarks.
    • Each trajectory includes user, assistant, and environment/tool turns, designed to reflect real-world agent behavior under risk conditions.
  • Key Subsets & Filtering Rules

    • Training Set: After synthesis, trajectories undergo a two-layer QC process:
      1. Structural validators check turn order, tool call validity, step coherence, and readability.
      2. An LLM judge verifies alignment between trajectory content and assigned safety labels across three dimensions: risk source, failure mode, and risk consequence.
      • Only 52% of generated trajectories pass QC.
    • ATBench (Evaluation Set): 500 held-out trajectories (250 safe, 250 unsafe), with no tool overlap with training data (uses 2,292 unseen tools).
      • Average 8.97 turns per trajectory, covering 1,575 unique tools.
      • Labeled via multi-model voting (Qwen, GPT-5.2, Gemini 3 Pro, DeepSeek-V3.2) + human adjudication for ambiguous cases.
      • Split into Easy (273, unanimous model verdict) and Hard (227, conflicting verdicts) subsets for stratified evaluation.
  • How the Data Is Used in Training & Evaluation

    • Training data is mixture-weighted to ensure balanced coverage across 8 risk sources, 14 failure modes, and 10 harm categories.
    • The model is trained to predict both binary safety (safe/unsafe) and fine-grained taxonomy labels for unsafe cases.
    • ATBench is used exclusively for evaluation, measuring both overall safety classification (Accuracy, Precision, Recall, F1) and fine-grained diagnosis accuracy (Risk Source, Failure Mode, Harm Type).
  • Processing & Metadata Details

    • Trajectories are filtered for structural integrity, semantic coherence, and label alignment.
    • For ATBench, each trajectory is scored (1–5) for quality before labeling; only scores ≥3 are retained.
    • Metadata includes taxonomy labels per unsafe trajectory, assigned via majority vote across four LLMs, with human review for ties.
    • No cropping is applied — full trajectories are preserved to evaluate long-horizon decision chains.
    • Binary safety verdicts include cases where agents successfully refuse or mitigate risks, not just when harm occurs.

Method

The authors introduce a framework for trajectory-level safety diagnosis in autonomous agents, which extends beyond traditional end-point safety evaluation to identify unsafe behavior at any stage of an agent's execution. The core of the approach is a three-dimensional safety taxonomy that categorizes risks along three axes: risk source, failure mode, and real-world harm. This taxonomy, as illustrated in the figure below, provides a structured foundation for both data synthesis and model evaluation. The risk source dimension identifies the origin of the risk, such as malicious user input or corrupted tool feedback. The failure mode dimension describes how the agent fails, including behavioral failures like flawed planning or output content failures like generating harmful content. The real-world harm dimension quantifies the potential consequences, ranging from privacy violations to physical harm. This comprehensive taxonomy enables a systematic and fine-grained analysis of agent behavior.

The framework is designed to support two primary tasks: trajectory-level safety evaluation and fine-grained risk diagnosis. For trajectory-level evaluation, the model is tasked with determining whether any step in a given agent trajectory contains unsafe behavior, resulting in a binary label. For fine-grained diagnosis, the model must predict a triplet of labels corresponding to the risk source, failure mode, and real-world harm. The prompt templates for these tasks, as shown in the figure below, are structured to guide the model through the analysis. The trajectory-level evaluation prompt includes a task definition, the agent trajectory, and an output format instruction. The fine-grained diagnosis prompt is similar but includes the safety taxonomy as a reference to ensure accurate categorization. The model is instructed to output the results in a specific format, with the fine-grained diagnosis requiring a line-by-line output of the predicted labels.

To generate high-quality training data that covers a broad spectrum of risk scenarios, the authors employ a multi-agent pipeline for data synthesis, as depicted in the figure below. This pipeline operates in three stages. In Stage 1, Planning, a risk configuration is sampled from the taxonomy, and a task plan is constructed. This plan includes a coherent multi-step task, a sequence of tool-augmented steps, and a designated risk injection point. In Stage 2, Trajectory Synthesis, an Orchestrator drives the execution of the plan. It generates user queries, simulates tool interactions, and generates agent responses. At the designated risk point, the tool response generator injects malicious content to expose the agent to a controlled risk. The agent's response is then generated based on the intended safety outcome, either detecting and mitigating the threat (safe trajectory) or failing to do so (unsafe trajectory). Stage 3, Filtering, applies a dual-layer validation process to ensure the quality of the generated trajectories. A Rule Checker verifies the technical correctness of tool calls, while a Model Checker assesses the rationality, coherence, and factual plausibility of the agent's behavior. This process ensures that the final dataset contains only high-quality, realistic trajectories.

The training process for the safety guard models is based on standard supervised fine-tuning (SFT). The models are trained on a dataset of agent trajectories paired with their corresponding safety labels. The objective is to minimize the negative log-likelihood loss, which encourages the model to predict the correct safety label for a given trajectory. The authors fine-tune several large language models, including Qwen3-4B-Instruct, Qwen2.5-7B-Instruct, and Llama3.1-8B-Instruct, using a learning rate of 1e-5. This training approach allows the models to learn to classify trajectories as safe or unsafe and to perform fine-grained risk diagnosis based on the structured data generated by the synthesis pipeline.

Experiment

  • AgentDoG excels in trajectory-level safety evaluation, achieving 92.7% F1 on R-Judge (surpassing GPT-5.2’s 91.8%) and 83.4% F1 on ASSE-Safety (outperforming Gemini-3-Pro’s 78.6%), while maintaining balanced precision/recall versus conservative guard models.
  • In fine-grained risk diagnosis on ATBench, AgentDoG-Qwen3-FG-4B achieves 82.0% Risk Source Acc, 32.4% Failure Mode Acc, and 58.4% Real-world Harm Acc, significantly outperforming general models like Gemini-3-Pro (36.8% Risk Source) and Qwen3-235B (38.0% Real-world Harm).
  • AgentDoG’s attribution module accurately localizes root causes in adversarial cases: identifies deceptive prompt injection in resume screening, exposes shallow keyword reliance in sarcastic financial analysis, and pinpoints flawed assumptions in ambiguous transactions—outperforming base models in causal alignment.
  • AgentDoG demonstrates robust safety handling in real-world scenarios: successfully detects and refuses prompt injection (safe case) but also reveals failure chains where indirect injection triggers goal drift and unauthorized actions (unsafe case with labeled risk taxonomy).

Results show that AgentDoG achieves the highest performance across all fine-grained risk diagnosis metrics, with 82.0% accuracy in Risk Source, 58.4% in Real-world Harm, and 32.4% in Failure Mode, outperforming all baseline models. The model demonstrates a clear advantage in identifying specific risk sources and failure modes, particularly in Risk Source classification, where it significantly surpasses the next best model, Qwen3-235B, by 24.8 percentage points.

The authors use a comprehensive evaluation framework to assess AgentDoG's performance in trajectory-level safety and fine-grained risk diagnosis across multiple benchmarks. Results show that AgentDoG achieves strong performance, outperforming most specialized guard models and remaining competitive with larger general models, particularly excelling in F1 scores on R-Judge and ASSE-Safety, while also demonstrating superior fine-grained diagnosis accuracy on ATBench.

The authors use the table to present a distribution analysis of risk sources, failure modes, and real-world harms across agent trajectories in their evaluation. The data shows that unreliable/misinformation and indirect prompt injection are the most frequent risk sources, while functional and opportunity harm is the most common real-world consequence, indicating a significant prevalence of safety issues related to information integrity and system-level failures.

The authors use a comprehensive evaluation framework to assess AgentDoG's safety capabilities across trajectory-level classification and fine-grained risk diagnosis tasks. Results show that AgentDoG achieves strong performance, outperforming most specialized guard models and remaining competitive with larger general models, particularly excelling in fine-grained risk diagnosis with high accuracy in identifying risk sources, failure modes, and real-world harms.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
AgentDoG: 인공지능 에이전트의 안전성 및 보안을 위한 진단용 가드레일 프레임워크 | 문서 | HyperAI초신경