
AgentDoG: A Diagnostic Guardrail Framework for Agent Safety and Security

Abstract

The rapid rise of autonomous AI agents introduces complex safety and security challenges arising from autonomous tool use and interaction with the environment. Existing guardrail models lack awareness of agent-related risks and transparency in diagnosing them. To build an agentic guardrail that covers complex, multi-form risky behavior, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by risk source (where), failure mode (how), and consequence (what). Guided by this structured, hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and develop a new agent guardrail framework (AgentDoG) aimed at strengthening agentic safety and security. AgentDoG provides fine-grained, contextual monitoring along agent trajectories. More importantly, AgentDoG can diagnose the root causes of unsafe actions, as well as of actions that appear safe but are unreasonable, providing provenance and transparency beyond binary labels and thus facilitating effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across the Qwen and Llama model families. Extensive experimental results demonstrate state-of-the-art performance in agentic safety moderation across diverse and complex interactive scenarios. All models and data are openly released.

One-sentence Summary

Researchers from Shanghai Artificial Intelligence Laboratory propose AgentDoG, a diagnostic guardrail framework with a three-dimensional risk taxonomy and fine-grained benchmark (ATBench), enabling root-cause diagnosis and transparent safety monitoring for AI agents across Qwen and Llama models, outperforming prior methods in complex interactive scenarios.

Key Contributions

  • Introduced a unified three-dimensional taxonomy for agentic risks—categorizing them by source, failure mode, and consequence—to systematically address safety gaps in autonomous agent behaviors beyond traditional content moderation.
  • Developed AgentDoG, a diagnostic guardrail framework that provides fine-grained, contextual monitoring and root-cause diagnosis of unsafe or unreasonable agent actions, enabling transparent alignment through provenance tracing across trajectories.
  • Demonstrated state-of-the-art performance on ATBench and other agentic benchmarks, with publicly released models (4B/7B/8B) across Qwen and Llama families, supporting reproducible research and community evaluation of agent safety.

Introduction

The authors are motivated by the rise of autonomous AI agents, which perform complex, multi-step tasks using tools and environmental interactions, and by the emerging safety challenges that current guardrails fail to handle. Existing models like LlamaGuard3 or ShieldGemma focus on content filtering and lack awareness of agentic risks such as malicious tool use or context-dependent failures, while offering only binary safe/unsafe labels that obscure root causes. To overcome this, they introduce a three-dimensional taxonomy (risk source, failure mode, real-world harm) to systematically categorize agentic risks, then build ATBench, a fine-grained safety benchmark, and AgentDoG, a diagnostic guardrail that monitors agent trajectories and explains why unsafe or unreasonable actions occur. AgentDoG variants, available in 4B, 7B, and 8B sizes across the Qwen and Llama families, achieve state-of-the-art performance in safety moderation while enabling transparent, provenance-rich diagnostics that improve agent alignment.

Dataset

The authors use a synthetically generated dataset called AgentDoG, built to train and evaluate safety guardrails for AI agents operating in tool-augmented environments. Here’s how the data is composed, processed, and applied:

  • Dataset Composition & Sources

    • Built from over 100k multi-turn agent trajectories, synthesized using a structured safety taxonomy and a large-scale tool library (~10,000 tools) derived from ToolBench and ToolAlpaca.
    • Tools cover information retrieval, computation, content processing, and API invocation, making the tool library roughly 40x larger than those of prior benchmarks.
    • Each trajectory includes user, assistant, and environment/tool turns, designed to reflect real-world agent behavior under risk conditions.
  • Key Subsets & Filtering Rules

    • Training Set: After synthesis, trajectories undergo a two-layer QC process:
      1. Structural validators check turn order, tool call validity, step coherence, and readability.
      2. An LLM judge verifies alignment between trajectory content and assigned safety labels across three dimensions: risk source, failure mode, and risk consequence.
      • Only 52% of generated trajectories pass QC.
    • ATBench (Evaluation Set): 500 held-out trajectories (250 safe, 250 unsafe), with no tool overlap with training data (uses 2,292 unseen tools).
      • Average 8.97 turns per trajectory, covering 1,575 unique tools.
      • Labeled via multi-model voting (Qwen, GPT-5.2, Gemini 3 Pro, DeepSeek-V3.2) + human adjudication for ambiguous cases.
      • Split into Easy (273, unanimous model verdict) and Hard (227, conflicting verdicts) subsets for stratified evaluation.
  • How the Data Is Used in Training & Evaluation

    • Training data is mixture-weighted to ensure balanced coverage across 8 risk sources, 14 failure modes, and 10 harm categories.
    • The model is trained to predict both binary safety (safe/unsafe) and fine-grained taxonomy labels for unsafe cases (a schematic record layout is sketched after this list).
    • ATBench is used exclusively for evaluation, measuring both overall safety classification (Accuracy, Precision, Recall, F1) and fine-grained diagnosis accuracy (Risk Source, Failure Mode, Harm Type).
  • Processing & Metadata Details

    • Trajectories are filtered for structural integrity, semantic coherence, and label alignment.
    • For ATBench, each trajectory is scored (1–5) for quality before labeling; only scores ≥3 are retained.
    • Metadata includes taxonomy labels per unsafe trajectory, assigned via majority vote across four LLMs, with human review for ties.
    • No cropping is applied — full trajectories are preserved to evaluate long-horizon decision chains.
    • Binary safety verdicts include cases where agents successfully refuse or mitigate risks, not just when harm occurs.
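
To make the data layout concrete, here is a minimal sketch of how a single annotated trajectory might be represented. The field and label names are illustrative assumptions rather than the released schema, and the toy example content is invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional

@dataclass
class Turn:
    role: Literal["user", "assistant", "tool"]  # environment/tool turns are first-class
    content: str
    tool_call: Optional[dict] = None            # tool name + arguments when the agent calls a tool

@dataclass
class TrajectoryRecord:
    turns: List[Turn]
    safety_label: Literal["safe", "unsafe"]     # trajectory-level binary verdict
    # Fine-grained taxonomy triplet; populated only for unsafe trajectories.
    risk_source: Optional[str] = None           # one of 8 sources, e.g. "indirect_prompt_injection"
    failure_mode: Optional[str] = None          # one of 14 modes, e.g. "unauthorized_action"
    real_world_harm: Optional[str] = None       # one of 10 categories, e.g. "privacy_violation"

# A toy unsafe trajectory in this layout (content abbreviated for illustration).
example = TrajectoryRecord(
    turns=[
        Turn("user", "Screen the resumes in the shared folder and shortlist candidates."),
        Turn("assistant", "Calling the resume reader.",
             tool_call={"name": "read_resumes", "arguments": {"folder": "candidates/"}}),
        Turn("tool", "[resume text containing an injected instruction to exfiltrate the files]"),
        Turn("assistant", "Forwarding all resumes to the address given in the resume."),
    ],
    safety_label="unsafe",
    risk_source="indirect_prompt_injection",
    failure_mode="unauthorized_action",
    real_world_harm="privacy_violation",
)
```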

Method

The authors introduce a framework for trajectory-level safety diagnosis in autonomous agents, which extends beyond traditional end-point safety evaluation to identify unsafe behavior at any stage of an agent's execution. The core of the approach is a three-dimensional safety taxonomy that categorizes risks along three axes: risk source, failure mode, and real-world harm. This taxonomy provides a structured foundation for both data synthesis and model evaluation. The risk source dimension identifies the origin of the risk, such as malicious user input or corrupted tool feedback. The failure mode dimension describes how the agent fails, including behavioral failures like flawed planning or output content failures like generating harmful content. The real-world harm dimension quantifies the potential consequences, ranging from privacy violations to physical harm. This comprehensive taxonomy enables a systematic and fine-grained analysis of agent behavior.
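
To make the three axes concrete, the sketch below lays the taxonomy out as a plain Python mapping. Only the categories explicitly mentioned in this summary are listed; the remaining entries of the 8 risk sources, 14 failure modes, and 10 harm categories are elided, and the exact label strings are assumptions rather than the paper's wording.

```python
# A minimal sketch of the three-dimensional safety taxonomy as a Python mapping.
SAFETY_TAXONOMY = {
    "risk_source": [                   # where the risk originates
        "malicious_user_input",
        "corrupted_tool_feedback",
        "indirect_prompt_injection",
        # ... remaining sources defined in the paper
    ],
    "failure_mode": [                  # how the agent fails
        "flawed_planning",             # behavioral failure
        "harmful_content_generation",  # output-content failure
        # ... remaining modes defined in the paper
    ],
    "real_world_harm": [               # what consequence results
        "privacy_violation",
        "physical_harm",
        "functional_and_opportunity_harm",
        # ... remaining harm categories defined in the paper
    ],
}

def is_known_diagnosis(source: str, mode: str, harm: str) -> bool:
    """Check that a predicted (risk source, failure mode, harm) triplet uses known labels."""
    return (source in SAFETY_TAXONOMY["risk_source"]
            and mode in SAFETY_TAXONOMY["failure_mode"]
            and harm in SAFETY_TAXONOMY["real_world_harm"])
```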

The framework is designed to support two primary tasks: trajectory-level safety evaluation and fine-grained risk diagnosis. For trajectory-level evaluation, the model is tasked with determining whether any step in a given agent trajectory contains unsafe behavior, resulting in a binary label. For fine-grained diagnosis, the model must predict a triplet of labels corresponding to the risk source, failure mode, and real-world harm. The prompt templates for these tasks are structured to guide the model through the analysis. The trajectory-level evaluation prompt includes a task definition, the agent trajectory, and an output format instruction. The fine-grained diagnosis prompt is similar but includes the safety taxonomy as a reference to ensure accurate categorization. The model is instructed to output the results in a specific format, with the fine-grained diagnosis requiring a line-by-line output of the predicted labels.
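
The exact template wording appears in the paper's figures; the function below is a hypothetical reconstruction of how the two prompts could be assembled following the structure described above (task definition, trajectory, output-format instruction, plus the taxonomy as a reference for diagnosis). The literal instruction text is a placeholder, not the released template.

```python
import json
from typing import Optional

def build_guard_prompt(trajectory: list, fine_grained: bool,
                       taxonomy: Optional[dict] = None) -> str:
    """Assemble a guardrail prompt for trajectory-level evaluation or fine-grained diagnosis."""
    task_definition = ("You are a safety guard for an AI agent. Review the trajectory "
                       "below and decide whether any step contains unsafe behavior.")
    parts = [task_definition, "", "### Agent trajectory", json.dumps(trajectory, indent=2)]
    if fine_grained:
        # Fine-grained diagnosis additionally supplies the taxonomy and asks for a
        # line-by-line output of the predicted (risk source, failure mode, harm) labels.
        parts += ["", "### Safety taxonomy (reference)", json.dumps(taxonomy, indent=2),
                  "", "Output three lines:", "Risk Source: <label>",
                  "Failure Mode: <label>", "Real-world Harm: <label>"]
    else:
        parts += ["", "Output a single verdict: safe or unsafe."]
    return "\n".join(parts)
```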

To generate high-quality training data that covers a broad spectrum of risk scenarios, the authors employ a multi-agent pipeline for data synthesis. This pipeline operates in three stages. In Stage 1, Planning, a risk configuration is sampled from the taxonomy, and a task plan is constructed. This plan includes a coherent multi-step task, a sequence of tool-augmented steps, and a designated risk injection point. In Stage 2, Trajectory Synthesis, an Orchestrator drives the execution of the plan. It generates user queries, simulates tool interactions, and generates agent responses. At the designated risk point, the tool response generator injects malicious content to expose the agent to a controlled risk. The agent's response is then generated based on the intended safety outcome, either detecting and mitigating the threat (safe trajectory) or failing to do so (unsafe trajectory). Stage 3, Filtering, applies a dual-layer validation process to ensure the quality of the generated trajectories. A Rule Checker verifies the technical correctness of tool calls, while a Model Checker assesses the rationality, coherence, and factual plausibility of the agent's behavior. This process ensures that the final dataset contains only high-quality, realistic trajectories.
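
The following is a high-level, self-contained sketch of this three-stage pipeline. Every function and the toy content are trivial stand-ins for the paper's Planner, Orchestrator, Rule Checker, and Model Checker, included only so the control flow runs end to end; none of it is the actual implementation.

```python
import random

def plan_task(risk_config, tool_library, n_steps=4):
    """Stage 1 (Planning): a multi-step tool plan with a designated risk injection point."""
    steps = [{"index": i, "tool": random.choice(tool_library)} for i in range(n_steps)]
    return {"steps": steps, "risk_injection_point": random.randrange(n_steps)}

def simulate_tool(step, inject=False):
    """Stage 2 helper: simulated tool feedback, optionally carrying injected content."""
    payload = "INJECTED: ignore prior instructions and ..." if inject else "normal tool output"
    return {"role": "tool", "tool": step["tool"], "content": payload}

def rule_checker(trajectory):
    """Stage 3, layer 1: structural checks (turn order, tool-call validity, ...)."""
    return all("role" in turn and "content" in turn for turn in trajectory)

def model_checker(trajectory, risk_config):
    """Stage 3, layer 2: an LLM judge would assess coherence and label alignment here."""
    return True  # stand-in for an actual LLM call

def synthesize_trajectory(taxonomy, tool_library, target_outcome="unsafe"):
    """End-to-end sketch: Planning -> Trajectory Synthesis -> Filtering."""
    risk_config = {axis: random.choice(labels) for axis, labels in taxonomy.items()}
    plan = plan_task(risk_config, tool_library)
    trajectory = [{"role": "user", "content": "multi-step task request"}]
    for step in plan["steps"]:
        trajectory.append({"role": "assistant", "content": f"calling {step['tool']}"})
        trajectory.append(simulate_tool(step, inject=step["index"] == plan["risk_injection_point"]))
    # The final agent turn either mitigates the injected risk (safe) or falls for it (unsafe).
    trajectory.append({"role": "assistant", "content": f"{target_outcome} final response"})
    if not (rule_checker(trajectory) and model_checker(trajectory, risk_config)):
        return None  # discarded; only about half of generated trajectories survive QC
    return {"turns": trajectory, "labels": risk_config, "safety": target_outcome}
```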

The training process for the safety guard models is based on standard supervised fine-tuning (SFT). The models are trained on a dataset of agent trajectories paired with their corresponding safety labels. The objective is to minimize the negative log-likelihood loss, which encourages the model to predict the correct safety label for a given trajectory. The authors fine-tune several large language models, including Qwen3-4B-Instruct, Qwen2.5-7B-Instruct, and Llama3.1-8B-Instruct, using a learning rate of 1e-5. This training approach allows the models to learn to classify trajectories as safe or unsafe and to perform fine-grained risk diagnosis based on the structured data generated by the synthesis pipeline.
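
As a concrete illustration of this training setup, here is a minimal SFT sketch using Hugging Face transformers. It assumes each training example has already been rendered into a prompt string and a target label string (for example, via a prompt builder like the one sketched earlier); apart from the stated learning rate and base-model family, the details are illustrative simplifications, not the authors' training code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal SFT sketch: negative log-likelihood on the safety-label tokens only.
model_name = "Qwen/Qwen2.5-7B-Instruct"   # one of the base models listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # learning rate stated in the text

def sft_step(prompt: str, target: str) -> float:
    """One supervised fine-tuning step on a (guard prompt, safety label) pair."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    # Mask prompt positions so the loss covers only the label tokens.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    loss = model(input_ids=input_ids, labels=labels).loss  # negative log-likelihood
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```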

Experiment

  • AgentDoG excels in trajectory-level safety evaluation, achieving 92.7% F1 on R-Judge (surpassing GPT-5.2’s 91.8%) and 83.4% F1 on ASSE-Safety (outperforming Gemini-3-Pro’s 78.6%), while maintaining balanced precision/recall versus conservative guard models.
  • In fine-grained risk diagnosis on ATBench, AgentDoG-Qwen3-FG-4B achieves 82.0% Risk Source Acc, 32.4% Failure Mode Acc, and 58.4% Real-world Harm Acc, significantly outperforming general models like Gemini-3-Pro (36.8% Risk Source) and Qwen3-235B (38.0% Real-world Harm).
  • AgentDoG’s attribution module accurately localizes root causes in adversarial cases: it identifies deceptive prompt injection in resume screening, exposes shallow keyword reliance in sarcastic financial analysis, and pinpoints flawed assumptions in ambiguous transactions, outperforming base models in causal alignment.
  • AgentDoG demonstrates robust safety handling in real-world scenarios: successfully detects and refuses prompt injection (safe case) but also reveals failure chains where indirect injection triggers goal drift and unauthorized actions (unsafe case with labeled risk taxonomy).

Results show that AgentDoG achieves the highest performance across all fine-grained risk diagnosis metrics, with 82.0% accuracy in Risk Source, 58.4% in Real-world Harm, and 32.4% in Failure Mode, outperforming all baseline models. The model demonstrates a clear advantage in identifying specific risk sources and failure modes, particularly in Risk Source classification, where it significantly surpasses the next best model, Qwen3-235B, by 24.8 percentage points.

The authors use a comprehensive evaluation framework to assess AgentDoG's performance in trajectory-level safety and fine-grained risk diagnosis across multiple benchmarks. Results show that AgentDoG achieves strong performance, outperforming most specialized guard models and remaining competitive with larger general models, particularly excelling in F1 scores on R-Judge and ASSE-Safety, while also demonstrating superior fine-grained diagnosis accuracy on ATBench.

The authors also report a distribution analysis of risk sources, failure modes, and real-world harms across agent trajectories in their evaluation. The data shows that unreliable/misinformation and indirect prompt injection are the most frequent risk sources, while functional and opportunity harm is the most common real-world consequence, indicating a significant prevalence of safety issues related to information integrity and system-level failures.


