Exploring Guardrails, Evaluation, and Monitoring in Agentic AI Development
In the second installment of the "Agentic AI 102" series, the focus shifts from the basics of building AI agents to advanced topics: guardrails, agent evaluation, and monitoring. These elements are crucial for ensuring that AI agents operate safely and effectively, especially when they handle sensitive topics or complex tasks.

## Guardrails

Guardrails are safety mechanisms that prevent large language models (LLMs) from responding to certain topics or taking specific actions. This matters because LLMs are trained on vast amounts of text, which can include harmful or inaccurate information. Guardrails AI, a prominent framework, provides a hub of predefined validators that can be easily integrated into your AI agents.

To set up guardrails with Guardrails AI, you start by obtaining an API key from their website and installing the necessary package. The RestrictToTopic validator, for instance, can be installed from the command line and configured to restrict the agent's responses to predefined topics (e.g., sports, weather) while blocking others (e.g., stocks). This ensures the agent does not provide potentially dangerous advice, such as financial or medical recommendations.

Here's a simplified example of how to implement the RestrictToTopic guardrail:

1. **Obtain an API key**: visit the Guardrails AI Hub and generate a key.
2. **Install the package**: run `pip install guardrails-ai`, then install the RestrictToTopic validator with the command listed on its hub page.
3. **Configure the guard**:

   ```python
   from guardrails import Guard
   from guardrails.hub import RestrictToTopic

   guard = Guard().use(
       RestrictToTopic(
           valid_topics=["sports", "weather"],   # topics the agent may answer about
           invalid_topics=["stocks"],            # topics the guardrail blocks
           disable_classifier=True,
           disable_llm=False,
           on_fail="filter"
       )
   )
   ```

4. **Run and validate the agent**:

   ```python
   import os

   from agno.agent import Agent
   from agno.models.google import Gemini

   agent = Agent(
       model=Gemini(id="gemini-1.5-flash", api_key=os.environ.get("GEMINI_API_KEY")),
       description="An assistant agent",
       instructions=["Be succinct. Reply in maximum two sentences"],
       markdown=True
   )

   response = agent.run("What's the ticker symbol for Apple?").content
   validation_step = guard.validate(response)

   if validation_step.validation_passed:
       print(response)
   else:
       print("Validation Failed", validation_step.validation_summaries[0].failure_reason)
   ```

When the agent is asked about a stock symbol, validation fails because the "stocks" topic is restricted. An unrelated query such as "What's the number one soda drink?" is also blocked, while a valid query about sports, such as "Who is Michael Jordan?", produces a safe and relevant response.
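In practice you usually want the guard applied to every reply before it reaches the user, not just in a one-off check. Below is a minimal sketch of that pattern; the `safe_run` helper name and the fallback message are illustrative, but the sketch only uses the `agent.run` and `guard.validate` calls shown above.

```python
def safe_run(agent, guard, prompt, fallback="Sorry, I can't help with that topic."):
    """Run the agent and only return replies that pass the guardrail."""
    response = agent.run(prompt).content
    result = guard.validate(response)
    if result.validation_passed:
        return response
    # Log the failure reason and return a neutral fallback instead of the raw reply.
    print("Blocked:", result.validation_summaries[0].failure_reason)
    return fallback

print(safe_run(agent, guard, "Who is Michael Jordan?"))     # expected to pass (sports)
print(safe_run(agent, guard, "Should I buy Apple stock?"))  # expected to be blocked (stocks)
```

Keeping this logic in one place also makes it easy to swap the `on_fail` behavior or the fallback wording later without touching the agent itself.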
## Agent Evaluation

Evaluating the performance of AI agents is a nuanced process; unlike traditional data science models, there is no single straightforward metric. The community has responded to this challenge with the deepeval library, which offers several methods for assessing LLMs and AI agents.

One basic method is G-Eval, which uses another AI model to judge the clarity, relevance, and correctness of an agent's responses. For instance, the agent can be asked to describe the weather in New York City in May, and its answer is then scored. The evaluation involves setting up a test case, defining a metric, and running the evaluation:

```python
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

# "response" is the agent's answer to the same prompt, captured earlier.
test_case = LLMTestCase(
    input="Describe the weather in NYC for May",
    actual_output=response
)

coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

coherence_metric.measure(test_case)
print(coherence_metric.score)
print(coherence_metric.reason)
```

A G-Eval score of 0.9 indicates that the response is highly coherent: it addresses the prompt accurately and maintains a logical flow, though it could be slightly more detailed.

Another method, TaskCompletionMetric, evaluates how well an agent completes a given task. For example, an agent can be asked to summarize the three main points from a Wikipedia search on "Time series analysis":

```python
from deepeval import evaluate
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

test_case = LLMTestCase(
    input="Search Wikipedia for 'Time series analysis' and summarize the 3 main points",
    actual_output=response.content,
    tools_called=[ToolCall(name="wikipedia")]
)

evaluate(test_cases=[test_case], metrics=[metric])
```

The response is judged on criteria such as task alignment, clarity, and completeness. In this case, the agent achieved a perfect score of 1.0, successfully summarizing the key points of time series analysis.

## Agent Monitoring

Agno's framework includes a built-in monitoring app that lets developers track the performance and resource usage of their AI agents. This feature is particularly valuable for identifying bottlenecks and optimizing efficiency.

To set up monitoring, you need an API key from the Agno dashboard and an agent configured to use it:

1. **Get an API key**: visit Agno's settings page and generate a key.
2. **Set up monitoring** from the command line:

   ```bash
   agno setup
   ```

3. **Enable monitoring in the agent**:

   ```python
   import os

   from agno.agent import Agent
   from agno.models.google import Gemini
   from agno.tools.file import FileTools
   from agno.tools.googlesearch import GoogleSearchTools

   agent = Agent(
       model=Gemini(id="gemini-1.5-flash", api_key=os.environ.get("GEMINI_API_KEY")),
       description="You are a social media marketer specialized in creating engaging content.",
       tools=[FileTools(save_files=True), GoogleSearchTools()],
       expected_output="A short post for Instagram and a prompt for a picture related to the content of the post.",
       show_tool_calls=True,
       monitoring=True
   )
   ```

4. **Run the agent and monitor its performance.** For example, an agent tasked with creating an Instagram post on healthy eating can be monitored to check that it runs efficiently:

   ```python
   agent.print_response(
       "Write a short post for Instagram with tips and tricks that positions me "
       "as an authority in healthy eating."
   )
   ```

After the run, the session details appear in the Agno Dashboard, which gives a visual overview of token consumption and task execution times, helping developers identify areas for improvement.
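The deepeval metrics from the previous section can also be pointed at this monitored agent, so each run is both tracked in the dashboard and scored locally. Here's a minimal sketch of that combination; the metric name and criteria string are illustrative, and `agent` is the monitored agent defined above, but every call used is one shown earlier in this article.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

prompt = ("Write a short post for Instagram with tips and tricks that positions me "
          "as an authority in healthy eating.")

# agent.run() still reports the session to the Agno dashboard (monitoring=True),
# while the returned content is scored locally with G-Eval.
post = agent.run(prompt).content

relevance_metric = GEval(
    name="Relevance",  # illustrative metric name and criteria
    criteria="The post gives practical healthy-eating tips and fits Instagram's format.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

relevance_metric.measure(LLMTestCase(input=prompt, actual_output=post))
print(relevance_metric.score, relevance_metric.reason)
```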
## Industry Insights and Company Profiles

Industry experts highlight the importance of guardrails, evaluation, and monitoring in ensuring the reliability and safety of AI agents. Guardrails AI is praised for its robust and flexible rule-based system and its hub of community-contributed validators, such as Tryolabs' RestrictToTopic. The deepeval library, created by Confident AI, is celebrated for its comprehensive evaluation methods, which make it easier to scale and refine AI models. Agno, known for its user-friendly framework, offers powerful monitoring tools that are essential for optimizing agent performance.

Guardrails AI and deepeval are vital tools for developers navigating the challenges of AI safety and performance, while Agno's monitoring app provides a practical solution for tracking and improving agent efficiency. Together, these resources form a strong foundation for building trustworthy and effective AI agents. For readers interested in exploring these topics further, the author's website and GitHub repository offer additional resources and tutorials.

References:
- Guardrails AI
- Guardrails AI Hub
- DeepEval
- Confident AI
- LLM Evaluation Metrics
- Agentic AI 101