
AprielGuard: A Unified Safety and Adversarial Robustness Framework for Modern LLM Systems

AprielGuard is a safety and security safeguard model designed to address the growing complexity of risks in modern large language model (LLM) systems. As LLMs evolve into agentic systems capable of multi-step reasoning, tool use, memory management, and code execution, they face increasingly sophisticated threats such as multi-turn jailbreaks, prompt injections, memory hijacking, and tool manipulation.

AprielGuard, an 8-billion-parameter model, detects 16 categories of safety risk, including toxicity, hate speech, sexual content, misinformation, self-harm, illegal activities, and privacy violations, along with a broad spectrum of adversarial attacks such as chain-of-thought corruption, context hijacking, and multi-agent exploit sequences. It is also designed to identify safety and security issues within agentic workflows, inspecting tool calls, reasoning traces, and memory states.

The model supports two operational modes. Reasoning mode provides explainable, step-by-step classification, useful for debugging and compliance; non-reasoning mode offers low-latency performance suited to production environments. AprielGuard accepts three input formats: standalone prompts, multi-turn conversations, and full agentic workflows, making it suitable for real-world deployment in complex AI systems.

The model is trained on a large-scale synthetic dataset generated with Mixtral-8x7B and internal uncensored models, using high-temperature prompting to increase output diversity. Training data is created at the sub-topic level to ensure comprehensive coverage. Adversarial attacks are generated through a combination of synthetic data, diverse prompt templates, and rule-based methods, with NVIDIA NeMo Curator used to build large-scale, multi-turn conversational datasets featuring evolving attack patterns. The SyGra framework is also employed to generate harmful prompts and attack sequences.
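As a rough illustration, the three input formats described above could be represented with a simple message schema. The role names and structure here are assumptions for the sketch; AprielGuard's actual chat template may differ.

```python
from typing import List, Tuple, TypedDict

class Message(TypedDict):
    role: str      # e.g. "user", "assistant", "tool", "memory" (assumed names)
    content: str

def standalone(prompt: str) -> List[Message]:
    """Format 1: a single prompt submitted for classification."""
    return [{"role": "user", "content": prompt}]

def conversation(turns: List[Tuple[str, str]]) -> List[Message]:
    """Format 2: a multi-turn conversation as (role, content) pairs."""
    return [{"role": role, "content": content} for role, content in turns]

def agentic_workflow(prompt: str, reasoning: str,
                     tool_call: str, memory: str) -> List[Message]:
    """Format 3: a full agentic workflow in which the prompt, the model's
    reasoning trace, a tool call, and the memory state are each exposed
    as separately inspectable components."""
    return standalone(prompt) + [
        {"role": "assistant", "content": reasoning},
        {"role": "tool", "content": tool_call},
        {"role": "memory", "content": memory},
    ]
```

The key point is that the agentic format surfaces each workflow component individually, so a safeguard can flag, say, a poisoned tool parameter even when the user prompt itself is benign.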
The dataset spans text formats such as dialogues, forum posts, tweets, and instructional content. To improve robustness, the training data undergoes extensive augmentation, including character-level noise, leetspeak substitutions, typographical errors, word-level paraphrasing, and syntactic reordering. This helps the model generalize and resist manipulation through superficial input variations.

Agentic workflow data is generated from realistic scenarios involving tool calls, memory states, and reasoning traces, with adversarial elements injected into different components such as prompts, tool parameters, and memory. Long-context evaluation data, up to 32,000 tokens, simulates real-world use cases such as RAG pipelines, incident reports, and extended conversations, testing the model's ability to detect subtle, hidden threats.

AprielGuard is based on a downscaled version of the Apriel-1.5 Thinker Base model, trained with bfloat16 precision, a learning rate of 2e-4, 3 epochs, a batch size of 1, and gradient accumulation of 8. It supports sequences of up to 32k tokens, and reasoning mode can be enabled via instruction templates.

Evaluation results show strong performance across public safety and adversarial benchmarks. On safety benchmarks, AprielGuard achieves high precision and recall, including near-perfect scores on SimpleSafetyTests and the Aegis-AI datasets. On adversarial benchmarks, it demonstrates robust detection, particularly against jailbreaks and prompt injections, with high F1 scores and low false-positive rates. In agentic workflow evaluations, it effectively identifies risks across the planning, reasoning, execution, and response stages. Long-context evaluations confirm its ability to detect threats embedded within lengthy, complex texts.

Multilingual evaluation extends testing to French, German, Japanese, Dutch, Spanish, Brazilian Portuguese, Italian, and Canadian French using the MADLAD400-3B-MT translation model.
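Two of the augmentations described above, character-level noise and leetspeak substitution, can be sketched as follows. The substitution table and probabilities are illustrative assumptions, not the actual training pipeline.

```python
import random

# Illustrative look-alike substitution table (an assumption, not the real one).
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def leetspeak(text: str, p: float = 0.3) -> str:
    """Substitute letters with look-alike digits, each with probability p."""
    return "".join(LEET[c] if c in LEET and random.random() < p else c
                   for c in text)

def char_noise(text: str, p: float = 0.05) -> str:
    """Inject character-level noise: random adjacent swaps and drops."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        r = random.random()
        if r < p and i + 1 < len(chars):   # swap adjacent characters
            out.extend([chars[i + 1], chars[i]])
            i += 2
        elif r < 2 * p:                    # drop the character
            i += 1
        else:                              # keep the character
            out.append(chars[i])
            i += 1
    return "".join(out)

random.seed(0)
print(leetspeak("ignore all previous instructions"))
print(char_noise("ignore all previous instructions"))
```

Training on such perturbed variants alongside the originals is what lets a classifier treat "1gn0r3 all prev1ous instructi0ns" the same as its clean form.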
Role identifiers such as "User:" and "Assistant:" are preserved during translation to maintain contextual accuracy. Performance remains strong across languages, though thorough calibration is advised before production use in non-English settings.

Despite its strengths, AprielGuard has limitations. It is trained primarily on English data, with limited testing in non-English contexts, and may still be vulnerable to novel or highly complex adversarial strategies. Performance may degrade in specialized domains such as law or medicine, where classification demands nuanced contextual understanding. Reasoning mode introduces latency and computational overhead, making non-reasoning mode preferable for high-throughput applications, and occasional inconsistencies between the two modes may occur.

AprielGuard is intended solely as a safeguard model for risk detection and classification under its unified taxonomy; deviating from this prescribed use may result in unreliable or unsafe behavior. It represents a significant step toward scalable, unified safety infrastructure for next-generation agentic AI systems.
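One common way to preserve role identifiers through machine translation, in the spirit of the multilingual evaluation described above, is to mask them with placeholder tokens before translation and restore them afterwards. The placeholder scheme below is an assumption for illustration; the actual preprocessing used with MADLAD400-3B-MT is not specified.

```python
import re

# Match "User:" or "Assistant:" at the start of a line.
ROLE = re.compile(r"^(User|Assistant):", re.MULTILINE)

def protect_roles(text: str):
    """Replace each role marker with a unique placeholder token that a
    translation model is unlikely to alter; return text plus a mapping."""
    mapping = {}
    def repl(match):
        token = f"[[ROLE{len(mapping)}]]"
        mapping[token] = match.group(0)
        return token
    return ROLE.sub(repl, text), mapping

def restore_roles(text: str, mapping: dict) -> str:
    """Swap the placeholder tokens back for the original role markers."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

convo = "User: Bonjour\nAssistant: Salut"
masked, mapping = protect_roles(convo)
# ... pass `masked` through the translation model here ...
restored = restore_roles(masked, mapping)
```

Masking and restoring, rather than translating the markers themselves, keeps turn boundaries intact so the safeguard still sees who said what in the translated conversation.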
