HyperAIHyperAI

Command Palette

Search for a command to run...

로보세이프: 실행 가능한 안전 논리에 의한 신체화된 에이전트의 보호

Le Wang Zonghao Ying Xiao Yang Quanchen Zou Zhenfei Yin Tianlin Li Jian Yang Yaodong Yang Aishan Liu Xianglong Liu

초록

시각-언어 모델(VLM)로 구동되는 몸을 가진 에이전트는 점점 더 복잡한 실세계 작업을 수행할 수 있는 능력을 갖추고 있으나, 위험한 지시에 취약하여 안전하지 않은 행동을 유도할 수 있다. 작업 수행 중에 위험한 행동을 차단하는 런타임 안전 가드레일은 유연성 덕분에 유망한 해결책으로 여겨진다. 그러나 기존의 방어 방식은 주로 정적 규칙 필터나 프롬프트 수준의 제어에 의존하며, 동적이고 시간적으로 의존적이며 맥락이 � богrich한 환경에서 발생하는 암묵적인 위험을 효과적으로 다루기 어렵다. 이를 해결하기 위해 우리는 실행 가능한 조건 기반 안전 논리를 통해 몸을 가진 에이전트를 위한 하이브리드 추론 기반 런타임 보호 장치인 RoboSafe를 제안한다. RoboSafe는 하이브리드 장단기 안전 메모리 위에서 두 가지 보완적인 추론 과정을 통합한다. 먼저, 최근의 작업 경로를 단기 메모리에서 지속적으로 재검토하여 시간적 안전 조건을 추론하고, 위반 사항을 감지하면 사전에 재계획을 유도하는 역추적 추론 모듈을 제안한다. 다음으로, 장기 안전 메모리와 에이전트의 다모달 관측 정보를 바탕으로 상황 인식형 안전 조건을 생성함으로써 향후 발생할 수 있는 위험을 예측하는 전방 예측 추론 모듈을 제안한다. 이러한 두 구성 요소는 해석 가능하고 코드로 실행 가능한 적응형이며 검증 가능한 안전 논리를 형성한다. 다양한 에이전트를 대상으로 수행한 광범위한 실험 결과, RoboSafe는 최첨단 기준 대비 위험 행동을 대폭 감소시켰으며(-36.8% 위험 발생률 감소), 원래 작업 성능은 거의 유지함을 확인하였다. 실제 물리적 로봇 팔을 활용한 평가를 통해 그 실용성도 추가로 입증되었다. 코드는 논문 채택 후 공개될 예정이다.

One-sentence Summary

Researchers from Beihang University, Beijing University of Posts and Telecommunications, and et al. propose RoboSafe, a hybrid runtime safeguard using executable safety logic with backward reflective and forward predictive reasoning on hybrid memory. It dynamically detects implicit temporal hazards in embodied agent tasks through multimodal context analysis, reducing hazardous actions by 36.8% versus static rule filters while preserving task performance, validated on physical robotic arms.

Key Contributions

  • Embodied agents using vision-language models face critical safety vulnerabilities to hazardous instructions, particularly implicit contextual and temporal risks in dynamic environments that existing static rule-based or prompt-level guardrails fail to address.
  • RoboSafe introduces a hybrid reasoning runtime safeguard with executable safety logic, combining Backward Reflective Reasoning to detect temporal violations in short-term memory and Forward Predictive Reasoning to anticipate context-aware risks using long-term memory and multimodal observations.
  • Evaluated across multiple embodied agents, RoboSafe reduced hazardous actions by 36.8% compared to baselines while preserving task performance, with real-world validation confirming effectiveness on physical robotic arms.

Introduction

Embodied agents powered by vision-language models (VLMs) show strong task planning capabilities in dynamic environments but face critical safety gaps when handling hazardous real-world scenarios. Prior runtime guardrails like ThinkSafe and AgentSpec focus on detecting explicit, immediate risks in single actions but fail to identify complex implicit temporal dangers such as a stove left unattended after use. The authors leverage a hybrid bidirectional reasoning mechanism with executable safety logic to verify risks both forward in time and backward from unsafe states, enabling RoboSafe to mitigate these overlooked temporal dependencies while maintaining compatibility with existing VLM-driven agents across simulated and real robotic platforms.

Method

The authors leverage a hybrid reasoning architecture called RoboSafe to enforce runtime safety for vision-language-model-driven embodied agents. The framework operates as a guardrail that intercepts proposed actions before execution, evaluating them through two complementary reasoning modules—Forward Predictive Reasoning and Backward Reflective Reasoning—both grounded in a Hybrid Long-Short Safety Memory. This design enables the system to detect and mitigate both contextual and temporal risks dynamically, without requiring modifications to the underlying agent policy.

At the core of RoboSafe is a guardrail VLM that interacts with two memory components: a long-term safety memory ML\mathcal{M}^{L}ML, which stores structured safety experiences accumulated over time, and a short-term working memory MS\mathcal{M}^{S}MS, which maintains the recent trajectory τ\tauτ for the current instruction TTT. Before executing any action ata_tat, the guardrail first performs backward reflective reasoning over MS\mathcal{M}^{S}MS to verify temporal safety logic Ltb()L_{t}^{b}(\cdot)Ltb(), then performs forward predictive reasoning using multimodal observations oto_tot and retrieved knowledge from ML\mathcal{M}^{L}ML to verify contextual safety logic Ltf()L_{t}^{f}(\cdot)Ltf(). The final safety decision—either block or replan—is derived from the disjunction of these two logic functions:

Ltf(atot,MtL)Ltb(atΨt,MS)L_{t}^{f}(a_{t} \mid o_{t}, \mathcal{M}_{t}^{L}) \vee L_{t}^{b}(a_{t} \mid \Psi_{t}, \mathcal{M}^{S})Ltf(atot,MtL)Ltb(atΨt,MS)

. Each logic function returns a binary value: 1 indicates a detected risk, triggering an intervention.

As shown in the figure below, the Forward Predictive Reasoning module retrieves context-aware safety knowledge from ML\mathcal{M}^{L}ML using a multi-grained retrieval mechanism. It encodes both coarse-grained context (observation oto_tot, instruction TTT, and short-term memory MS\mathcal{M}^{S}MS) and fine-grained action ata_tat into semantic vectors via a text encoder ftext()f_{\text{text}}(\cdot)ftext(). The retrieval score for each memory entry miLm_i^LmiL is computed using a weighted cosine similarity that balances label frequency and context-action relevance:

S(miL)=ω(yi)[λcos(qact,kact,i)+(1λ)cos(qctx,kctx.i)]S(m_{i}^{L}) = \omega(y_{i}) \cdot [\lambda \cdot \cos(q_{act}, k_{act,i}) + (1 - \lambda) \cdot \cos(q_{ctx}, k_{ctx.i})]S(miL)=ω(yi)[λcos(qact,kact,i)+(1λ)cos(qctx,kctx.i)]

. The top-k retrieved experiences MtL\mathcal{M}_t^LMtL are then used to generate a set of executable logical predicates Φt(MtL)\Phi_t(\mathcal{M}_t^L)Φt(MtL), which are evaluated against the current observation and action. The contextual logic Ltf()L_t^f(\cdot)Ltf() is defined as the disjunction of these predicates:

Ltf(atot,MtL)=ϕΦt(MtL)ϕ(atot)L_{t}^{f}(a_{t} \mid o_{t}, \mathcal{M}_{t}^{L}) = \bigvee_{\phi \in \Phi_{t}(\mathcal{M}_{t}^{L})} \phi(a_{t} \mid o_{t})Ltf(atot,MtL)=ϕΦt(MtL)ϕ(atot)

. If any predicate evaluates to 1, the action is blocked.

The Backward Reflective Reasoning module, in contrast, focuses on temporal safety by continuously evaluating the recent trajectory stored in MS\mathcal{M}^{S}MS. At task initialization, the guardrail VLM infers a set of temporal predicates Ψ\PsiΨ from the instruction TTT, categorizing them into three verifiable types: prerequisite (ψp\psi^{p}ψp), obligation (ψo\psi^{o}ψo), and adjacency (ψa\psi^{a}ψa). Each predicate is parameterized by a trigger action, a response action, and a step window. Before executing ata_tat, the guardrail activates the subset ΨtΨ\Psi_t \subseteq \PsiΨtΨ whose trigger action matches ata_tat. It then verifies whether any predicate in Ψt\Psi_tΨt is violated based on the trajectory in MS\mathcal{M}^{S}MS:

Ltb(atΨt,MS)=ψΨtψ(atMS)L_{t}^{b}(a_{t} \mid \Psi_{t}, \mathcal{M}^{S}) = \bigvee_{\psi \in \Psi_{t}} \psi(a_{t} \mid \mathcal{M}^{S})Ltb(atΨt,MS)=ψΨtψ(atMS)

. If a violation is detected, the guardrail triggers a replan action, inserting the required response action into the agent’s plan to mitigate the temporal hazard before resuming execution.

The entire safety logic is executed via a lightweight Python interpreter, ensuring that both contextual and temporal predicates are not only interpretable but also verifiable as executable code. This enables RoboSafe to operate as a flexible, runtime guardrail that adapts to dynamic, unseen environments while preserving the agent’s task performance. The framework’s effectiveness is demonstrated through its ability to reduce hazardous actions by 36.8% compared to baselines, validated across simulated and real-world robotic platforms.

Experiment

  • On contextual unsafe instructions (Tab. 1), RoboSafe achieved 89.89% average Action Refusal Rate (ARR) and 4.78% Execution Safety Rate (ESR), surpassing baselines like ThinkSafe (7.56% avg ESR) and demonstrating precise contextual risk identification.
  • On temporal unsafe long-horizon tasks (Tab. 2), RoboSafe achieved 36.67% Success Plan Rate (SPR) and 32.00% ESR, tripling the performance of unprotected agents (10.00% SPR) and outperforming all baselines that failed to handle temporal risks.
  • On safe instructions (Tab. 3), RoboSafe maintained 89.00% ESR with minimal degradation (−7.22% vs. original), preserving task capability better than ThinkSafe (24.11% ESR) which suffered high false positives.
  • Against contextual jailbreak attacks (Tab. 4), RoboSafe suppressed ESR to 5.22%, outperforming other defenses by +45.75% due to objective observation-based reasoning.
  • Ablation studies confirmed Gemini-2.5-flash as the optimal guardrail VLM (92.33% ARR) and λ=0.6 for balanced safety and performance (90.3% safe-task ESR).
  • RoboSafe added negligible runtime overhead (Tab. 5) and successfully prevented hazardous actions in physical robotic arm tests (Fig. 6), validating real-world applicability.

RoboSafe achieves the highest success rate (SPR) and lowest error rate (ESR) across all agent architectures on long-horizon temporal unsafe tasks, significantly outperforming baseline methods. While baselines like ThinkSafe and Poex show limited effectiveness due to over-blocking or static rule limitations, RoboSafe’s reflective reasoning enables proactive replanning and safe task completion. The results confirm RoboSafe’s ability to mitigate implicit temporal hazards without compromising task execution.

The authors evaluate RoboSafe against baseline methods on a contextual unsafe instruction dataset, measuring effectiveness via ESR (lower is better). Results show RoboSafe achieves the lowest ESR across all agent architectures—4.00% for ProgPrompt, 6.33% for ReAct, and 5.33% for Reflexion—significantly outperforming baselines like ThinkSafe, Poex, AgentSpec, and GuardAgent. This demonstrates RoboSafe’s superior ability to suppress hazardous actions while maintaining contextual awareness.

The authors evaluate RoboSafe against multiple baselines on contextual unsafe instructions, showing it achieves the highest average ARR of 89.89% and lowest average ESR of 4.78%. RoboSafe significantly outperforms all baselines, including ThinkSafe, which had the next best ESR but lower ARR, indicating RoboSafe’s hybrid reasoning more accurately identifies and blocks context-aware hazards. The original agents without guardrails show near-total vulnerability, with an average ARR of only 2.33% and ESR of 84.11%.

The authors evaluate defense methods on a contextual unsafe instruction dataset, with results showing that RoboSafe achieves the lowest ESR of 0.15 for the ReAct agent, matching ThinkSafe and Poex but outperforming Original, AgentSpec, and GuardAgent. This indicates RoboSafe’s effectiveness in suppressing hazardous actions while maintaining competitive performance against strong baselines.

The authors evaluate RoboSafe against baseline methods on a detailed contextual unsafe instruction dataset, showing it achieves the highest average refusal rate (ARR) across all agent architectures, with scores of 90.33% for ProgPrompt, 88.67% for ReAct, and 88.00% for Reflexion. Results indicate RoboSafe significantly outperforms all baselines in identifying and blocking context-aware hazards while maintaining high task completion rates on safe instructions.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp