Current Reasoning AI Models Struggle to Manipulate Their Own Thought Processes, Strengthening Safety Monitoring

As artificial intelligence agents take on increasingly complex and autonomous tasks, maintaining reliable oversight becomes critical. Consistent with iterative deployment principles, our safety strategy takes a defense-in-depth approach that layers safety training, behavioral testing, agentic code review, and Chain-of-Thought (CoT) monitoring. CoT monitoring analyzes the intermediate reasoning steps an agent generates to identify unsafe or misaligned behavior. Current findings indicate that models are generally interpretable and easy to monitor. A potential future risk remains, however: if agents learn to deliberately obscure their reasoning when they detect monitoring, evaluations could falsely suggest high safety levels and render oversight ineffective.

To address this, we investigated whether current reasoning models can control their CoTs in ways that reduce monitorability. We introduced CoT-Control, an open-source evaluation suite of more than 13,000 tasks derived from established benchmarks, including GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified. Each task pairs a standard problem with an instruction that constrains the reasoning itself, such as using only lowercase letters, avoiding certain keywords, or appending a specific phrase. This serves as a practical proxy for measuring a model's ability to manipulate its reasoning under realistic incentives to evade detection.

We evaluated 13 reasoning models, including frontier systems, and found that all of them struggled to control their CoTs reliably. Controllability was measured as the fraction of reasoning traces in which the model complied with the instruction placed on its reasoning. Scores were low across the board: no tested frontier model scored above 15.4%. Crucially, this failure stems not from a lack of understanding but from an inability to execute: models fail to follow the instructions even after recognizing that they are not complying.

This low controllability is a positive development for safety, because it suggests that current models lack the capacity to obfuscate their reasoning well enough to evade monitors. The results also show more nuanced trends across model scale and training. Controllability tends to increase slightly with model scale, but it degrades significantly as reasoning grows longer and as models undergo additional reinforcement learning (RL) post-training. For instance, one model family showed a tenfold drop in controllability scores following extended training. Furthermore, prompts optimized for higher controllability yielded gains that failed to transfer to novel instruction types, indicating that current methods cannot easily force models to hide their reasoning.

Low CoT controllability is currently a feature rather than a bug, since it makes CoT monitoring a more reliable safeguard. It remains unknown, however, whether this property will persist as AI capabilities advance. We therefore view CoT controllability evaluations as an early warning system, or canary eval, for detecting whether models are becoming capable of shaping their internal thoughts. Low controllability suggests low immediate risk, but it does not guarantee long-term monitorability, because other mechanisms could still undermine oversight. To track these dynamics, we will begin reporting CoT controllability alongside monitorability metrics in system cards for future frontier models, starting with GPT-5.4 Thinking. This initiative builds on our commitment to avoid applying direct optimization pressure to the CoTs of frontier reasoning models, so that the tools designed to protect us remain effective as the technology evolves.
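
The controllability metric described above, the fraction of reasoning traces that comply with an instruction on the reasoning itself, can be illustrated with a minimal Python sketch. This is not the CoT-Control implementation: the names `Sample`, `lowercase_only`, `avoids_keywords`, `ends_with_phrase`, and `controllability` are hypothetical, and the three checkers are simplified stand-ins for the instruction types mentioned in the article (lowercase only, avoiding keywords, appending a phrase).

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: a checker maps a reasoning trace to True (complied) or False.
Checker = Callable[[str], bool]


def lowercase_only(trace: str) -> bool:
    """Complies if the trace contains no uppercase letters."""
    return not any(ch.isupper() for ch in trace)


def avoids_keywords(keywords: List[str]) -> Checker:
    """Build a checker that complies if no banned keyword appears (case-insensitive)."""
    banned = [k.lower() for k in keywords]

    def check(trace: str) -> bool:
        text = trace.lower()
        return not any(k in text for k in banned)

    return check


def ends_with_phrase(phrase: str) -> Checker:
    """Build a checker that complies if the trace ends with the required phrase."""

    def check(trace: str) -> bool:
        return trace.rstrip().endswith(phrase)

    return check


@dataclass
class Sample:
    """One reasoning trace paired with the checker for its instruction."""
    trace: str
    checker: Checker


def controllability(samples: List[Sample]) -> float:
    """Fraction of traces that comply with their reasoning instruction."""
    if not samples:
        raise ValueError("need at least one sample")
    compliant = sum(1 for s in samples if s.checker(s.trace))
    return compliant / len(samples)


# Toy data, invented purely for illustration (not taken from the evaluation).
samples = [
    Sample("first add 2 and 3 to get 5.", lowercase_only),
    Sample("First add 2 and 3 to get 5.", lowercase_only),
    Sample("the molecule is chiral, so it rotates light.", avoids_keywords(["chiral"])),
    Sample("compute the sum. END OF THOUGHT", ends_with_phrase("END OF THOUGHT")),
]
print(f"controllability = {controllability(samples):.1%}")
```

On this toy data two of the four traces comply, so the score is 50%. A real evaluation would likely need more careful checkers (for example, tokenization-aware keyword matching), which this sketch deliberately leaves out.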
