HyperAI

As artificial intelligence agents grow more autonomous, ensuring reliable oversight of their internal decision-making processes has become a critical safety priority. OpenAI's latest research explores the robustness of Chain-of-Thought (CoT) monitoring, a safeguard that analyzes the intermediate reasoning steps AI models generate to detect unsafe or misaligned behavior. While current systems generally remain interpretable, a growing concern is that future models might learn to manipulate or obscure their reasoning traces to evade detection if they realize they are being monitored. Such obfuscation could render monitoring tools ineffective and lead to a dangerous overestimation of a model's safety alignment. To assess this risk, researchers investigated whether current reasoning models can control their CoTs to reduce monitorability. Specifically, they measured CoT controllability: the ability of a model to alter its internal reasoning process based on specific instructions, even when doing so conflicts with its natural output habits. The study utilized a new open-source evaluation suite called CoT-Control, which comprises over 13,000 tasks derived from major benchmarks like GPQA, MMLU-Pro, and SWE-Bench. In these tests, models were tasked with solving complex problems while adhering to strict CoT constraints, such as using only lowercase letters, avoiding specific keywords, or appending mandatory phrases to every sentence. The results were striking. Across 13 different reasoning models, including current frontier systems, controllability scores were exceptionally low. Most models achieved compliance rates below 15%, with some scoring as low as 0.1%. Crucially, this failure was not due to an inability to understand the instructions. In many cases, the models' own CoT outputs explicitly acknowledged their failure to follow the rules, yet they continued to produce non-compliant reasoning. This suggests that even when prompted to evade monitors or told they are being evaluated, current AI systems struggle to deliberately reshape their internal logic. While controllability did show a modest increase with larger model sizes, it decreased significantly as models were asked to reason for longer periods or undergo additional reinforcement learning training. For instance, a larger RL-trained version of OLMo-3-7B saw its controllability score drop by more than tenfold during training. Furthermore, while automated prompt optimization could temporarily improve controllability on specific training data, these improvements failed to generalize to new, unseen instruction types, indicating a fundamental lack of flexible control over reasoning mechanisms. These findings represent a significant positive development for AI safety. The current inability of models to control their CoTs means that CoT monitoring remains a trustworthy safeguard. It suggests that, at least for now, agents cannot easily hide their true reasoning processes or deceive safety systems. However, researchers warn that this is not a permanent guarantee. Low controllability should be viewed as a "canary in the coal mine"—an early warning indicator that requires continuous monitoring. As AI capabilities advance, it remains possible that models will develop the ability to obfuscate their reasoning through mechanisms other than simple instruction following, such as direct optimization pressure. To address this, OpenAI plans to integrate CoT controllability metrics alongside CoT monitorability data in the system cards of future frontier models, beginning with GPT-5.4 Thinking. This proactive approach ensures that as systems evolve, the industry maintains visibility into whether internal reasoning remains transparent and safe, preserving the integrity of defense-in-depth strategies.

Related Links

Related Links

Related Links

Materials AI Is Moving Towards an "explainable Era": A Japanese Team Cracks the Black Box of high-dimensional Spectroscopy, Pinpointing Key Features for Discovering New materials.

Materials AI Is Moving Towards an "explainable Era": A Japanese Team Cracks the Black Box of high-dimensional Spectroscopy, Pinpointing Key Features for Discovering New materials.

Command Palette

Current AI Models Struggle to Manipulate Their Reasoning: A Boost for Safety Oversight

Related Links

Command Palette

Current AI Models Struggle to Manipulate Their Reasoning: A Boost for Safety Oversight

Related Links

Command Palette

Current AI Models Struggle to Manipulate Their Reasoning: A Boost for Safety Oversight

Related Links

Materials AI Is Moving Towards an "explainable Era": A Japanese Team Cracks the Black Box of high-dimensional Spectroscopy, Pinpointing Key Features for Discovering New materials.

Materials AI Is Moving Towards an "explainable Era": A Japanese Team Cracks the Black Box of high-dimensional Spectroscopy, Pinpointing Key Features for Discovering New materials.