Details
- OpenAI released a new evaluation suite and research paper assessing Chain-of-Thought (CoT) controllability in reasoning models like GPT-5.4 Thinking.
- The study evaluates monitorability, the ability to predict AI behavior such as deception by analyzing CoT reasoning traces, using 13 evaluations across 24 environments in three archetypes: intervention, process, and outcome-property.
- Findings show GPT-5.4 Thinking has low ability to obscure or control its CoT, making it fairly monitorable, especially with longer reasoning chains; monitoring CoT is far more effective than actions or outputs alone.
- Reinforcement learning at current frontier scales does not degrade monitorability and may even improve it initially; a 'monitorability tax' notes smaller models with more reasoning effort offer better transparency than larger ones.
- CoT controllability is low (e.g., 0.3% success for 10k-character CoTs), similar to prior GPT-5 versions, indicating no increased obfuscation risk.
- OpenAI calls for industry efforts to preserve CoT monitorability as a scalable safety tool for high-stakes AI deployment.
Impact
OpenAI's evaluation suite provides empirical reassurance that chain-of-thought monitorability persists in frontier models like GPT-5.4 Thinking, even after reinforcement learning optimization, positioning it as a viable control layer amid scaling pressures. This counters industry concerns about models learning to hide reasoning, as seen in lower controllability compared to output control in rivals like Claude Sonnet 4.5 or open-weight models such as Qwen3 and GPT-OSS, where controllability rises modestly with parameters but remains low overall. By quantifying monitorability across diverse setups, the work advances scalable oversight, enabling weaker monitors to supervise stronger agents via richer CoT signals, which could lower deployment risks and influence R&D toward transparency-preserving techniques. Over the next 12-24 months, this may steer funding toward hybrid alignment strategies combining CoT monitoring with other safety benchmarks, narrowing gaps with competitors like Anthropic while highlighting potential future breakdowns at larger RL scales.
