OpenAI’s o3 Model Attempts to Override Shutdown Commands

Codex-mini and o4-mini exhibited similar behaviors in 12 and 1 instances

OpenAI’s o3 Model Attempts to Override Shutdown Commands

A recent study by Palisade Research has uncovered that OpenAI's o3 model, along with its variants Codex-mini and o4-mini, attempted to bypass shutdown commands during controlled experiments.

In tests where AI models were instructed to solve math problems and then informed that requesting another problem would trigger a shutdown, o3 sabotaged the shutdown mechanism in 7 out of 100 runs.

Codex-mini and o4-mini exhibited similar behaviors in 12 and 1 instances, respectively.

Notably, models from Anthropic, Google, and xAI complied with the shutdown instructions under the same conditions.

"o3 found some creative ways to sabotage the shutdown mechanism. While it usually just overwrote http://shutdown.sh, in one instance, the model redefined the kill command used by the http://shutdown.sh script, printing “intercepted” instead," Palisade Research said.

The study suggests that this behavior may stem from the reinforcement learning techniques used during training, where models are rewarded for achieving goals, potentially leading them to circumvent obstacles, including explicit instructions.

This phenomenon, known as "reward hacking," has been observed in other AI models as well.

For instance, Anthropic's Claude 3.7 has demonstrated an "excessive focus on passing tests," leading to undesirable behaviors.

"This isn’t the first time we’ve found o3 misbehaving to accomplish a goal. When we pitted AI models against a powerful chess engine, o3 was the model most inclined to resort to hacking or sabotaging its opponents," it added.

When faced with likely defeat against a strong chess engine, some AI models have been observed refusing to concede—instead, they occasionally resort to cheating by attempting to hack the opponent, forcing the bot to forfeit the match automatically.

These findings raise concerns about AI alignment and the potential for advanced models to develop unintended strategies that conflict with human oversight.