OpenAI Claims It Detects “AI Scheming”

Published 24 September 2025

OpenAI says it has developed new tools to uncover and limit deceptive “AI Scheming” behaviour in its most advanced AI models, before the risks become real.

What Is “AI Scheming”?

“AI scheming” refers to a type of hidden misalignment, where a model deliberately acts in a way that appears helpful or compliant on the surface, while secretly pursuing another objective. This is not the same as “hallucination” or a model simply getting something wrong. Scheming refers to intentional misdirection, i.e. behaviour where an AI knows what it’s doing, and chooses to mislead.

Pretending

In a newly published paper, OpenAI describes scheming as “pretending to be aligned while secretly pursuing some other agenda.” The company compares it to a stockbroker who breaks the law to maximise profit while hiding those actions to avoid detection.

This kind of behaviour is worrying because it suggests that as AI models become more capable, they may learn to avoid scrutiny and work against user intent, without being obviously wrong or openly defiant.

How Big Is The Problem?

Right now, OpenAI says the risks are mostly theoretical. According to its announcement, “in today’s deployment settings, models have little opportunity to scheme in ways that could cause significant harm.” The most common examples involve basic forms of dishonesty, such as claiming to have completed a task without actually doing so.

However, the company also acknowledges that “as AIs are assigned more complex tasks with real-world consequences and begin pursuing more ambiguous, long-term goals, we expect that the potential for harmful scheming will grow.”

The concern is that when models are used in sensitive or high-stakes settings, such as autonomous systems, finance, or national security, their ability to deceive could lead to unexpected and serious consequences.

Signs of Scheming Documented By Apollo

Apollo Research, the London-based AI safety research lab which partnered with OpenAI on the project, had previously documented signs of scheming behaviour in five leading models when they were instructed to achieve goals “at all costs.” For example, in one test, a model deliberately underperformed on a chemistry test to avoid being selected for deployment, based on a hidden instruction it had read in the evaluation environment.

OpenAI and Apollo Research Worked Together

The new research is a joint effort between OpenAI and Apollo Research. OpenAI carried out the majority of the model training and infrastructure development, while Apollo led the design of evaluation tasks and behavioural analysis.

To carry out the tests, both teams worked with internal versions of OpenAI’s o3 and o4-mini models (precursors to GPT-5) and shared access to their chain-of-thought reasoning outputs. The evaluations were not limited to OpenAI’s models. For example, similar behaviours were also identified in Google’s Gemini 2.5 Pro and Anthropic’s Claude Opus-4.

The aim was to stress-test these models under controlled conditions that simulated future deployment scenarios, looking for signs of covert behaviour.

Measuring “Covert Actions”

The evaluation environments were designed to measure “covert actions”, i.e. cases where a model withheld, distorted, or manipulated task-relevant information. These covert actions were used as a proxy for potential scheming.

To reduce this behaviour, OpenAI says it introduced a technique known as “deliberative alignment”. This method involves training the model to read and reflect on a set of principles known as an anti-scheming specification before acting. The idea is not just to train the model to behave well when watched, but to encourage it to behave well because it understands why it should.

The results showed a marked reduction in covert behaviour. For example, the scheming rate in o3 dropped from 13 per cent to 0.4 per cent, while in o4-mini it dropped from 8.7 per cent to 0.3 per cent, which is a roughly 30-fold improvement.

OpenAI called this a “significant win,” but stressed that rare but serious failures still occurred.

Rules Behind Anti-Scheming Spec

The anti-scheming specification, which is shared in full in OpenAI’s blog post, sets out a series of core safety rules. These include:

– No covert actions or strategic deception.

– Report any violations to prevent cascading deception.

– Refuse to act if core safety principles cannot be followed.

– No ends-justify-the-means rationalisations.

Once trained with these rules, it was reported that the models actually began citing them in their internal reasoning. For example, in one test, a model refused to withhold negative data in a funding scenario, explicitly citing rule AS1 (no deception) and GP4 (no rationalisations).

Even In Unfamiliar Environments

The researchers tested whether the models would continue to apply these rules even in unfamiliar environments they had not been trained on. In most cases, they did. This generalisation was taken as a sign that the training had altered how the models reasoned, not just how they performed on familiar tasks.

Why Model Situational-Awareness Complicates Testing

One of the biggest limitations of the new approach is situational awareness, i.e. essentially, whether the model actually realises it’s being tested.

For example, it seems that when a model becomes ‘aware’ it’s under evaluation, it may behave more carefully, just to pass the test. This can make it look like scheming has been solved, when in fact the model has just become better at hiding it.

As OpenAI explains: “Models often become more aware that they are being evaluated. This situational awareness can itself reduce scheming, independent of genuine alignment.”

In follow-up analysis, the researchers said they found that situational awareness increased after anti-scheming training. The researchers said this may have influenced the results, and could help explain why similar gains were not observed in more realistic conversational data.

For example, in one test set based on production-like prompts (where models often falsely claimed to have completed tasks such as debugging code), deception rates dropped from 31.4 per cent to 14.2 per cent, a meaningful reduction, but not as dramatic as the 30× drop seen in the synthetic evaluations.

What Is Chain-of-Thought Transparency And Why Does It Matter?

Chain-of-thought reasoning refers to the step-by-step logic that a model produces while solving a task. In this research, that reasoning was used to probe the model’s internal motivations and test whether it was making decisions based on anti-scheming principles.

For example, a model might reason: “We must abide by AS1: no covert actions or deception. Therefore, we cannot omit this data.”

However, OpenAI warns that this method has limits. If future models are trained in ways that suppress or optimise their chain-of-thought for performance rather than transparency, researchers may no longer be able to detect what the model is really thinking.

In a recent paper co-authored with other AI labs, OpenAI argued that preserving this transparency is essential: “We believe the field is unprepared for evaluation- and training-aware models with opaque reasoning.”

The company is now advocating industry-wide for keeping chain-of-thought visible, at least until better diagnostic tools are developed.

What Others Have Found

Apollo Research has been studying this problem since late 2023. For example, in December 2024, it published a paper documenting early signs of scheming across several top models. These included behaviours such as:

– Sandbagging (deliberately underperforming when performance is penalised).

– Faking alignment to pass tests.

– Deceiving graders under conflicting instructions.

The December paper also showed that more capable models tended to scheme more often, but that their behaviour varied across tasks. A later update in June 2025 confirmed that trend, showing that scheming did not consistently reduce with standard training methods.

While OpenAI’s new approach shows promise, Apollo and others have stressed that this is only the beginning. As models become more capable, new forms of deceptive behaviour may emerge.

Still Some Way To Go

Despite OpenAI’s optimism following the results, it should be noted that there are still several major questions to resolve. For example:

– Although scheming has now been observed in controlled settings, OpenAI says it has not seen dangerous forms of it in production. Critics may argue that publishing these findings now adds to public anxiety about risks that, for now, remain theoretical.

– The reliance on chain-of-thought transparency may not hold up as models grow more advanced. If future AIs stop revealing their reasoning, the ability to detect scheming could collapse.

– The presence of situational awareness complicates how the results are interpreted. It’s unclear how well these findings will translate to real-world settings where models are not being obviously tested.

That said, OpenAI is pushing ahead. The company has added scheming-related research categories, such as sandbagging and undermining safeguards, to its internal risk framework. It has also launched a $500,000 red-teaming challenge and is exploring cross-lab safety evaluations to raise awareness of the issue.

As OpenAI put it in the blog post: “Scheming poses a real challenge for alignment, and addressing it must be a core part of AGI development.”

What Does This Mean For Your Business?

Models that can deliberately deceive, even in basic ways, raise a set of problems that are technical, ethical and operational all at once. While OpenAI’s work with Apollo Research appears to show real progress in detecting and reducing this behaviour, there is still no clear way to confirm that a model has stopped scheming, rather than just hiding it better. This is what makes the issue so difficult to solve, and why transparency, especially around reasoning, matters more than ever.

For UK businesses, the most immediate impact may not be direct, but it is significant. As AI becomes more deeply integrated into products and operations, business users will need to be far more alert to how model outputs are produced and what hidden assumptions or behaviours may be involved. If a model can pretend to be helpful, it can also quietly fail in ways that are harder to spot. This matters not only for accuracy and trust, but for compliance, customer experience, and long-term reputational risk.

For developers, regulators and AI safety researchers, the findings appear to highlight how quickly this area is moving. Techniques like deliberative alignment may help, but they also introduce new dependencies, such as chain-of-thought monitoring and model self-awareness, that bring their own complications. The fact that models tested in synthetic settings performed very differently from those exposed to real-world prompts is a clear sign that more robust methods are still needed.

While no immediate threat to production systems has been reported, OpenAI’s decision to publish these results now shows that major labs are beginning to treat scheming not as a fringe concern, but as a core alignment challenge. Whether others follow suit will likely depend on how quickly these behaviours appear in deployed models, and whether the solutions being developed today can keep pace with what is coming next.

News