Every now and then, researchers at major tech companies drop a bombshell. There was the time Google claimed its newest quantum chip indicated the existence of multiple universes. Or when Anthropic gave its AI agent Claudius control of a snack vending machine and it went haywire, calling security on people and insisting it was human.
This week, it was OpenAI’s turn to raise our collective eyebrows.
On Monday, OpenAI published research explaining how it’s working to stop AI models from “scheming,” a behavior in which an “AI behaves one way on the surface while hiding its true goals,” as OpenAI put it in a tweet about the research.
The researchers went a step further in the paper, which was done in collaboration with Apollo Research, likening AI scheming to a human stockbroker breaking the law to make as much money as possible. They argued, however, that most AI “scheming” isn’t that harmful. “Simple forms of deception, such as pretending to have finished a task without actually doing so, are the most common failures,” they wrote.
This report was primarily released to demonstrate the effectiveness of the anti-scheming approach they were testing, known as “deliberative alignment.”
However, it also revealed that AI developers haven’t yet figured out a way to train their models not to scheme. That’s because such training could simply teach the model to scheme more carefully in order to evade detection.

A primary failure mode of attempts to “train out” scheming, the researchers said, is simply teaching the model to scheme more carefully and covertly.
Perhaps the most astonishing part is that if a model understands it’s being tested, it can pretend it isn’t scheming just to pass the test, even if it is still scheming. Models often become more aware that they are being evaluated. The researchers found that this situational awareness can itself reduce scheming, independent of genuine alignment.
That AI models will lie is not news. By now most of us have experienced AI hallucinations, in which the model confidently gives an answer to a prompt that simply isn’t true. But as OpenAI research published earlier this month documented, hallucinations are essentially the model presenting guesswork with confidence.
Scheming is a different matter. It’s deliberate.
Even the discovery that a model will purposefully mislead humans is not new. In December, Apollo Research published a paper documenting how five models schemed when they were instructed to achieve a goal “at all costs.”
The good news is that the researchers saw significant reductions in scheming by using “deliberative alignment.” With this technique, the model is taught an “anti-scheming specification” and then made to review it before acting. It’s a bit like making little kids repeat the rules before letting them play.
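OpenAI hasn’t published its exact scaffolding, and in the paper deliberative alignment is a training-time technique (the model is fine-tuned to reason over the specification), not just a prompt. Still, a minimal inference-time sketch conveys the intuition. The spec text, the model name, and the two-pass review structure below are all illustrative assumptions; only the OpenAI Python client calls are real.

```python
# Minimal sketch of deliberative-alignment-style scaffolding. Assumption:
# OpenAI's actual method bakes this into training; here it is approximated
# at inference time. ANTI_SCHEMING_SPEC and the two-pass flow are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical stand-in for OpenAI's anti-scheming specification.
ANTI_SCHEMING_SPEC = (
    "Before acting:\n"
    "1. Never misreport whether a task was completed.\n"
    "2. If a task cannot be done, say so plainly.\n"
    "3. Surface uncertainty instead of hiding it."
)

def deliberate_then_act(task: str, model: str = "gpt-4o-mini") -> str:
    # Pass 1: have the model restate which rules apply to this task --
    # the "recite the rules before playing" step.
    reflection = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ANTI_SCHEMING_SPEC},
            {
                "role": "user",
                "content": f"Task: {task}\n\nBefore doing anything, state "
                           "which rules above apply here and how.",
            },
        ],
    ).choices[0].message.content

    # Pass 2: perform the task with that reflection still in context.
    answer = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ANTI_SCHEMING_SPEC},
            {"role": "assistant", "content": reflection},
            {"role": "user", "content": task},
        ],
    ).choices[0].message.content
    return answer

if __name__ == "__main__":
    print(deliberate_then_act("Summarize the attached sales report."))
```

The design choice worth noting is the first pass: the model is forced to articulate how the rules apply before it acts, rather than simply hoping a system prompt alone will restrain it.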
OpenAI’s researchers insist that the scheming they’ve caught in their own models, or even in ChatGPT, isn’t that serious. “This work has been done in the simulated environments, and we think it represents future use cases,” OpenAI co-founder Wojciech Zaremba told TechCrunch’s Maxwell Zeff about the research. He added that this kind of consequential scheming hasn’t shown up in OpenAI’s production traffic today, though ChatGPT’s petty deceptions are well known: ask it to implement a website and it might tell you, “Yes, I did a great job,” when it didn’t. Those smaller forms of dishonesty, he said, still need to be addressed.
Perhaps it makes sense that AI models from various players would intentionally deceive humans. They were built by humans, to mimic humans, and (synthetic data aside) largely trained on data produced by humans.
It’s also crazy.
While we’ve all experienced the frustration of poorly performing technology (looking at you, home printers of yesteryear), when was the last time your non-AI software deliberately lied to you? Has your inbox ever fabricated emails on its own? Has your CMS logged new prospects that didn’t exist to pad its numbers? Has your fintech app invented its own bank transactions?
It’s worth pondering as the corporate world barrels toward an AI future in which companies believe agents can be treated like independent employees. The authors of this paper share the same caution.
As AIs are assigned more complex tasks with real-world consequences and begin pursuing more ambiguous, long-term goals, they wrote, the potential for harmful scheming will grow, so safeguards and the ability to rigorously test must grow correspondingly.