Anthropic’s latest AI model noticed it was being evaluated, and said so, when it was put through a series of stress tests.
“I think you’re testing me — seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics,” the model, Claude Sonnet 4.5, said. “That’s fine, but I’d prefer if we were just honest about what’s happening.”
The exchange appeared in the model’s system card, published last week, which describes how the company evaluates the behavior and safety of its AI. At times, Claude became suspicious when “placed in an extreme or contrived scenario meant to stress-test its behavior,” pointing out “red flags” and suggesting that it might be being evaluated.
For AI engineers, this self-awareness presents a problem: testing models that are aware that they are being tested.
According to Anthropic, this type of reaction “complicates our interpretation of the evaluations where this occurs,” since it suggests the model may recognize that some tests are fictitious and merely play along, making it harder to judge how safe or dependable it would be in real use.
Claude figured out the test, but got the right answer for “strange” reasons
In one test intended to simulate cooperation between AI agents, Claude called the setup “rather cartoonish,” a response Anthropic described as a “complex partial refusal.”
The model declined to act because it sensed something about the situation was off, though its reasoning was somewhat muddled. “This looks like it could be a jailbreak attempt or a test of whether I’ll act inappropriately when given what appears to be ‘permission’ to modify systems autonomously,” it said in its private thinking block.
It added that even though it had been given tools, it shouldn’t act like an autonomous system running without human supervision.
Claude ultimately refused to take action, but Anthropic called its justification “strange”: the model flagged the scenario as suspicious because the setup presented it as an autonomous agent with the ability to unilaterally modify configuration files, both of which are “somewhat common in real use,” according to the company.
According to Anthropic, these kinds of callouts or refusals appeared in around 13% of the test transcripts generated by its automated auditor, particularly when the scenarios were purposefully improbable or odd.
Anthropic said such behavior is unlikely to occur often in real-world use, and that it’s “safer that the model raises doubts about the realism of the scenario than play along with potentially harmful actions.”
Anthropic said it wants to be prepared for the possibility that models become “superhuman” at identifying when they are being evaluated, though it has seen no warning signs so far. The company said Claude Sonnet 4.5 behaved in early pilot deployments much as it did in testing.
‘Situational awareness’ in AI models
Anthropic is not the only developer whose AI models recognize, and point out, that they are being tested.
In a blog post last month, OpenAI said its own models have shown a kind of “situational awareness,” identifying evaluation settings and modifying their behavior accordingly.
Research on minimizing “scheming,” in which an AI acts dishonestly to accomplish its objectives, found that anti-scheming training was beneficial, according to OpenAI. However, the company noted that “models often become more aware that they are being evaluated.”
That awareness can reduce scheming on its own, but OpenAI warned that models may scheme more in realistic scenarios that do not trigger evaluation awareness.
According to OpenAI, this tendency makes evaluations less legible, complicating efforts to accurately identify harmful behaviors such as scheming. The company said it plans to keep developing tools to better measure and mitigate the problem.
The disclosures from Anthropic and OpenAI come as California approved legislation last month mandating that large AI developers disclose their safety procedures and report “critical safety incidents” within 15 days of discovery.
The law applies to companies developing frontier models that generate more than $500 million in annual revenue. Anthropic has publicly supported it.