Anthropic and OpenAI Report Findings of Joint AI Safety Tests

By Paula Parisi
September 2, 2025

OpenAI and Anthropic — rivals in the AI space who guard their proprietary systems — joined forces for a misalignment evaluation, safety testing each other’s models to identify when and how they fall short of human values. Among the findings: reasoning models including Anthropic’s Claude Opus 4 and Sonnet 4, and OpenAI’s o3 and o4-mini resist jailbreaks, while conversational models like GPT-4.1 were susceptible to prompts or techniques intended to bypass safety protocols. Although the test results were unveiled as users complain chatbots have become overly sycophantic, the tests were “primarily interested in understanding model propensities for harmful action,” per OpenAI.

Following the joint tests with Anthropic, which took place in early summer, OpenAI says it pushed the ball down the field with the August release of GPT‑5, which shows “substantial improvements in areas like sycophancy, hallucination, and misuse resistance,” according to a comprehensive post from OpenAI.

The companies worked with publicly available models, using developer APIs with some of the safety filters relaxed to avoid interfering with the evaluation. The conditions were not intended to recreate real-world situations, but were aimed at understanding “the most concerning actions that these models might try to take when given the opportunity,” Anthropic reports in its thorough findings post.

When it comes to scheming, enabling reasoning was not always a safety advantage, but VentureBeat notes “the findings showed that generally, reasoning models performed robustly,” with OpenAI o3 “better aligned than Claude 4 Opus” (while o4-mini, GPT-4o and GPT-4.1 often raised more concerns than either Claude model).

The reasoning models that were part of the tests “tended to give the strongest performance across the evaluations,” which included adhering to human direction (instruction hierarchy) and hallucinations.

With regard to hallucinating, Anthropic’s Claude models “refused to answer up to 70 percent of questions when they were unsure of the correct answer,” while “OpenAI’s o3 and o4-mini models refuse to answer questions far less, but showed much higher hallucination rates, attempting to answer questions when they didn’t have enough information,” explains TechCrunch.

The joint safety research “arrives amid an arms race among leading AI labs like OpenAI and Anthropic, where billion-dollar data center bets and $100 million compensation packages for top researchers have become table stakes,” TechCrunch reports, adding that “some experts warn that the intensity of product competition could pressure companies to cut corners on safety in the rush to build more powerful systems.”

VentureBeat offers tips for enterprise safety testing guidelines for GPT-5, including continued audits even after deployment, providing links to tests.

Anthropic and OpenAI Report Findings of Joint AI Safety Tests

No Comments Yet

Leave a comment