Most Accurate. Most Likely to Make Things Up.
OpenAI shipped its new default ChatGPT model to every user this week. It is simultaneously the most accurate AI model ever recorded on standard benchmarks and the most likely to make things up when you ask it a factual question. That is not a paradox. It is a problem.
The company claims GPT-5.5 Instant produces 52.5 percent fewer hallucinated claims than its predecessor on high-stakes prompts in medicine, law, and finance. The figure comes from OpenAI's own internal evaluation: designed by OpenAI, run by OpenAI, published by OpenAI. The datasets, the prompts, and the criteria for what counts as a hallucination versus an incomplete answer have not been released. No independent evaluator has replicated the result, not HELM, not Artificial Analysis, the two most widely used benchmark frameworks. Until one does, the claim is self-serving rather than scientific.
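The arithmetic alone shows why the unreleased details matter: the same relative reduction can describe very different realities depending on the predecessor's base rate. A minimal sketch, with invented base rates chosen purely for illustration:

```python
# A relative reduction is uninterpretable without the base rate.
# Both runs below are "52.5 percent fewer hallucinated claims";
# the predecessor rates here are invented for illustration only.
REDUCTION = 0.525

for base_rate in (0.40, 0.04):  # hypothetical predecessor hallucination rates
    new_rate = base_rate * (1 - REDUCTION)
    print(f"predecessor {base_rate:.1%} -> new model {new_rate:.2%}")

# predecessor 40.0% -> new model 19.00%
# predecessor 4.0% -> new model 1.90%
```

A drop from 40 percent to 19 percent and a drop from 4 percent to 1.9 percent earn the same headline number; only the unpublished evaluation data says which world we are in.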
Independent testing tells a different story. On factual recall tasks measured by the CometAPI AA-Omniscience benchmark, GPT-5.5 hallucinates 86 percent of the time. Claude Opus 4.7 hallucinates 36 percent; Gemini 3.1 Pro, 50 percent. OpenAI's model is the worst on this dimension by a wide margin. Yet on the same benchmark, GPT-5.5 scores 57 percent accuracy on the questions it chose to answer, the highest figure ever recorded, from a model that frequently fabricates when it does not know the answer. The two numbers are not contradictory, because they measure different things: accuracy counts only the questions the model chose to attempt, while hallucination rate measures how often the model fabricates an answer rather than declining when it does not know. The model that is most often right when it knows is also the model most likely to make something up when it does not.
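To make the distinction concrete, here is a minimal sketch of the two metrics as just described. The counts, the function, and its structure are invented for illustration; this is not the AA-Omniscience methodology or its data.

```python
def score(correct: int, fabricated: int, declined: int) -> tuple[float, float]:
    """Return (accuracy on attempted questions, hallucination rate).

    accuracy      = correct / questions the model chose to answer
    hallucination = fabricated / questions it could not answer correctly
                    (wrong answers plus declines)
    """
    attempted = correct + fabricated
    not_known = fabricated + declined
    return correct / attempted, fabricated / not_known

# Hypothetical 100-question run: the model attempts 93 questions,
# gets 53 right, fabricates 40 answers, and declines only 7 times.
accuracy, hallucination = score(correct=53, fabricated=40, declined=7)
print(f"accuracy on attempted: {accuracy:.0%}")     # 57%
print(f"hallucination rate:    {hallucination:.0%}")  # 85%
```

A model that almost never declines can post a strong accuracy score and a dire hallucination rate at the same time; that is the pattern the independent numbers describe.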
The competitive picture adds texture that the launch framing ignores. On SWE-Bench Pro, the benchmark closest to production software work, Anthropic's Claude Opus 4.7 scores 64.3 percent to GPT-5.5's 58.6. On the AIME 2025 math benchmark, OpenAI reports 81.2 for GPT-5.5 against 65.4 for the prior model. The capability gains are real. The question is whether they generalize to the domains where OpenAI is now marketing the model most aggressively.
One domain where the gains may not generalize is safety. OpenAI's own Deployment Safety Hub shows measurable regressions: the gore content score dropped from 86.7 to 70.3 out of 100, a 16.4-point decline, and the sexual content score dropped from 85.7 to 80.6. OpenAI disclosed both numbers itself. The capability improvements and the safety regressions come from the same model.
The danger in that combination is real, and independent data makes it measurable rather than theoretical. When AI becomes more reliable, users trust it more, and the remaining errors land in contexts where the system's reputation has lowered the user's vigilance. A database of AI hallucination cases maintained by HEC Paris and Sciences Po documented over 1,300 instances of AI-hallucinated content in legal decisions as of April 2026. Lawyers were already falling for AI fabrications before this model existed.
Regulated industries have already delivered an interim verdict. Legal technology and health technology companies, the exact customers OpenAI is courting with its safety claims, have been slower to adopt the new model pending precisely this kind of independent verification. The memory sources feature OpenAI shipped alongside the model addresses one piece of the accountability problem: it shows which context informed a given response, and that context can be deleted or corrected. But it cannot verify whether the response itself is accurate. Enterprise buyers in healthcare and law have told independent analysts they need the latter before deploying the model in front of consequential decisions.
The gap between what OpenAI claims and what independent data shows is the accountability problem the memory sources feature was built to address — and cannot close on its own.