The US-China AI performance gap has nearly vanished. As of March 2026, the leading models from Anthropic and ByteDance sit just 2.7% apart on the industry-standard test of AI skills, according to the Stanford HAI 2026 AI Index. This is the story everyone is writing.
The story nobody is writing is the accountability gap alongside it.
The Index, released this month, documents 362 AI incidents in 2025, up 55% from 233 the year before. That number comes from voluntary self-reporting; the real count is unknown. The organizations most responsible for deploying AI at scale report almost nothing on safety or fairness metrics. The 2024 AI Index had already found that OpenAI, Google, and Anthropic primarily test their models against different responsible AI frameworks: no shared standard, no comparable numbers, no way for a buyer or regulator to make an informed judgment about real-world risk.
The hallucination numbers make the problem concrete. Across 26 models tested on the AA-Omniscient Index, hallucination rates ranged from 22% to 94%, both figures produced by the same test and the same methodology. At its best, a model fabricates information in more than one in five responses. At its worst, it makes things up almost every time. That range is not a measurement artifact. It is the range of what you are actually buying when you deploy one of these systems, and the capability benchmarks that get published do not tell you which end of that range you are on.
What the Index does measure: organizational AI adoption hit 88% in 2025. Generative AI reached 53% population-level penetration in three years, faster than the PC or the internet. US private AI investment reached $285.9 billion in 2025, roughly 23 times China's $12.4 billion. The capability gap did not close because AI stopped mattering. It closed because American investment scaled to the point where a frontier lab in Beijing could no longer outrun a frontier lab in San Francisco.
Among AI experts surveyed in the Index, 73% expect AI to have a positive impact on how people do their jobs. Among the general public, that number is 23%. That 50-point gap is not a communication failure. It reflects a genuine accountability vacuum: the people building the technology cannot tell the people living with it whether it is safe, because the industry never built the infrastructure to measure that.
The Stanford report has a conflict-of-interest problem, disclosed in the fine print. The 2026 AI Index was written with help from ChatGPT and Claude, the very systems it evaluates, and received financial support from Google and OpenAI, two of the companies it most directly assesses. A safety report written by the products it assesses and funded by the companies it excuses has a structural incentive to find what it found.
For founders and engineers building on these models, the practical consequence is straightforward: you are making product decisions with no usable public data on the safety characteristics that most determine real-world risk. The capability benchmarks are there. The accountability benchmarks are not. That is not a market failure. It is a choice the industry has made, repeatedly, because accountability metrics do not close funding rounds.
What happens next depends on who forces the question. Courts are beginning to ask what "safe" means in the context of AI deployment, and the industry has no standard answer to give them. Regulators in the EU and US are moving toward mandatory disclosure requirements, but the technical infrastructure to produce comparable safety data does not yet exist. The labs know this. The question is whether the pressure builds fast enough to matter before the next wave of incidents.