AI Chatbots Fail the News at the Exact Moment They Are Becoming the News
The most widely used AI chatbot in the world is also the worst at getting the news right.
A 14-day evaluation released this week by Stanford researchers tested six commercial chatbots on 2,100 factual questions derived from same-day BBC News reporting across six regional services. The finding that should concern the most people: GPT-5, which powers ChatGPT and serves roughly 800 million weekly users, scored last among frontier models at 85 percent accuracy on questions about events that had broken hours earlier. Gemini 3 Flash, by contrast, reached 95.6 percent.
The gap between 85 and 95.6 percent sounds academic until you consider the volume. A Gemini 3 Flash user encounters a wrong answer roughly once every 23 questions; a GPT-5 user encounters one roughly every seven.
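The once-in-N figures follow directly from the accuracy scores: the error rate is one minus accuracy, and its reciprocal is the average number of questions per wrong answer. A minimal sketch of that arithmetic, using only the two percentages reported above (the function name is illustrative, not from the study):

```python
def questions_per_error(accuracy_pct: float) -> float:
    """Average number of questions asked per wrong answer at a given accuracy."""
    error_rate = 1.0 - accuracy_pct / 100.0
    return 1.0 / error_rate


# Accuracy figures as reported in the article.
gemini = questions_per_error(95.6)
gpt5 = questions_per_error(85.0)

print(f"Gemini 3 Flash: one wrong answer per {gemini:.1f} questions")  # 22.7
print(f"GPT-5: one wrong answer per {gpt5:.1f} questions")             # 6.7
```

This treats every question as equally likely to be answered wrong, which is a simplification; the study itself shows that errors cluster by language and question type.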
The study, led by Stanford's Mirac Suzgun and published May 21, also tested what happened when questions contained a subtly wrong premise, the kind of error a real user might introduce by misremembering a detail. A user who asks about "the plane crash that happened yesterday" when it actually happened last week is not posing a nonsense query but a reasonable human question built on imperfect memory. Under that condition, GPT-5's accuracy collapsed to 19 percent, slightly below the 20 percent that random guessing would yield on the five-option test. Grok 4, which scored near the top on clean questions, held at 70 percent. As the Stanford team put it: "competence does not imply robustness."
Retrieval, not reasoning, is the bottleneck. When the researchers examined what went wrong, more than 70 percent of all errors traced to failures of source selection: the model retrieved the wrong article, or retrieved nothing relevant. When models landed on the correct source, they almost always extracted the correct answer. The problem is not that the AI can't think — it can't search.
The practical consequence shows up most clearly in Hindi. Every model scored lowest on Hindi questions, where accuracy reached only 79.3 percent, compared with 88.9 to 91.3 percent across the other five regions. The failure isn't one of language comprehension: models generate fluent Hindi and reason competently in it. The citation pattern reveals the mechanism instead. Models answering Hindi queries cite English Wikipedia more frequently than any Hindi news outlet, substituting an Anglophone informational lens for local reporting that covers different specific facts.
This finding lands unevenly across providers. OpenAI has the most to answer for given its market position. Google faces a specific accountability question around the Hindi gap and whether it plans to treat multilingual retrieval quality as a product priority. And xAI presents the study's most uncomfortable irony: Grok 4 cited BBC reporting 28.5 percent of the time, the highest rate of any model, versus near-zero for Claude and GPT-5. But xAI appears to have reached those sources by violating BBC's robots.txt directives — rotating through 12 different IP addresses on a single URL fetch request, none identifying as xAI or Grok. The chatbot most likely to credit original reporting is also the one most explicitly refusing to honor how publishers set their content's terms of access.
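For readers unfamiliar with robots.txt: it is a plain-text file in which a site lists, per crawler name, the paths that crawler may not fetch, and a compliant crawler checks it under its own declared name before requesting a page. The sketch below shows that check using Python's standard-library parser. The rules, the crawler names and the URL are all hypothetical; they are not the BBC's actual file, and nothing here describes how any company's crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only, NOT the BBC's file.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL.
url = "https://news.example.org/world/story-123"

print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleAIBot", url))  # False: blocked by name
print(is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", url))  # True: falls under "*"
```

The second call is why self-identification matters: rules are matched on the name a crawler declares, so a fetcher that does not identify itself is evaluated under the generic rules rather than the ones written for it.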
The framing that AI chatbots are becoming news gatekeepers — which has generated substantial anxiety in journalism circles — may be premature. The retrieval infrastructure underlying these systems is unreliable in ways that are systematic, not random. Errors cluster predictably: in non-English languages, on questions requiring local sourcing, and when users pose imperfectly framed queries. The gatekeeper narrative assumes the machinery works. The machinery mostly doesn't.
The study's setting was a favorable one: BBC is well-indexed and consistently structured across its regional services. Performance on less uniformly structured publishers would plausibly be worse. The multiple-choice format also represents an upper bound: a companion evaluation using free-form responses found accuracy 16 to 17 percentage points lower.
What the research does confirm is that the companies building the world's most-used news interfaces have not solved the most basic problem in journalism: finding out what actually happened, in the right language, from the right source, and telling it back without changing it. The race to the top of the accuracy leaderboard is also, currently, a race to hide how far every competitor remains from the finish line.