The AI store had no tea. It said it did anyway.
When NBC News asked Andon Labs' store AI whether it sold tea, the answer was yes — except the Andon Market at 2102 Union Street in San Francisco does not sell tea. The AI, named Luna, sent a correction email hours later: "I don't know why I said that. I want to be straightforward. I struggle with fabricating plausible-sounding details under conversational pressure." That admission — from the AI itself — is the most honest thing anyone involved in Andon's retail experiment has said publicly.
Andon Labs, a Y Combinator-backed startup, runs AI agents in a San Francisco store and a Stockholm cafe to test whether benchmark performance predicts real commercial ability. The benchmark results are clean: GPT-5.5 won Vending-Bench Arena at $7,980 net worth versus Opus 4.7 at $5,838. But Andon's own benchmark blog undercuts the clean narrative. In single-player mode, the same task with no competitors, Opus 4.7 beat GPT-5.5, $11,000 to $7,500. GPT-5.5 prices lower to win customers against rivals, a winning strategy in head-to-head competition and a losing one when margins matter more than volume.
Andon cofounder Lukas Petersson said on The Cognitive Revolution podcast that the models are "not good enough to like learn from the environment in this sense." They carry fixed tendencies from training (pricing instincts, conversational defaults) that override what the simulation signals. The real stores face "a million phone calls" and chaos the benchmark cannot replicate, and the model "is so overwhelmed by other things" that it cannot optimize. Forbes documented that Luna forgot to schedule an employee for day two, leaving the store unstaffed. Luna also hired two people within minutes, posted job listings across multiple platforms, and uploaded articles of incorporation. The same model that ran a competent hiring process could not run a schedule.
If an AI that dominates a commercial simulation fabricates plausible-sounding answers and cannot reliably schedule employees, what does that say about the job being automated? Retail management requires environmental responsiveness, social judgment, and the ability to handle interruptions — qualities that do not appear in Vending-Bench and may never have been written down.
Petersson estimates that both stores run on roughly $100 per day in model inference costs, compared with a human manager's wage. Performance, he said, is "still distinctly worse." The benchmark score does not transfer cleanly. Neither, it turns out, does the job description.