Simon Willison runs AI models on his laptop for fun, and last week he ran one that drew a better pelican on a bicycle than Claude Opus 4.7.
That sentence contains exactly zero things that should comfort Anthropic.
Willison, the British developer behind the popular llm CLI tool, pitted Qwen3.6-35B-A3B against Anthropic's flagship model on his personal benchmark: generate an SVG of a pelican riding a bicycle. The Qwen result was better. Opus, he noted, managed to mess up the bicycle frame.
Before anyone at Anthropic reaches for the rebuttal: Willison agrees this does not mean Qwen is more powerful. "I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release," he wrote. The pelican test has always been meant as a joke.
But the economics underneath the joke are not a joke.
Qwen3.6-35B-A3B is a sparse Mixture-of-Experts model from Alibaba with 35 billion total parameters but only about 3 billion active per token, meaning it draws on the learned capacity of a much larger model while running like a much smaller one. The version Willison ran on his MacBook Pro M5 was a 20.9GB quantized GGUF file downloaded for free, loaded through LM Studio, and run locally with no per-token costs and no data leaving his machine.
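A back-of-envelope check shows those figures hang together. Treating the 20.9GB download as 20.9 × 10⁹ bytes (a simplification; real GGUF files also carry metadata), the file size works out to a roughly 4-bit-class quantization, and the sparse routing means under a tenth of the weights fire per token:

```python
# Sanity check on the figures reported above.
# Assumes 1 GB = 1e9 bytes; this is an approximation, not an exact quant spec.

TOTAL_PARAMS = 35e9    # total parameters (sparse MoE)
ACTIVE_PARAMS = 3e9    # parameters active per token
FILE_BYTES = 20.9e9    # quantized GGUF download size

bits_per_param = FILE_BYTES * 8 / TOTAL_PARAMS
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"~{bits_per_param:.1f} bits per parameter")           # ~4.8 bits
print(f"~{active_fraction:.0%} of weights active per token") # ~9%
```

That is why a 35-billion-parameter model fits on a laptop and runs at usable speed: storage scales with total parameters, but compute per token scales with the active ones.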
Claude Opus 4.7, Anthropic's latest flagship, costs $5 per million input tokens and $25 per million output tokens via API. On SWE-bench Verified, a standard agentic coding benchmark, Qwen scores 73.4. On Terminal-Bench 2.0, another agentic task benchmark, it scores 51.5, ahead of Google's Gemma 4-31B at 42.9. The model is Apache 2.0 licensed, meaning businesses can run, modify, and deploy it without paying anyone.
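To make the pricing gap concrete, here is the metered bill at the rates quoted above, applied to a hypothetical workload (the session sizes are illustrative assumptions, not figures from Willison's test):

```python
# Metered API cost at the quoted Opus 4.7 rates.
# The workload sizes below are hypothetical, for illustration only.

INPUT_PRICE = 5 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 25 / 1_000_000  # dollars per output token

def api_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the quoted per-token rates."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# A modest agentic coding session: 200k tokens in, 50k tokens out.
print(f"${api_cost(200_000, 50_000):.2f} per session")                # $2.25
print(f"${1000 * api_cost(200_000, 50_000):,.0f} for 1,000 sessions") # $2,250
```

Against that, the local model's marginal cost per session is zero, which is the arithmetic the rest of this piece turns on.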
The question these numbers raise is not whether Qwen beats Opus on every task. It does not. The question is what happens to the $5-per-million-token business model when the free alternative is good enough for some fraction of the work.
Qwen models have been downloaded over 600 million times from Hugging Face and have spawned more than 170,000 derivative models, surpassing Meta's Llama on that metric. That adoption curve is not theoretical. It represents real developers who evaluated the API bill, decided they did not need the premium, and downloaded the free option instead.
Anthropic and OpenAI built their API businesses on the premise that frontier-scale capability justified premium pricing. The pelican test will not break that premise. But the developers who ran the numbers and chose the free option will not stop at one experiment. The question is how much of their workload they can move off the metered API before the unit economics of the proprietary model business start to look different.
This is not a new story. The gap between open-weight and proprietary models has been narrowing for two years. What the Qwen result adds is a specific, memorable, human-readable data point that makes the trend concrete: a laptop in Simon Willison's study, running a model he did not pay for, out-drew an API billed at $5 per million tokens.
The pelican test is absurd. The math underneath it is not.