Qwen's New Model Fits on a Laptop. The Benchmark Claims Haven't Been Verified.
The 27B model beats a 397B flagship on coding benchmarks — according to Qwen's own tests. Nobody else has run the numbers.

Alibaba's Qwen team has released a 27-billion-parameter model that its own tests say beats a model fifteen times its size on coding tasks. The catch: nobody except Qwen has run those tests.
Qwen3.6-27B, released April 22 on Hugging Face, is a dense model requiring 55.6GB of storage. The model it surpasses on Qwen's own benchmarks, Qwen3.5-397B-A17B, needs 807GB: the difference between a file that fits on a laptop and a deployment that needs a rack of servers. The smaller model ran at 25.57 tokens per second via llama-server on the laptop of AI researcher Simon Willison, using a quantized 16.8GB version small enough for consumer hardware.
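The published file sizes are consistent with some simple arithmetic: 27 billion parameters at 16-bit precision is roughly 54GB, close to the listed 55.6GB, and the 16.8GB quantized file works out to about 5 bits per weight, in line with common 4-to-5-bit llama.cpp quantizations. A back-of-the-envelope check (the per-weight figures are estimates, not numbers Qwen has published):

```python
# Rough consistency check of the published file sizes.
# Real GGUF files carry per-block scales and metadata, so these
# are approximations, not exact accounting.

params = 27e9  # 27 billion parameters

# Full-precision storage: 16-bit weights are 2 bytes each.
full_gb = params * 2 / 1e9
print(round(full_gb, 1))  # 54.0 -- close to the listed 55.6GB

# Effective precision of the 16.8GB quantized download.
quant_bytes = 16.8e9
bits_per_weight = quant_bytes * 8 / params
print(round(bits_per_weight, 1))  # 5.0 bits per weight
```

The small gap between 54GB and the listed 55.6GB is plausibly embeddings, metadata, and non-quantized layers, though the exact breakdown is not published.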
On SWE-bench Verified, a benchmark that tests whether an AI model can resolve real software issues from open-source GitHub repositories, Qwen3.6-27B scored 77.2 percent. Qwen3.5-397B-A17B scored 76.2. The gap is one percentage point.
On Terminal-Bench 2.0, which evaluates models on terminal-based software engineering tasks, the new model scored 59.3 against the older model's 52.5 — a wider margin, but still a single benchmark.
Those numbers come from Qwen's own Hugging Face model page. The company has not published the evaluation methodology, the hardware configuration used in testing, or the exact prompt templates. Independent researchers have not replicated the results.
Willison ran the model on a task of his own choosing: generating an SVG of a pelican riding a bicycle. The output was detailed and correct. He called it an "outstanding result for a 16.8GB local model." His full session transcript is on GitHub. The test was real. It was also not a coding benchmark.
"This time it is indeed in the training set, because it is too good to be true," one commenter wrote in the Hacker News discussion of Willison's post. Others pushed back. The debate is unresolved. Qwen has not responded to questions about training-data contamination in the benchmarks.
The compression story is real regardless. A 55.6GB model that matches the coding performance of an 807GB model represents a genuine efficiency jump, the kind that, if it holds at scale, changes the economics of where AI coding tools can run. A solo developer with a mid-range GPU and 16GB of VRAM can now run, with a little CPU offloading, a model that an engineering team at a mid-size company could not afford to self-host eighteen months ago.
Whether it matches that performance on SWE-bench Verified specifically is a different question. The one-point gap between the two models falls within the range where a different random seed, a different evaluation harness, or a different temperature setting can flip the ranking. Terminal-Bench 2.0's 6.8-point gap is more robust, but one benchmark is not a verdict.
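The noise argument can be made concrete. SWE-bench Verified contains 500 human-validated tasks, so one percentage point is five tasks, and the binomial sampling error at these score levels is itself wider than the reported gap. A quick sketch, treating each task as an independent pass/fail (which understates correlated sources of variance like harness and prompt differences):

```python
import math

# SWE-bench Verified has 500 human-validated task instances.
n = 500
p = 0.772  # Qwen3.6-27B's self-reported score

# How many tasks one percentage point represents:
print(round(0.01 * n))  # 5 tasks

# Standard error of a binomial proportion at this score level:
se = math.sqrt(p * (1 - p) / n)
print(round(se * 100, 1))  # 1.9 percentage points -- wider than the 1.0-point gap
```

By this crude measure the two models' SWE-bench scores are within one standard error of each other, which is the statistical version of the article's point: the ranking could flip on a rerun.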
What would verify the claim is a third party downloading Qwen3.6-27B, running the official SWE-bench Verified evaluation with published methodology, and publishing the raw numbers. That has not happened yet.
The model is real. The benchmarks are not yet confirmed.
