Alibaba's Qwen 3.6 Plus approaches Anthropic's Claude Opus 4.6 on core software engineering benchmarks while running at nearly twice the inference speed — 158 tokens per second versus 93.5 — according to results published by Bridgemind and corroborated across community benchmark trackers.
The numbers, released March 30 when the model landed on OpenRouter as a free preview, present a capability-speed combination that frontier labs have not yet shipped at this tier. On SWE-bench Verified, which tests whether a model can resolve real-world GitHub issues, Qwen 3.6 Plus scored 78.8, narrowly behind Claude Opus 4.6's 80.8 per the March 2026 leaderboard (source-reported; not independently verified), a gap most observers would call competitive. On Terminal-Bench 2.0, which stress-tests agentic terminal coding tasks, RenovateQR reported Qwen 3.6 at 61.6, ahead of Claude Opus 4.5's 59.3, though evaluation timing differences across reporting periods make that comparison imperfect.
The inference speed gap is where the story gets unusual. SpeedBench, which measures median throughput under sustained load, recorded Qwen 3.6 Plus at 158 tokens per second. Claude Opus 4.6, which Anthropic released on February 5, hit 93.5 tokens per second in the same conditions. That is a 69 percent speed advantage for a model that is not obviously behind on the tasks that matter most to developers.
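The 69 percent figure is just the relative difference between the two reported medians; a quick sketch of the arithmetic, using only the throughput numbers from the benchmarks above:

```python
# Reported median throughput under sustained load (tokens per second).
qwen_tps = 158.0
opus_tps = 93.5

# Relative speed advantage of Qwen 3.6 Plus over Claude Opus 4.6.
advantage = (qwen_tps - opus_tps) / opus_tps
print(f"{advantage:.1%}")  # → 69.0%
```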
"What you're seeing is the first open-weight model that doesn't make you choose between capability and speed," one engineer wrote in a widely shared benchmark thread. It is the kind of claim that sounds like marketing until the numbers back it up, and in this case, they mostly do.
Qwen 3.6 Plus comes with a significant caveat, however. On Bridgemind's Hallucination Bench, which tests whether a model's chain-of-thought reasoning produces fabricated information, the model posted a fabrication rate of 26.5 percent: roughly one in four reasoning steps contained invented facts, code references, or citations that did not hold up under verification. The number stands out because the rest of the benchmark profile is so strong. A model that reasons quickly and accurately on most tasks but confabulates at that rate is useful for exploration and dangerous for production deployment without careful oversight.
The model also runs always-on chain-of-thought reasoning, dropping the toggle between thinking and non-thinking modes that the Qwen 3.5 series offered. Bridgemind flagged the median time-to-first-token at 11,520 milliseconds on the free preview tier: over eleven seconds before the first word appears on a shared inference service. That latency reflects the free tier's queue and architecture overhead rather than a product spec, but it illustrates the gap between throughput and responsiveness that prospective users should understand.
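Time-to-first-token is easy to measure yourself: start a clock, pull the first item off a streaming response, and stop. A minimal sketch, using a stand-in generator in place of a real streaming API response (the delay here is illustrative, not the 11,520 ms figure Bridgemind's harness reported):

```python
import time

def time_to_first_token(stream):
    """Return (seconds waited for the first token, the token itself).

    `stream` is any iterator of tokens; a real streaming inference
    response can be wrapped the same way.
    """
    start = time.monotonic()
    first = next(stream)
    return time.monotonic() - start, first

# Stand-in for a streaming inference response (hypothetical data):
# the sleep models queue wait plus prefill before the first token.
def fake_stream(delay_s=0.05):
    time.sleep(delay_s)
    yield "Hello"
    yield ", world"

ttft, token = time_to_first_token(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, first token: {token!r}")
```

The same wrapper distinguishes throughput from responsiveness: a model can emit 158 tokens per second once it starts and still feel slow if the first token takes eleven seconds.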
Qwen 3.6 Plus is built on what sources describe as a next-generation hybrid architecture, a successor to the 3.5 series design. The one-million-token context window is competitive with every frontier model currently available.
On BridgeBench UI Bench, which tests graphical interface automation, Qwen 3.6 Plus scored 80.2 and ranked second behind GPT-5.4. On Security Bench, it posted an 82.4 average score with a 43.3 percent task success rate on hidden evaluation tests — respectable but below the top-tier frontier models on security-specific tasks.
What this adds up to is a model that makes a real case for being the most practical open-weight option at the high end of capability right now. The inference speed advantage, if it holds at scale beyond the free preview tier, changes the economics of deploying capable models in applications where latency matters — coding agents, terminal automation, document understanding. The fabrication rate means no one should ship it to end users without a human in the loop.
Claude Opus 4.6 remains the benchmark most labs compare against, and the gap has clearly narrowed. The question is whether that gap was ever about raw capability or about the integration and safety work that makes a model reliable enough to bet a product on. On raw numbers, Qwen 3.6 Plus is close. On trustworthiness at scale, the 26.5 percent fabrication rate is a reminder that benchmarks and production are different countries.
The model is available as a free preview on OpenRouter now.
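For readers who want to try it, OpenRouter exposes an OpenAI-compatible chat completions endpoint. A minimal request-construction sketch; the model slug below is a guess based on OpenRouter naming conventions, so check the live model list for the real identifier:

```python
import json
import urllib.request

# Hypothetical slug -- verify against OpenRouter's model list.
MODEL = "qwen/qwen3.6-plus:free"

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for OpenRouter."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        # Stream so the eleven-second time-to-first-token on the free
        # tier is visible rather than silently folded into total latency.
        "stream": True,
    }
    return urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_request("Summarize this GitHub issue.", api_key="sk-...")
# urllib.request.urlopen(req) would send it; omitted here.
```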