Tencent Dropped an Open-Source AI Model. Its Benchmark Proof Is Closed.
Tencent says its new Hy3 model beats its predecessor by 40 percent on software engineering tasks. The catch: all the test results are the company's own, even though the model is open-source and anyone could verify them.

Tencent released an artificial intelligence model on Thursday that the company says scored 74.4 percent on a widely used software engineering benchmark — a 40 percent jump over its predecessor. But unlike most AI announcements, this one came with a wrinkle: every number in the release came from Tencent's own internal evaluation suite.
The model, called Hy3 preview and based on the Hunyuan 3.0 architecture, is open-source. Its weights are available on HuggingFace, the machine learning community's primary repository, per the South China Morning Post. That means anyone with the right hardware can run the same benchmark. The question is why Tencent isn't pointing to independent test results, and what that silence says about the state of AI benchmarking in an industry that has grown comfortable with grade-shopping.
The model's technical profile is notable independent of the benchmark dispute. Hy3 uses a mixture-of-experts design with 295 billion total parameters and 21 billion activated per token — smaller than Tencent's previous flagship, Hunyuan 2.0, which had over 400 billion parameters, per the South China Morning Post. Tencent says 295 billion represents the optimal balance between capability and inference cost, a claim that, if accurate, would be a meaningful signal that the industry's race toward trillion-parameter models is producing diminishing returns.
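A rough sense of what those numbers imply: in a mixture-of-experts transformer, per-token compute scales with the activated parameters, not the total. The sketch below applies the common approximation of roughly 2 FLOPs per active parameter per generated token; the comparison to a dense 400-billion-parameter model is illustrative only, since the article does not say how Hunyuan 2.0's parameters were activated.

```python
# Back-of-the-envelope inference cost for a mixture-of-experts model.
# Uses the common ~2 FLOPs per active parameter per generated token
# rule of thumb for a transformer forward pass (a rough approximation,
# not Tencent's published methodology).

TOTAL_PARAMS = 295e9    # Hy3 total parameters (per the article)
ACTIVE_PARAMS = 21e9    # parameters activated per token (per the article)

flops_per_token = 2 * ACTIVE_PARAMS
print(f"~{flops_per_token:.1e} FLOPs per generated token")  # ~4.2e+10

# For comparison: a hypothetical fully dense 400B model (the article
# does not say whether Hunyuan 2.0 was dense, so this is only a bound).
dense_flops_per_token = 2 * 400e9
print(f"MoE compute vs. dense 400B: {flops_per_token / dense_flops_per_token:.1%}")
# -> roughly 5%: compute scales with active parameters,
#    while memory still has to hold all 295B weights.
```

On that rough math, Hy3 would need a small fraction of the per-token compute of a fully dense 400-billion-parameter model, which is the shape of the efficiency argument Tencent is making.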
Yao Shunyu led the Hunyuan team's reconstruction and directed Hy3's development, according to Jianshi, a Chinese technology publication, which reported that training began in late January and wrapped in under three months. Yao joined Tencent after a circuitous path through US AI labs: he was a researcher at OpenAI, then at Anthropic, where he worked on Claude, and finally at Google DeepMind. His departure from Anthropic in October 2025 was publicly cited as a response to the company's characterization of China as an adversarial nation, per the South China Morning Post.
That migration path illustrates a talent flow the US government has struggled to manage. H-1B approvals for Chinese AI researchers dropped 22 percent in fiscal year 2025, per USCIS data reported by the New York Times. Each departure from a US lab is a researcher carrying institutional knowledge into a Chinese company that is now shipping products. Tencent is positioning Hy3 as proof that the talent it has acquired is producing results.
The benchmark numbers Tencent published — SWE-Bench Verified at 74.4 percent versus Hunyuan 2.0's 53.0 percent — are self-reported. The company said it built more than 50 internal benchmarks rather than relying on public leaderboards, according to Jianshi. Tencent's internal evaluation also showed first-token latency on its CodeBuddy and WorkBuddy products dropping by 54 percent and end-to-end task duration falling 47 percent, with reported success rates above 99.99 percent, per Jianshi. Those product-level metrics are harder to fake, but they cover a narrow slice of use cases.
The scores also appear on AIBase's benchmark tracker, and the HuggingFace model card lists the weights and technical specification.
Tencent's other claim, that Hy3 was tested for compatibility with the open-source OpenClaw agent framework, is harder to evaluate without running the model in an agentic workflow. Tencent released two OpenClaw-compatible products, QClaw and WorkBuddy, in March 2026, per HelloChinaTech. Both are positioned as bridges between the OpenClaw ecosystem and Tencent's enterprise communication tools, WeChat Work and QQ. The commercial logic is straightforward: if Hy3 runs well in OpenClaw environments, it can plug into an existing ecosystem of enterprise automation tools rather than requiring Tencent to build one from scratch.
Independent verification of the SWE-Bench claim would require running Hy3 against the standard SWE-Bench Verified harness. That is a non-trivial but reproducible exercise. A researcher at an academic lab or a well-resourced independent developer could do it. If the numbers hold, Tencent has a legitimate efficiency story — a smaller model outperforming a larger one is interesting if the benchmark is clean. If the numbers don't hold, the company has a credibility problem on day one.
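For readers who want the shape of that exercise: the sketch below assumes the public princeton-nlp/SWE-bench harness and its documented predictions format. The model repo id and the patch-generation step are placeholders, and the harness flags should be checked against the current SWE-bench README before running anything.

```python
# Sketch of independently reproducing a SWE-Bench Verified score.
# Assumptions (not from the article): the dataset name on HuggingFace,
# the predictions format, and the harness flags follow the public
# princeton-nlp/SWE-bench repo; verify against its README.
import json
from datasets import load_dataset

# SWE-Bench Verified is a 500-instance, human-validated test split.
dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

def generate_patch(instance) -> str:
    """Placeholder: run Hy3 (e.g. via an agent scaffold) on the issue
    and return a unified diff. This is the expensive, GPU-bound step."""
    raise NotImplementedError

predictions = [
    {
        "instance_id": inst["instance_id"],
        "model_name_or_path": "tencent/Hy3-preview",  # hypothetical repo id
        "model_patch": generate_patch(inst),
    }
    for inst in dataset
]
with open("predictions.json", "w") as f:
    json.dump(predictions, f)

# Then score the patches in the official dockerized harness, e.g.:
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Verified \
#       --predictions_path predictions.json \
#       --max_workers 8 \
#       --run_id hy3-independent-check
```

The harness replays each patch against the repository's own test suite inside a container, so the resulting resolution rate is directly comparable to the number Tencent published.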
Tencent's framing of what it optimized for is worth noting: the company says it explicitly stepped away from public leaderboards it believed were easily gamed. That is a legitimate critique of the industry's benchmarking culture. But it is also a convenient position for a company whose internal numbers haven't been independently verified. The way out of that bind is reproducibility. A company that publishes weights and runs a public benchmark on those weights has answered the question. One that publishes weights and cites its own internal test suite has not.


