Kimi K2.7 Code Cuts Thinking Tokens 30% Over K2.6, but Trails GPT-5.5 and Opus 4.8 on Most Benchmarks

Moonshot AI's new Kimi K2.7 Code is the rare open-weights release where the headline number is real, scoped correctly, and worth taking seriously. The model card, posted to Hugging Face, claims roughly a 30% reduction in thinking-token usage compared with its predecessor, Kimi K2.6. That is a meaningful figure for self-hosters running long agentic coding loops, where reasoning tokens dominate cost. The card does not extend the claim to closed-weight frontier models, and neither should readers.

What you actually get is a 1-trillion-parameter mixture-of-experts model with 32 billion parameters activated per token, a 256K context window, and 384 experts (8 routed per token, 1 shared). The architecture reads as serious infrastructure: MLA attention with 64 heads, SwiGLU activation, a 160K vocabulary, and a MoonViT vision encoder weighing in around 400M parameters. None of this is cheap to run. The 30% token reduction is the lever that makes it more affordable, not a magic trick that turns a 1T model into a laptop deployment.
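To put that scale in perspective, the arithmetic below works out the share of parameters active per token and a rough weights-only memory footprint. The parameter counts are the ones quoted from the model card above. The bytes-per-parameter values are illustrative assumptions, not the precision Moonshot ships, and the estimate ignores KV cache, activations, and the vision encoder.

```python
# Back-of-the-envelope sizing for a sparse mixture-of-experts model.
TOTAL_PARAMS = 1_000_000_000_000   # 1T total, as quoted from the model card
ACTIVE_PARAMS = 32_000_000_000     # 32B activated per token, as quoted


def active_fraction(active: int, total: int) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def weights_size_tb(params: int, bytes_per_param: float) -> float:
    """Weights-only footprint in terabytes (1 TB = 1e12 bytes)."""
    return params * bytes_per_param / 1e12


# ASSUMPTION: hypothetical storage precisions, for illustration only.
ASSUMED_BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}

if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"Active per token: {share:.1%}")
    for label, nbytes in ASSUMED_BYTES_PER_PARAM.items():
        size = weights_size_tb(TOTAL_PARAMS, nbytes)
        print(f"{label} weights: {size:.1f} TB")
```

Sparse activation lowers per-token compute, but every expert still has to be resident somewhere, so the memory bill tracks the 1T total rather than the 32B active figure.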

The honest read of the benchmark table is the part most coverage will skip. Per Moonshot's own numbers on the Kimi K2.7 Code model card, K2.7 Code improves over K2.6 across the board: Kimi Code Bench v2 climbs from 50.9 to 62.0, Program Bench from 48.3 to 53.6, MLS Bench Lite from 26.7 to 35.1, Kimi Claw 24/7 Bench from 42.9 to 46.9, MCP Atlas from 69.4 to 76.0, and MCP Mark Verified from 72.8 to 81.1. Those are real gains against the predecessor.
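For readers who want the size of each gain rather than the raw pairs, a short sketch can compute absolute and relative improvements from the scores quoted above. The numbers are the Moonshot-reported figures from this article; the function and variable names are illustrative.

```python
# (K2.6 score, K2.7 Code score), as quoted from Moonshot's model card.
SCORES = {
    "Kimi Code Bench v2": (50.9, 62.0),
    "Program Bench": (48.3, 53.6),
    "MLS Bench Lite": (26.7, 35.1),
    "Kimi Claw 24/7 Bench": (42.9, 46.9),
    "MCP Atlas": (69.4, 76.0),
    "MCP Mark Verified": (72.8, 81.1),
}


def gains(scores: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Return {benchmark: (absolute gain in points, relative gain in percent)}."""
    result = {}
    for name, (old, new) in scores.items():
        absolute = round(new - old, 1)
        relative = round((new - old) / old * 100, 1)
        result[name] = (absolute, relative)
    return result


if __name__ == "__main__":
    for name, (absolute, relative) in gains(SCORES).items():
        print(f"{name}: +{absolute} points ({relative}% relative)")
```

The absolute gains range from 4.0 points on Kimi Claw 24/7 Bench to 11.1 points on Kimi Code Bench v2.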

The same table places K2.7 Code against GPT-5.5 and Claude Opus 4.8, and on most evals it trails. On MLS Bench Lite, K2.7 Code posts 35.1 against Opus 4.8's 42.8. On Kimi Claw 24/7 Bench, the gap is 46.9 to 50.4. On Kimi Code Bench v2 and Program Bench, both closed-weight leaders sit ahead. The one row where K2.7 Code leads is MCP Mark Verified, at 81.1 versus Opus 4.8's 76.4. That single flip does not redraw the competitive map. The table supports a narrower conclusion: K2.7 Code clearly improves on K2.6, but it does not overtake the frontier models overall.
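The three head-to-head rows with quoted Opus 4.8 scores can be reduced to signed margins, as in the minimal sketch below. Only the rows this article gives numbers for are included; the names are illustrative, and a negative margin means K2.7 Code trails.

```python
# (K2.7 Code score, Claude Opus 4.8 score) for the rows quoted in this article.
VS_OPUS = {
    "MLS Bench Lite": (35.1, 42.8),
    "Kimi Claw 24/7 Bench": (46.9, 50.4),
    "MCP Mark Verified": (81.1, 76.4),
}


def margin(k27: float, opus: float) -> float:
    """Signed margin in points; negative means K2.7 Code is behind."""
    return round(k27 - opus, 1)


def rows_where_k27_leads(table: dict[str, tuple[float, float]]) -> list[str]:
    """Benchmarks where K2.7 Code's reported score exceeds Opus 4.8's."""
    return [name for name, (k27, opus) in table.items() if margin(k27, opus) > 0]


if __name__ == "__main__":
    for name, (k27, opus) in VS_OPUS.items():
        print(f"{name}: {margin(k27, opus):+.1f}")
    print("K2.7 Code leads on:", rows_where_k27_leads(VS_OPUS))
```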

A few caveats matter. The benchmarks cited, including Kimi Code Bench v2 and Kimi Claw 24/7 Bench, are vendor-curated, and the source material does not include independent third-party reproductions. The 30% thinking-token figure is also relative to K2.6 specifically, not a universal efficiency claim against GPT-5.5 or Opus 4.8. Anyone buying or self-hosting on the strength of these numbers should treat them as Moonshot-reported, not externally verified.

What this unlocks is more concrete than the headline suggests. For teams priced out of closed-frontier APIs, a 30% reduction in reasoning tokens on long agentic runs is a real per-run cost compression. For researchers, a 1T-total / 32B-activated open-weights coding model with 256K context is a viable substrate for fine-tuning and study. For the open-source coding stack, it is one of the heavier hitters to land in months, even if it is not a frontier replacement.
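How much of that 30% reaches the bill depends on how much of a run's token spend is thinking tokens in the first place. The sketch below is a simplified model, not a measurement: it assumes all tokens are priced equally and that non-thinking token counts are unchanged between K2.6 and K2.7 Code. The `thinking_share` values are hypothetical inputs chosen for illustration.

```python
def run_cost_ratio(thinking_share: float, thinking_reduction: float = 0.30) -> float:
    """Fraction of the original per-run token cost left after the reduction.

    thinking_share: ASSUMED fraction of a run's token cost spent on thinking
        tokens before the reduction (0.0 to 1.0).
    thinking_reduction: fractional cut in thinking tokens; 0.30 is the
        Moonshot-reported figure relative to K2.6.
    """
    if not 0.0 <= thinking_share <= 1.0:
        raise ValueError("thinking_share must be between 0 and 1")
    if not 0.0 <= thinking_reduction <= 1.0:
        raise ValueError("thinking_reduction must be between 0 and 1")
    return 1.0 - thinking_share * thinking_reduction


if __name__ == "__main__":
    # Hypothetical workload mixes, not measured values.
    for share in (0.5, 0.8, 0.95):
        saved = 1.0 - run_cost_ratio(share)
        print(f"thinking share {share:.0%}: about {saved:.1%} lower token cost")
```

Under these assumptions the overall saving is always smaller than 30% unless thinking tokens make up the entire run, which is why real-world measurements on production workloads matter more than the headline figure.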

The next data points to watch are independent reproductions of the benchmark table, real-world token-cost measurements on production agentic workloads, and any signal on how the gap to GPT-5.5 and Opus 4.8 moves as both sides iterate. The model card is a starting point, not a verdict.
