Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding
When Google benchmarked its new TPU inference stack last week, the finding that mattered was not the 3X speedup number. It was that verifying 1,024 tokens costs almost exactly the same as verifying 16.
That single measurement, which Google's engineers call K-Flat, is the architectural bet behind DFlash, a diffusion-style speculative decoding technique that Google shipped on TPU v5p this week. Speculative decoding works by having a fast draft model propose candidate tokens and a larger target model verify and accept them. The conventional wisdom was that verification cost scaled with the number of draft tokens being checked. Google's measurements suggest it does not, at least not on its hardware. Generating the draft is the expensive part; verifying a long block costs roughly what verifying a short one does.
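For readers who want the mechanics in code, here is a minimal sketch of why block verification can be flat in K: the target model scores the whole draft block in one forward pass, so the dominant cost is that single pass rather than one step per token. The function name, the `target_logits_fn` interface, and the greedy acceptance rule are illustrative assumptions, not Google's implementation.

```python
import numpy as np

def verify_block(target_logits_fn, prefix_ids, draft_ids):
    """Score a whole draft block with one target-model forward pass.

    target_logits_fn(ids) -> logits of shape [len(ids), vocab_size]. Because
    the block is scored in a single pass, wall-clock cost stays roughly flat
    as the number of draft tokens K grows (the K-Flat observation).
    """
    ids = np.concatenate([prefix_ids, draft_ids])
    logits = target_logits_fn(ids)                      # one pass over prefix + block
    # Greedy acceptance rule, for illustration only: draft position i must
    # match the target's prediction at the position just before it.
    preds = logits[len(prefix_ids) - 1:-1].argmax(axis=-1)
    matches = preds == draft_ids
    n_ok = len(draft_ids) if matches.all() else int(matches.argmin())
    return draft_ids[:n_ok]

# Toy check with a stand-in "target model" whose logits always favor token 7.
toy_logits = lambda ids: np.tile(np.eye(16)[7], (len(ids), 1))
print(verify_block(toy_logits, np.array([1, 2]), np.array([7, 7, 3])))  # -> [7 7]
```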
The practical implication: drafting stops being the bottleneck. The draft model paints an entire block in a single parallel forward pass, and the cost of that pass is essentially flat regardless of how many tokens it produces, per the Z Lab project description. The larger model then accepts or rejects the block as a unit. Google's own benchmarks show a 2.29x end-to-end serving speedup on TPU v5p, well ahead of the 1.30x that EAGLE-3, the previous state of the art, delivered in the same test environment.
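A sketch of that loop, with hypothetical function names standing in for the draft and target models; this is the general block-speculation pattern as described above, not DFlash's actual API.

```python
def speculative_decode(draft_block_fn, verify_fn, prompt_ids, k, max_new):
    """Block-level speculative loop: draft K tokens in one parallel pass,
    then let the larger model decide what to keep.

    draft_block_fn(ids, k) -> list of k proposed token ids (one parallel pass)
    verify_fn(ids, block)  -> the accepted portion of the block (one target pass)
    """
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new:
        block = draft_block_fn(out, k)     # cost roughly flat in k, not k steps
        accepted = verify_fn(out, block)
        if not accepted:
            # A real system would fall back to a token sampled from the target
            # model here; the toy loop just takes the first draft token so it
            # always makes progress.
            accepted = block[:1]
        out.extend(accepted)
    return out[:len(prompt_ids) + max_new]

# Toy usage: the draft proposes consecutive integers; the verifier accepts or
# rejects the block as a unit, keeping it only when every token is even.
toy_draft = lambda ids, k: [ids[-1] + 1 + i for i in range(k)]
toy_verify = lambda ids, block: block if all(t % 2 == 0 for t in block) else []
print(speculative_decode(toy_draft, toy_verify, [0], k=4, max_new=6))
```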
DFlash originated in a UCSD paper posted to arXiv in February 2026, where it achieved over 6x lossless acceleration across a range of models and tasks. The paper's core claim was that generating draft tokens in a single forward pass turns drafting from an O(K) sequential operation into an O(1) constant-time one. Google's production implementation on its own silicon is the confirmation that this is not a research curiosity.
The architectural difference matters more than the speedup number. Standard speculative decoding drafts one token, then another, then another; each step depends on the last. DFlash paints the entire draft in parallel. For agentic use cases, where a model needs to produce a long, multi-step plan before a tool call fires, that collapses the drafting stage of the critical path from a sequential pipeline into a single constant-time pass. The gap between a 500ms response and a 2-second response is often not model intelligence; it is how many sequential steps sit on the critical path.
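A back-of-the-envelope latency model makes the critical-path point concrete. The timings below are placeholder assumptions, not measurements; the shape of the comparison is what matters.

```python
def per_block_latency_ms(k, draft_step_ms, block_draft_ms, verify_ms):
    sequential = k * draft_step_ms + verify_ms   # k dependent draft steps, then verify
    parallel = block_draft_ms + verify_ms        # one draft pass regardless of k
    return sequential, parallel

# Placeholder timings only: 5 ms per sequential draft step, 8 ms for one
# block-parallel draft pass, 20 ms for verification of the block.
seq, par = per_block_latency_ms(k=32, draft_step_ms=5, block_draft_ms=8, verify_ms=20)
print(seq, par)  # 180 ms vs 28 ms on the critical path per block
```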
Independent benchmarks on different hardware corroborate the directional claim. A Spheron blog post measured DFlash at roughly 9,000 tokens per second on an H100 GPU, versus approximately 3,600 tokens per second for EAGLE-3. The cost math follows: DFlash reduced cost to about $0.06 per million output tokens, from $0.16 for the comparable EAGLE-3 setup. A separate test on an Apple Silicon M5 Max reached 85 tokens per second, a 3.3x improvement over standard methods.
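Those dollar figures are consistent with simple throughput math. Assuming an hourly H100 price of roughly $2 (an assumption for illustration, not a number from the Spheron post), cost per million output tokens falls straight out of tokens per second and lands near the quoted figures.

```python
def cost_per_million_tokens(hourly_price_usd, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Assumed $2/hour for an H100; throughputs are the figures quoted above.
print(round(cost_per_million_tokens(2.0, 9_000), 2))  # ~0.06 for DFlash
print(round(cost_per_million_tokens(2.0, 3_600), 2))  # ~0.15 for EAGLE-3
```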
The catch is checkpoint availability. DFlash requires fine-tuned diffusion checkpoints — you cannot apply it to any model off the shelf. Checkpoints exist for certain open-source models like Qwen and Llama. No public checkpoint is available for GPT-4 class models. If your production system runs on a frontier closed model, DFlash does not apply today.
What DFlash on TPUs actually signals is harder to summarize than a speedup number. Block diffusion drafting is a genuine architectural shift: the draft is no longer a sequential prediction problem but a parallel generation one. That shift rewrites the cost structure of a wide class of inference workloads, particularly the multi-turn agent loops where sequential drafting has been the tax on response time. Whether that tax disappears for most production systems depends entirely on whether someone builds checkpoints for the models actually running in production.