Local AI developers gained a new option this week. Google DeepMind released DiffusionGemma, a 26-billion-parameter open model that generates text in 256-token blocks rather than one token at a time, according to the company's DeepMind blog announcement, cross-posted on the Google blog. On a single NVIDIA H100, Google reports the model crosses 1,000 tokens per second; on a consumer RTX 5090, the announcement puts throughput at 700+ tokens per second. The release is licensed under Apache 2.0, which puts it in the same permissively licensed tier as the rest of the Gemma family.
The headline number is a Google self-measurement on dedicated GPUs, and the company frames its "up to 4x faster" claim as a vendor benchmark rather than an independent result. Footnotes on the announcement condition the figure on hardware, batch size, and quantization. Treat the throughput as real but specific to Google's own test conditions until third-party benchmarks land.
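To make the reported figures concrete, here is a minimal back-of-envelope sketch that turns them into wall-clock time. It assumes throughput stays constant at the vendor-reported floor values ("crosses 1,000" and "700+"), which real workloads may not match. The 2,000-token response length is a hypothetical example, not a figure from the announcement.

```python
BLOCK_TOKENS = 256  # block size stated in the announcement

# Vendor-reported throughput floors in tokens per second (Google's own test conditions).
REPORTED_TPS = {"H100": 1000, "RTX 5090": 700}


def seconds_for(tokens, tokens_per_second):
    """Wall-clock estimate, assuming throughput stays constant."""
    return tokens / tokens_per_second


def report(response_tokens=2000):
    """Map each GPU to (seconds per 256-token block, seconds per response).

    response_tokens is a hypothetical response length chosen for illustration.
    """
    rows = {}
    for gpu, tps in REPORTED_TPS.items():
        rows[gpu] = (
            round(seconds_for(BLOCK_TOKENS, tps), 3),
            round(seconds_for(response_tokens, tps), 2),
        )
    return rows


for gpu, (per_block, per_response) in report().items():
    print(f"{gpu}: {per_block} s per block, {per_response} s per 2000 tokens")
```

Under those assumptions a single block lands in roughly a quarter of a second on an H100 and a bit over a third of a second on an RTX 5090. Different hardware, batch size, or quantization would shift both numbers.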
What makes the speedup possible is the model architecture. Standard language models generate text autoregressively: each token depends on every previous token, so tokens must be produced one at a time. At low batch sizes, each of those steps has to read the active model weights from memory to emit a single token, which leaves decoding gated by memory bandwidth rather than raw compute. DiffusionGemma, by contrast, fills a 256-token block in parallel using bidirectional attention, then iteratively refines the block until it converges. Each pass does far more arithmetic per byte of weights read, so the bottleneck moves from memory bandwidth to compute. That favors single-user, low-batch-size scenarios on a single accelerator, where an autoregressive model would leave much of the GPU's compute idle. In a high-QPS cloud serving context, where autoregressive models can be batched to saturate hardware, the diffusion advantage shrinks or even inverts.
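The difference in sequential work can be illustrated with a toy sketch. This is not DiffusionGemma's algorithm. `toy_denoiser` is a hypothetical stand-in that already knows the target tokens and invents a confidence score, and `commit_per_pass=16` is an arbitrary choice. The sketch also only commits tokens and never revises them, so it is simpler than the refinement the announcement describes. All it shows is how filling a block in parallel passes needs fewer sequential model calls than emitting one token per call.

```python
MASK = None  # marks a position that has not been committed yet


def toy_denoiser(block, target):
    """Hypothetical stand-in for one parallel forward pass over the block.

    Returns a (token, confidence) guess for every position at once. The
    confidence is made up: a position scores higher when its left or right
    neighbour is already committed, mimicking bidirectional context.
    """
    guesses = []
    for i, token in enumerate(target):
        left = i > 0 and block[i - 1] is not MASK
        right = i + 1 < len(block) and block[i + 1] is not MASK
        guesses.append((token, 0.5 + 0.25 * left + 0.25 * right))
    return guesses


def decode_block(target, commit_per_pass):
    """Fill one block by repeated parallel passes; returns (block, passes)."""
    block = [MASK] * len(target)
    passes = 0
    while any(tok is MASK for tok in block):
        guesses = toy_denoiser(block, target)
        passes += 1
        open_positions = [i for i, tok in enumerate(block) if tok is MASK]
        # Commit the most confident guesses; the rest stay masked for the next pass.
        open_positions.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in open_positions[:commit_per_pass]:
            block[i] = guesses[i][0]
    return block, passes


def autoregressive_passes(n_tokens):
    """One sequential forward pass per generated token."""
    return n_tokens


target = list(range(256))  # stand-in for a 256-token block
block, passes = decode_block(target, commit_per_pass=16)
print(passes, autoregressive_passes(len(target)))
```

With these toy settings the block is filled in 16 sequential passes, against 256 for token-by-token decoding. Each parallel pass is more expensive than a single autoregressive step, which is why the trade only pays off when spare compute is available.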
DiffusionGemma is a Mixture-of-Experts model with 26 billion total parameters and 3.8 billion active per forward pass. Quantized, it fits within roughly 18 GB of VRAM, putting it within reach of a high-end consumer GPU. Google is explicit that output quality is below the standard Gemma 4 autoregressive model and recommends AR Gemma 4 for production deployments where quality is the priority. DiffusionGemma is positioned for speed-critical, local, interactive workflows rather than as a drop-in replacement.
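A rough weights-only estimate helps put the 18 GB figure in context. The sketch below assumes every parameter is stored at a uniform bit width and counts decimal gigabytes. The announcement does not say which precision produces the 18 GB figure, so the bit widths here are illustrative assumptions, and the estimate ignores activations, caches, and any layers kept at higher precision.

```python
TOTAL_PARAMS_B = 26.0   # total parameters in billions (from the announcement)
ACTIVE_PARAMS_B = 3.8   # active parameters per forward pass, in billions


def weight_gb(params_billion, bits_per_param):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes).

    Assumes a uniform bit width for every parameter, which is a simplification.
    """
    return params_billion * bits_per_param / 8


def active_fraction():
    """Share of the total parameters used in a single forward pass."""
    return ACTIVE_PARAMS_B / TOTAL_PARAMS_B


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gb(TOTAL_PARAMS_B, bits):.1f} GB")
print(f"active share per pass: {active_fraction():.1%}")
```

At 16 bits the weights alone come to 52 GB, at 8 bits 26 GB, and at 4 bits 13 GB, so only the 4-bit case leaves room under an 18 GB budget. All 26 billion parameters generally need to be resident even though only about 14.6% are active per pass, which is why memory use tracks the total count while per-pass compute tracks the active count.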
Where the design pays off is in tasks that benefit from bidirectional context. Code infilling, where the model sees the surrounding code on both sides of a cursor, is a natural fit. So is in-line editing of existing prose, where the model can rewrite a passage while conditioning on what comes before and after. Non-linear text structures, including amino acid sequences and mathematical graphs, are further cases where bidirectional attention helps with dependencies that left-to-right generation handles poorly. The block-wise self-refinement matters too: the model can revise its own output within a block rather than being locked into its first guess.
For developers deciding whether to integrate, the tooling story is concrete. NVIDIA provides NVFP4 kernels targeting Hopper and Blackwell data center cards along with RTX 5090 and 4090 consumer cards. Red Hat and vLLM have integration posts. Unsloth ships a fine-tuning path with a Sudoku fine-tune demo. Hugging Face has Transformers support and a text-to-3D SVG demo. MLX is in the partner list. llama.cpp support is listed as coming soon. The breadth of partner integration lowers the barrier to running the model on existing developer hardware, but it does not change the underlying trade-off: if the priority is maximum quality in a production setting, Gemma 4 autoregressive remains the better choice.
The architecture draws on prior Gemini Diffusion research, and the release credits Research Scientists Brendan O'Donoghue and Sebastian Flennerhag. What is genuinely new for local developers is the combination of permissive licensing, single-GPU throughput, and a non-autoregressive architecture that supports tasks AR models handle poorly. The honest read is that DiffusionGemma expands what a solo developer or small team can do on consumer hardware, while leaving the high-quality production lane to standard Gemma 4.