DiffusionGemma lands as an open-weight Apache 2 model, free to run on NVIDIA NIM

DiffusionGemma is runnable today, and you can measure it yourself in under five minutes.

The artifact is google/diffusiongemma-26B-A4B-it on Hugging Face, released on 10 June 2026 under the Apache 2.0 license, per Simon Willison's link post on the model and Google's own launch blog. The "A4B" suffix signals a Mixture-of-Experts configuration. DiffusionGemma ships with 25.2 billion total parameters, of which 3.8 billion are active per token: 8 routed experts are selected from 128, plus one shared expert. That distinction matters for the "26B" headline number that will travel with this release, and for what it actually costs to fine-tune and serve.
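To make the total-versus-active distinction concrete, here is a rough back-of-the-envelope sketch using only the parameter and expert counts quoted above. The helper names are illustrative, and the memory estimate is an assumption-laden simplification: it counts weights only and ignores activations, caches, and runtime overhead.

```python
# Figures quoted in the article; names here are illustrative, not from any library.
TOTAL_PARAMS_B = 25.2        # billions of parameters in the checkpoint
ACTIVE_PARAMS_B = 3.8        # billions of parameters used per token
ROUTED_EXPERTS_TOTAL = 128
ROUTED_EXPERTS_ACTIVE = 8
SHARED_EXPERTS = 1


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that participate in one token."""
    return active_b / total_b


def weight_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Weight-only footprint in GB (1e9 bytes); ignores activations and caches."""
    return params_b * bytes_per_param


frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
routed_share = ROUTED_EXPERTS_ACTIVE / ROUTED_EXPERTS_TOTAL

print(f"active share of parameters: {frac:.1%}")
print(f"routed experts used per token: {routed_share:.2%} (+{SHARED_EXPERTS} shared)")
# Every expert must be resident to serve the model, so memory follows the total.
print(f"weights at 1 byte/param: {weight_memory_gb(TOTAL_PARAMS_B, 1):.1f} GB")
print(f"weights at 2 bytes/param: {weight_memory_gb(TOTAL_PARAMS_B, 2):.1f} GB")
```

The upshot, under these assumptions, is that per-token compute tracks the 3.8 billion active parameters while memory tracks the full 25.2 billion, which is why neither the headline number nor the "A4B" suffix alone describes the serving cost.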

The model descends from Google's experimental Gemini Diffusion preview from May 2025, which Willison previously measured at 857 tokens per second. That preview was never re-announced as a Google product line, and Google has not committed to a sustained Gemini Diffusion release cadence in the material Willison cites. DiffusionGemma is the open-weight continuation of that experiment, not evidence that diffusion language models are replacing autoregressive ones. Google explicitly states that DiffusionGemma's output quality is lower than standard Gemma 4 and recommends standard Gemma 4 for production-quality applications.

The real signal is that NVIDIA is hosting DiffusionGemma for free on its NIM cloud API at launch, so the model is reachable without a self-hosted GPU. Google claims 1,000+ tokens per second on a single NVIDIA H100 (FP8) and 700+ tokens per second on an NVIDIA GeForce RTX 5090. The concrete proof point is the reproducer. Willison's timed run against the NIM endpoint, invoked as time uv run generate.py, rendered a pelican in 2,409 tokens and returned in roughly 4.4 seconds, or about 500 tokens per second. That is materially slower than the 857 tok/s he measured for the May 2025 preview on the same pelican test, and also below Google's own claimed 1,000+ tok/s on H100, but it is the first published throughput number attached to an actually downloadable, Apache 2-licensed 25.2B-class diffusion language model, measured against a real endpoint.
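The throughput comparison is plain division, sketched below with the figures quoted in this article. Dividing 2,409 tokens by 4.4 seconds lands a little above the rounded "about 500" used in the text; both inputs are approximate, and a wall-clock time taken with the shell's time command presumably also includes process startup and network latency, so this is best read as an end-to-end figure. The function name and labels are illustrative.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """End-to-end throughput: tokens returned divided by wall-clock seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Numbers as quoted in the article.
observed = tokens_per_second(2409, 4.4)

# Reference points; the Google figures are stated as lower bounds ("1,000+", "700+").
references = {
    "May 2025 Gemini Diffusion preview (Willison)": 857.0,
    "Google claim, single H100 FP8 (lower bound)": 1000.0,
    "Google claim, RTX 5090 (lower bound)": 700.0,
}

print(f"observed: {observed:.1f} tok/s")
for label, ref in references.items():
    print(f"{label}: observed is {observed / ref:.0%} of {ref:.0f} tok/s")
```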

Three caveats travel with that number. The roughly 500 tok/s figure comes from a single short generation on one NIM endpoint, and throughput will vary with prompt length, region, and load; it should not be read as a vendor benchmark. The pelican render establishes that the model can produce a pelican in 4.4 seconds, not that it is production-ready for any specific downstream task — Google itself rates quality below standard Gemma 4. And while Google's first-party blog post confirms the release, a reporter chasing any "first open-weight diffusion LLM at this size" framing should confirm against the Hugging Face model card benchmarks before publication.
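One way to address the single-run caveat is to repeat the measurement across several prompts and report a spread rather than one number. The sketch below is a generic harness, not Willison's script: generate is a hypothetical callable that sends a prompt and returns the number of tokens produced, and fake_generate is a stand-in with made-up behavior so the example runs offline. A real client call would replace it.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_throughput(
    generate: Callable[[str], int],
    prompts: Iterable[str],
    runs_per_prompt: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> list[float]:
    """Return one tokens-per-second sample for each (prompt, run) pair."""
    samples = []
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            start = clock()
            tokens = generate(prompt)
            elapsed = clock() - start
            if elapsed > 0:  # skip degenerate timings
                samples.append(tokens / elapsed)
    return samples


def summarize(samples: list[float]) -> dict:
    """Report the spread, since a single run hides variance."""
    return {
        "n": len(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def fake_generate(prompt: str) -> int:
    """Hypothetical stand-in for a real endpoint call; the token count is made up."""
    time.sleep(0.01)
    return 10 * len(prompt.split())


demo = measure_throughput(
    fake_generate, ["draw a pelican", "explain mixture of experts"]
)
print(summarize(demo))
```

Varying prompt length and running at different times of day would exercise the load and prompt-length effects the caveat mentions; the injectable clock parameter exists only to make the harness testable.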

What to watch next: whether other practitioners reproduce the throughput on different prompts and hardware, and how the MoE configuration changes fine-tuning economics for teams used to dense 7B or 13B checkpoints. The numbers and the license are real, and the model is downloadable now. The broader "diffusion is back" thesis is not yet supported by what is actually on the page.

Sources