For most of the past two years, multimodal AI has meant a stack: a vision encoder here, an audio encoder there, and a language model sitting on top stitching it all together. Google is now betting that the stack is the problem.
Gemma 4 12B, released June 3 via the Google Developers Blog, collapses the encoder stack into a single decoder-only transformer. The same backbone that handles text now eats raw 48×48 vision patches and 40-millisecond audio frames at 16 kHz, with no separate vision or audio tower in front of it. That is a concrete architectural choice, not a launch slide, and it explains every other decision in the release: a smaller memory footprint, lower multimodal latency, audio on a mid-sized model for the first time, and a deployment target of laptops with 16GB of VRAM or unified memory.
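To make those input units concrete, here is a minimal arithmetic sketch of how many raw values sit in one patch or frame, and how many sequence positions an image or clip would occupy. The patch size, frame length, and sample rate come from the release as described above; the three colour channels, the non-overlapping tiling, and the helper names are assumptions made for illustration, and the real model may pool or merge positions differently.

```python
# Sizes quoted in the article: 48x48 vision patches, 40 ms audio frames at 16 kHz.
PATCH_SIDE = 48
FRAME_MS = 40
SAMPLE_RATE_HZ = 16_000

# Assumption for illustration only: RGB input, non-overlapping patches and frames.
CHANNELS = 3


def values_per_patch() -> int:
    """Raw numbers in one patch (height x width x channels)."""
    return PATCH_SIDE * PATCH_SIDE * CHANNELS


def samples_per_frame() -> int:
    """Raw audio samples in one 40 ms frame."""
    return SAMPLE_RATE_HZ * FRAME_MS // 1000


def image_positions(height: int, width: int) -> int:
    """Patches needed to tile an image, rounding partial patches up."""
    rows = -(-height // PATCH_SIDE)  # ceiling division
    cols = -(-width // PATCH_SIDE)
    return rows * cols


def audio_positions(seconds: float) -> int:
    """Frames needed to cover a clip, rounding a partial frame up."""
    total_samples = round(seconds * SAMPLE_RATE_HZ)
    return -(-total_samples // samples_per_frame())
```

Under these assumptions one frame holds 640 samples, one patch holds 6,912 values, a 960×960 image tiles into 400 positions, and a ten-second clip into 250.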
The shift matters because it inverts the conventional design. Other multimodal systems typically pretrain a specialist encoder on its modality, freeze it, and feed its outputs into the language model. Google's earlier Gemma 4 sizes followed that playbook, with dedicated 150M and 550M vision encoders and a 300M audio encoder bolted on. Each encoder buys representational depth in its domain. The price is memory, latency, and the engineering cost of keeping the towers in sync with the underlying LLM.
Gemma 4 12B's authors argue in the developer guide that the encoder layer is doing less work than it appears to, at least at this scale. The guide is credited to research engineers André Susano Pinto, Andreas Steiner, and Karolis Misiunas, research scientists Karsten Roth and Michael Tschannen, and member of technical staff Omar Sanseviero. By projecting pixels and audio samples directly into the LLM's input space, the model skips a stage of inference and the memory cost that comes with it. The result, according to Google, is a model that fits on consumer hardware and can serve multimodal requests with measurably lower latency.
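What "projecting directly into the input space" could look like is sketched below. This is not Google's implementation: the article does not say how the projection is built, so a single linear map per modality is assumed here, and the embedding width, vocabulary size, modality ordering, and every function name are hypothetical. Position information is left out entirely.

```python
import random
from typing import List

D_MODEL = 8                 # hypothetical embedding width, kept tiny on purpose
PATCH_DIM = 48 * 48 * 3     # one 48x48 patch, assuming 3 colour channels
FRAME_DIM = 640             # 40 ms of audio at 16 kHz
VOCAB = 100                 # hypothetical vocabulary size

Vector = List[float]
Matrix = List[Vector]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Random weights standing in for parameters that would be learned."""
    return [[rng.uniform(-0.01, 0.01) for _ in range(cols)] for _ in range(rows)]


def project(weights: Matrix, raw: Vector) -> Vector:
    """Plain matrix-vector product: raw input -> one D_MODEL-wide embedding."""
    return [sum(w * x for w, x in zip(row, raw)) for row in weights]


def build_sequence(text_ids, patches, frames, embed, w_patch, w_audio) -> Matrix:
    """Concatenate text, image, and audio embeddings into one decoder input."""
    seq = [embed[t] for t in text_ids]
    seq += [project(w_patch, p) for p in patches]
    seq += [project(w_audio, f) for f in frames]
    return seq


rng = random.Random(0)
embed = random_matrix(VOCAB, D_MODEL, rng)
w_patch = random_matrix(D_MODEL, PATCH_DIM, rng)
w_audio = random_matrix(D_MODEL, FRAME_DIM, rng)

patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(4)]
frames = [[rng.uniform(-1, 1) for _ in range(FRAME_DIM)] for _ in range(5)]
sequence = build_sequence([1, 2, 3], patches, frames, embed, w_patch, w_audio)
print(len(sequence), len(sequence[0]))  # 12 positions, each 8 wide
```

The point of the sketch is the shape of the data: three text tokens, four patches, and five audio frames all end up as rows of the same width in one sequence, with no separate tower producing intermediate features.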
The most concrete consequence is the one developers can actually verify: audio. Every prior Gemma model that handled audio did so at the small or edge end of the family, the kind of model that runs on a phone. Gemma 4 12B is the first medium-sized Gemma that ingests audio natively, through the same backbone that handles text. For builders working on voice agents, screen-aware assistants, or anything that needs a model to hear and see at the same time, that is a new capability on hardware they already own.
Google is also shipping a dedicated multi-token prediction, or MTP, companion model alongside Gemma 4 12B. MTP is a recent trick: instead of generating one token at a time, the model proposes several tokens per step and verifies them in parallel, which can dramatically increase tokens per second on local hardware. Shipping it as a separate model, rather than baking it into the base, is a tell. MTP exists because collapsing the encoders saves memory but raises the per-token compute cost on the unified backbone, especially for long multimodal prompts. The companion is a deliberate response to that tradeoff.
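The propose-then-verify idea can be shown with a toy loop. The article does not describe how Google's MTP companion works internally, so what follows is a generic greedy draft-and-verify sketch, not the shipped method: `target_next` and `draft_next` are hypothetical stand-ins for a large model and a cheap drafter, and the verification that a real system would run as one batched forward pass is simulated here with an ordinary loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy 'large model': a deterministic rule standing in for greedy decoding."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy 'drafter': matches the target except when len(ctx) is a multiple of 4."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def draft_and_verify(target: NextToken, draft: NextToken, prompt: List[int],
                     n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Return the generated tokens and the number of verification steps used."""
    out = list(prompt)
    goal = len(prompt) + n_new
    steps = 0
    while len(out) < goal:
        # 1. The drafter proposes up to k tokens, one after another (cheap).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position. Counted as one step
        #    because a real system would score all positions in parallel.
        steps += 1
        rejected = False
        for tok in proposal:
            expected = target(out)
            if expected == tok:
                out.append(tok)        # accepted draft token
            else:
                out.append(expected)   # first mismatch: keep the target's token
                rejected = True
                break
        # 3. If every draft token was accepted, the same step yields one more.
        if not rejected and len(out) < goal:
            out.append(target(out))
    return out, steps
```

By construction the output is identical to plain greedy decoding with the target alone; the gain is that several tokens can be committed per verification step whenever the drafter guesses correctly. How much of that gain survives on real hardware depends on the drafter's acceptance rate, which this toy does not model.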
The new macOS desktop app is the second tell. A downloadable consumer app for a 12B-parameter model only makes sense if the model is small and self-contained enough to run interactively on a MacBook. The developer guide explicitly targets laptops with 16GB of VRAM or unified memory, which is the kind of hardware a working developer is likely to have on their desk.
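A back-of-the-envelope check shows what the 16GB target implies. The sketch below counts weights only and treats "12B" as exactly 12 billion parameters; both are simplifications, and the article does not say which numeric precision Google ships or recommends. The KV cache, activations, and the operating system's own memory are ignored.

```python
PARAMS = 12e9    # "12B" from the model name; the exact count is not in the article
BUDGET_GB = 16   # laptop target quoted from the developer guide


def weight_gb(params: float, bits_per_param: int) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    gb = weight_gb(PARAMS, bits)
    verdict = "fits" if gb < BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB -> {verdict} in {BUDGET_GB} GB")
```

At 16 bits per parameter the weights alone come to 24 GB, so a 16GB machine would need a lower-precision build: roughly 12 GB at 8 bits or 6 GB at 4 bits, before any working memory is counted.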
The honest counter-frame is that a unified backbone is not free. A frozen vision encoder with hundreds of millions of parameters, trained on a curated image corpus, can extract features a decoder-only transformer has to learn on the fly inside its own hidden states. The same applies to audio: a 300M audio encoder tuned for speech or environmental sound sets a representational bar the unified model has to clear without that dedicated weight budget. Google's argument is that at 12B parameters, a single backbone can absorb the encoder's role. The benchmark the field actually cares about is whether the unified model matches or beats its encoder-stacked siblings on the tasks where encoders traditionally shine: fine-grained document understanding, optical character recognition in the wild, and noisy multilingual speech.
That comparison is not in the developer guide. The release is a launch, not a bake-off. The guide links to the model card and evaluation suite, but the numbers a buyer would want, specifically Gemma 4 12B versus an encoder-stacked multimodal peer on the same hardware, measured in tokens per second on a 16GB MacBook, will have to come from independent runs.
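For anyone planning such an independent run, a throughput measurement can be as simple as the sketch below. It assumes a hypothetical `generate(prompt, max_new_tokens)` callable that returns the generated token IDs; that signature is not any particular runtime's API, and `fake_generate` exists only so the example runs without a model.

```python
import time
from typing import Callable, List


def tokens_per_second(generate: Callable[[str, int], List[int]], prompt: str,
                      max_new_tokens: int, runs: int = 3) -> float:
    """Median generation throughput over several runs, after one warm-up call."""
    generate(prompt, max_new_tokens)  # warm-up: caches, lazy initialisation
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    rates.sort()
    return rates[len(rates) // 2]


def fake_generate(prompt: str, max_new_tokens: int) -> List[int]:
    """Stand-in for a real runtime call; sleeps to imitate per-token latency."""
    out = []
    for i in range(max_new_tokens):
        time.sleep(0.001)
        out.append(i)
    return out


rate = tokens_per_second(fake_generate, "describe this image", 20)
print(f"{rate:.0f} tokens/s")
```

A fair comparison would hold the prompt, output length, precision, and machine fixed across both models, and would time the multimodal prompt processing separately from token generation, since the two stress the hardware differently.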
Three things are worth watching. First, whether third-party benchmarks on standard multimodal suites, like MMMU for vision reasoning or AIR-Bench for audio, place the unified model inside the margin of its encoder-stacked cousins or clearly behind them. Second, whether the MTP companion ships as a clean, drop-in addition to the open-weights release, or whether the inference speed it promises is gated behind Google's own serving stack. Third, whether the encoder-free pattern stays confined to 12B or moves up and down the Gemma family. If the larger Gemma 4 sizes adopt the same design, the encoder stack is on its way out. If they keep the frozen encoders, the team is making a measured bet that only pays at the size where latency and memory dominate over raw feature quality.
For now, the practical story is the one in the developer guide. A single transformer is doing what three used to do. That is the reason this model runs on a MacBook, the reason audio is finally a first-class input on a mid-tier Gemma, and the reason the MTP companion exists. The open question is whether the encoder-stacked designs that came before it, including the 150M and 550M vision encoders and the 300M audio encoder in larger Gemma 4 sizes, are now an inheritance the field is going to shed, or a depth tax worth paying for another generation.