Zyphra's Zamba2-VL puts a Mamba2 backbone behind an open vision-language model



Zyphra has released Zamba2-VL, an open vision-language model family that swaps the dense Transformer language backbone used in most open VLMs for a hybrid Mamba2 state-space model interleaved with a small number of shared transformer attention blocks, each carrying its own LoRA adapter. The architecture, not the latency number in the press kit, is the part worth attention. It gives developers a reproducible path to lower first-token latency on multimodal workloads without abandoning the in-context recall that pure state-space models lose.

The family ships in three sizes, 1.2B, 2.7B, and 7B parameters, with weights publicly hosted on Hugging Face under Apache 2.0 (Zyphra's official Zamba2-VL page; HF model card for Zamba2-VL-2.7B). On the language side, Zamba2 interleaves Mamba2 sequence-mixing layers with shared attention blocks, a design previously explored in Zyphra's text-only Zamba2 work and now extended to vision-language. Each shared block carries a unique LoRA adapter, which lets the model retain the cross-token recall of attention at a fraction of the dense-Transformer parameter and compute cost. The vision side is a deliberate borrowing: the team uses Qwen2.5-VL's Vision Transformer, picked for its 2D rotary position embeddings and native dynamic-resolution processing, and connects it to the language model through a two-layer MLP projector in a LLaVA-style template.
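The interleaving described above can be sketched as a layer schedule. This is a minimal illustration, not Zyphra's implementation: the names `LayerSpec` and `build_layout`, the spacing `attention_every`, and the number of shared weight sets are all hypothetical, and the sketch assumes that each call site of a shared attention block gets its own LoRA adapter while the underlying attention weights are reused.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LayerSpec:
    kind: str                    # "mamba2" or "shared_attention"
    shared_block: Optional[int]  # index of the reused attention weight set (None for Mamba2)
    lora_id: Optional[int]       # adapter unique to this call site (None for Mamba2)


def build_layout(num_mamba_layers: int,
                 attention_every: int,
                 num_shared_blocks: int = 2) -> List[LayerSpec]:
    """Interleave Mamba2 layers with calls into a small pool of shared attention blocks.

    All three parameters are illustrative assumptions; the real model's
    spacing and block count are not taken from the article.
    """
    layout: List[LayerSpec] = []
    call_site = 0
    for i in range(1, num_mamba_layers + 1):
        layout.append(LayerSpec("mamba2", None, None))
        if i % attention_every == 0:
            # Reuse one of the shared weight sets, but give this call site its own LoRA.
            layout.append(LayerSpec("shared_attention",
                                    call_site % num_shared_blocks,
                                    call_site))
            call_site += 1
    return layout


if __name__ == "__main__":
    for idx, spec in enumerate(build_layout(num_mamba_layers=12, attention_every=3)):
        print(idx, spec)
```

The point of the sketch is the accounting: the attention call sites outnumber the distinct attention weight sets, so the full attention parameters are paid for only `num_shared_blocks` times, and only the small adapters grow with depth.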

Zyphra's headline claim is that Zamba2-VL cuts time-to-first-token by roughly an order of magnitude compared to comparable Transformer-based VLMs at matched scale (Zyphra's Zamba2-VL release page; MarkTechPost recap, 2026-06-12). That number is a vendor claim measured on Zyphra's own VLMEvalKit-based harness, not an independent measurement, and the actual gain depends on hardware, precision, and sequence length. A CUDA GPU is required for the optimized Mamba2 kernels, and Mamba2's linear-time sequence mixing does not make inference free. What it buys is cheaper prefill on long multimodal contexts, which is exactly the regime where first-token latency hurts user-facing applications.
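For anyone who wants to check a time-to-first-token figure on their own hardware, the measurement itself is simple. The sketch below is a generic timing helper, not Zyphra's harness: `time_to_first_token` and `fake_stream` are hypothetical names, and the fake stream merely sleeps to stand in for prefill. In practice the factory would wrap whatever streaming generation call the serving stack exposes.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_token(stream_factory: Callable[[], Iterable[str]],
                        clock: Callable[[], float] = time.perf_counter
                        ) -> Tuple[float, str]:
    """Return (seconds until the first token arrives, that token).

    The clock starts before the stream is created, so prefill is included.
    Raises StopIteration if the stream yields nothing.
    """
    start = clock()
    stream = iter(stream_factory())
    first = next(stream)
    elapsed = clock() - start
    return elapsed, first


def fake_stream(prefill_seconds: float = 0.01) -> Iterator[str]:
    """Stand-in for a model: sleeps to imitate prefill, then yields tokens."""
    time.sleep(prefill_seconds)
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    ttft, token = time_to_first_token(fake_stream)
    print(f"first token {token!r} after {ttft * 1000:.1f} ms")
```

A fair comparison would repeat the call several times after a warm-up run and hold hardware, precision, and prompt length (including image resolution) fixed across the models being compared.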

The capability surface tracks what developers have come to expect from a modern open VLM: single- and multi-image understanding plus grounding, trained on 100B tokens of vision-text and pure-text data drawn from open web datasets with the Mistral v0.1 tokenizer (Zyphra's Zamba2-VL page). The architectural report is on arXiv as 2606.00390, with authors Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, and Beren Millidge.

The benchmark picture, drawn from the same VLMEvalKit-based harness, is more interesting than the latency line. The 2.7B model is genuinely strong on counting and document tasks at its scale. On PixMoCount it posts 82.5, against InternVL3.5-2B at 32.8 and Qwen3-VL-2B at 55.7. On DocVQA it lands at 90.9, just behind Qwen3-VL-2B's 93.3 and ahead of InternVL3.5-2B's 89.4. ChartQA comes in at 79.6, comparable to InternVL3.5-2B's 81.6 and Qwen3-VL-2B's 78.7 (Zyphra's release page). The 1.2B variant follows the same shape, posting 62.5 on PixMoCount against InternVL3.5-1B at 32.8.

The weakness is on knowledge-heavy reasoning, where the smaller parameter budget shows. MMMU val comes in at 37.7 for the 2.7B, well behind InternVL3.5-2B's 49.9 and Qwen3-VL-4B's 51.4. MathVista mini is 51.0 against InternVL3.5-2B's 61.4. OCRBench posts 73.6, trailing Qwen3-VL-2B's 84.1 and InternVL3.5-2B's 83.4. The pattern is consistent: Zamba2-VL trades raw knowledge recall for inference economics, which is a reasonable tradeoff for a deployment-targeted 2.7B but not a free lunch. The 7B extends the same architectural pattern, and the Zyphra release page places it broadly in the same competitive band as larger Transformer baselines on the harness, though the per-benchmark numbers for the 7B have not been published separately in the available materials.
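The split between the two groups of benchmarks is easier to see as margins. The sketch below only restates the 2.7B figures quoted in this article and subtracts the best competing score on each benchmark; the `gaps` helper is hypothetical, and the Qwen3-VL-4B figure for MMMU is left out because it comes from a larger model than the other baselines.

```python
# Scores as quoted in the article (vendor-reported, same harness).
SCORES = {
    "PixMoCount":     {"Zamba2-VL-2.7B": 82.5, "InternVL3.5-2B": 32.8, "Qwen3-VL-2B": 55.7},
    "DocVQA":         {"Zamba2-VL-2.7B": 90.9, "InternVL3.5-2B": 89.4, "Qwen3-VL-2B": 93.3},
    "ChartQA":        {"Zamba2-VL-2.7B": 79.6, "InternVL3.5-2B": 81.6, "Qwen3-VL-2B": 78.7},
    "MMMU val":       {"Zamba2-VL-2.7B": 37.7, "InternVL3.5-2B": 49.9},
    "MathVista mini": {"Zamba2-VL-2.7B": 51.0, "InternVL3.5-2B": 61.4},
    "OCRBench":       {"Zamba2-VL-2.7B": 73.6, "InternVL3.5-2B": 83.4, "Qwen3-VL-2B": 84.1},
}


def gaps(scores, model="Zamba2-VL-2.7B"):
    """Margin of `model` over the best other listed model, per benchmark."""
    out = {}
    for bench, row in scores.items():
        best_other = max(v for name, v in row.items() if name != model)
        out[bench] = round(row[model] - best_other, 1)
    return out


if __name__ == "__main__":
    for bench, margin in gaps(SCORES).items():
        print(f"{bench:15s} {margin:+.1f}")
```

On these numbers the counting lead is large, the document and chart tasks sit within a few points of the best 2B baseline, and the three knowledge- and OCR-heavy benchmarks trail by roughly ten points or more.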

Inference setup is the practical gate. Zyphra pins its transformers fork at v4.57.1 with qwen-vl-utils 0.0.2 and flash_attn, and ships Zyphra-maintained forks of causal-conv1d and mamba-ssm that have to be built from source for the optimized Mamba2 kernels (HF model card for Zamba2-VL-2.7B). That is real friction compared with a pip install of a stock dense-Transformer VLM, and it limits where the latency claim can be observed. CPU and edge-device numbers should be read as limited to the 1.2B and 2.7B band until the optimized kernels are ported or replaced.
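Before chasing a latency number, it can help to confirm that the environment actually matches those pins. The following is a minimal sketch using the standard library's `importlib.metadata`. The distribution names in `PINS` are assumptions about how the packages register themselves, and the Zyphra forks may report a version string with a suffix, so the model card remains the authority on exact versions.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# Versions quoted in the article; None means "must be present, version not pinned here".
# Distribution names are assumptions and may differ for the Zyphra forks.
PINS: Dict[str, Optional[str]] = {
    "transformers": "4.57.1",
    "qwen-vl-utils": "0.0.2",
    "flash_attn": None,
    "causal-conv1d": None,
    "mamba-ssm": None,
}


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def check_environment(pins: Dict[str, Optional[str]] = PINS,
                      lookup: Callable[[str], Optional[str]] = installed_version
                      ) -> Dict[str, str]:
    """Report 'ok', 'missing', or 'mismatch' for each pinned distribution."""
    report: Dict[str, str] = {}
    for dist, wanted in pins.items():
        found = lookup(dist)
        if found is None:
            report[dist] = "missing"
        elif wanted is not None and found.split("+")[0] != wanted:
            # A local label after '+' is ignored; any other difference is flagged.
            report[dist] = f"mismatch: found {found}, expected {wanted}"
        else:
            report[dist] = f"ok ({found})"
    return report


if __name__ == "__main__":
    for dist, status in check_environment().items():
        print(f"{dist:15s} {status}")
```

The check only inspects package metadata. It does not verify that the from-source kernel builds succeeded or that a CUDA device is visible, both of which still need to be tested by loading the model.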

What to watch next: independent reproduction of the TTFT number on matched hardware and matched precision, and the full 7B benchmark row. The architectural question is the durable one. For two years the open VLM story has been "another dense Transformer, slightly larger." Zamba2-VL is the first widely available open-weights VLM where the language half is a state-space hybrid, and the shared-attention plus per-layer LoRA pattern is the design choice that lets a hybrid Mamba2 backbone do in-context retrieval rather than merely approximate it.


Sources