The 87 percent warehouse problem: what Google's Gemini Embedding 2 launch got right and what it overstated
Google's Gemini Embedding 2 reached general availability last week, and the announcement is exactly the kind of thing that gets dismissed as an API update — until you look at what the early customers are doing with it.
This is not a benchmark story. Google published text, image, and video task numbers, and those are fine, but they do not tell you why a logistics engineer at a clothing rental company or a legal research platform cares. What matters is what the numbers mean for pipelines that already exist in production.
The multimodal RAG problem nobody talks about
Most embedding models are text-first with images bolted on as an afterthought. You embed the image, you embed the text, they live in different spaces, and you try to fix the mismatch downstream. That is a brittle pipeline that breaks the moment a query crosses modalities — "find the product matching this warehouse floor photo" is a text-plus-image query against a catalog that might be pure text, pure image, or some hybrid.
Gemini Embedding 2 is built differently. It processes text, images, video, audio, and PDFs in a single call — up to 8,192 tokens, six images, 120 seconds of video, 180 seconds of audio, and six PDF pages per request — and maps all of it into one unified embedding space. That eliminates the cross-modal matching problem at the infrastructure layer instead of patching it at the application layer.
The practical implication: a single retrieval pipeline can handle queries that would previously have required separate text and image indexes, separate ranking logic, and glue code to merge results. The consolidation matters for any team running vector search at scale.
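A minimal sketch of what that consolidation can look like. The `embed_fn` callable here is a stand-in for whatever client call produces a vector in the model's shared space; the function and index names are illustrative, not the actual SDK, and a production system would use a vector database rather than an in-memory matrix.

```python
import numpy as np

def search(embed_fn, index_vectors, index_ids, query_text, query_image=None, top_k=20):
    """One retrieval path for text, image, or mixed queries.

    index_vectors holds catalog embeddings produced by the same embed_fn at
    ingestion time, pre-normalized and stacked into an (N, D) matrix.
    """
    # One embedding call handles the cross-modal query; no separate image index.
    query_vec = np.asarray(embed_fn(text=query_text, image=query_image), dtype=np.float32)
    query_vec = query_vec / np.linalg.norm(query_vec)

    # Cosine similarity against the single unified index.
    scores = index_vectors @ query_vec
    top = np.argsort(-scores)[:top_k]
    return [(index_ids[i], float(scores[i])) for i in top]
```

The point is structural: the "warehouse floor photo plus text description" query and the plain text query go through the same function, the same index, and the same ranking logic.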
The task prefix trick is the real API design
Buried in the announcement is a detail that most coverage will miss: the model supports task-type prefixes that you apply at both index time and query time.
```python
def prepare_query(query):
    # Prefix the query with the task type so the model treats it as a retrieval query.
    return f"task: question answering | query: {query}"

def prepare_document(content, title=None):
    # Prefix documents with a title field; fall back to "none" when no title exists.
    return f"title: {title or 'none'} | text: {content}"
```
This is not new — semantic search practitioners have used query expansion tricks for years — but the fact that Google is baking it into the embedding model as a first-class API pattern suggests the model is explicitly trained to treat these prefixes as routing signals. Apply the same prefix family at both ends of your retrieval pipeline and you get meaningfully better matches than raw semantic similarity alone. The task prefix bridges the length and formality gap between a short user query and a long document.
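To make the symmetry concrete, here is a hedged sketch of the pattern applied end to end, reusing the helpers above. The `embed` function is a placeholder for the actual embedding request, and the document contents are invented for illustration.

```python
def embed(text):
    # Placeholder for the real embedding call; returns a dummy vector so the
    # example runs. In practice this is where the API request goes.
    return [0.0] * 3072

# Index time: every document gets the same prefix family before embedding.
docs = [
    {"id": "case-001", "title": "Smith v. Jones", "text": "The court held that..."},
    {"id": "case-002", "title": None, "text": "Returned garments are photographed and..."},
]
doc_vectors = [embed(prepare_document(d["text"], d["title"])) for d in docs]

# Query time: the short user query gets the matching task prefix, which the
# model is trained to treat as a routing signal rather than literal content.
query_vector = embed(prepare_query("what did the court decide about liability?"))
```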
This is the pattern that Harvey and Nuuly are relying on, even if they do not describe it this way.
What the production numbers actually mean
Three customers, three different domains, three different metrics — the pattern is consistent even if the Supermemory claim requires a caveat.
Nuuly, URBN's clothing rental business, uses the model for an in-house visual search tool that matches warehouse floor photos against their catalog to identify untagged garments. Their Match@20 accuracy went from 60 percent to nearly 87 percent, and their total product identification rate climbed from 74 percent to over 90 percent. This is warehouse logistics, not a research demo — a 27-point jump in match accuracy at rank 20 translates directly to fewer manual lookups and faster processing on a physical sorting floor.
Harvey, the legal research platform, saw a 3 percent increase in Recall@20 on legal-specific benchmarks. In a domain where recall at depth matters, because you need to find all relevant precedents rather than just the top result, a 3 percent lift on a recall-oriented metric is material.
Google's own announcement also credits Supermemory with a 40 percent increase in Recall@1 accuracy after integrating Gemini Embedding 2. Supermemory is building a "vector database for memory" for developers working on agentic applications. The company's own blog, however, is worth reading in full: their recent headline result — approximately 99 percent accuracy on the LongMemEval benchmark — comes from an experimental agentic pipeline they call ASMR, which they explicitly describe as separate from their production engine. The 40 percent figure Google cites appears to reference the production integration, but the two numbers are not the same system. Treat Google's claim and Supermemory's headline as related but distinct data points.
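For readers who want those metrics pinned down, here is one common way to compute them. The exact definitions Nuuly, Harvey, and Supermemory use may differ, so treat this as an illustration rather than their evaluation code.

```python
def match_at_k(retrieved_ids, relevant_ids, k):
    """Match@k / hit rate: 1 if any relevant item appears in the top k results."""
    return int(any(doc_id in relevant_ids for doc_id in retrieved_ids[:k]))

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Recall@k: fraction of all relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    found = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return found / len(relevant_ids)
```

When there is exactly one relevant item per query, as in a "which catalog entry is this garment" task, Match@k and Recall@k collapse to the same number, which is why the two phrasings show up interchangeably in vendor writeups.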
The Matryoshka trick: precision on a budget
Gemini Embedding 2 uses Matryoshka Representation Learning (MRL), training the embedding so that it degrades gracefully when truncated. The default output is 3,072 dimensions, but you can cut that to 1,536 or 768 and take a predictable, modest accuracy loss instead of the unpredictable degradation you would get from truncating a conventionally trained vector.
For high-volume retrieval workloads where storage and compute are cost constraints, MRL means you can serve 768-dimension embeddings in production and upgrade to 3,072 for a final ranking pass — rather than running everything at maximum dimensionality. Google is not inventing this technique, but baking it into a production multimodal model as a default behavior signals that embedding storage costs are expected to matter to the teams adopting this.
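A sketch of the coarse-to-fine pattern this enables, assuming embeddings stored at the full 3,072 dimensions. Truncation plus renormalization is the standard way MRL vectors are shortened, but the 768/3,072 split and the candidate count here are illustrative choices; a real deployment would also keep the 768-dimension index precomputed rather than truncating on the fly.

```python
import numpy as np

def truncate(vec, dims):
    """Keep the leading MRL dimensions and renormalize for cosine similarity."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)

def coarse_to_fine(query_vec, full_index, candidate_k=200, final_k=20):
    """First pass at 768 dims over the whole corpus, then rerank survivors at 3,072."""
    small_index = np.stack([truncate(v, 768) for v in full_index])
    coarse_scores = small_index @ truncate(query_vec, 768)
    candidates = np.argsort(-coarse_scores)[:candidate_k]

    full_q = truncate(query_vec, 3072)
    fine_scores = [(i, float(truncate(full_index[i], 3072) @ full_q)) for i in candidates]
    return sorted(fine_scores, key=lambda x: -x[1])[:final_k]
```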
Batch API and the RAG economics question
The Batch API runs at 50 percent of the default embedding price for higher throughput. Embedding ingestion — chunking, encoding, and storing documents — is the invisible cost in RAG pipelines. You tune query-time retrieval carefully, but bulk indexing runs as a background job and nobody tracks its per-document cost closely.
If batch embedding pricing holds and throughput scales, teams running large document corpora — legal, academic, regulatory filings — have a meaningful incentive to re-evaluate their indexing pipelines. A 50 percent price reduction on bulk operations changes build-versus-buy math for vector database vendors.
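The math is simple enough to sanity-check yourself. The per-million-token price below is a placeholder, not Google's published rate, so plug in whatever your current invoice says; the only number taken from the announcement is the 50 percent batch discount.

```python
def indexing_cost(num_docs, avg_tokens_per_doc, price_per_million_tokens, batch_discount=0.5):
    """Rough bulk-embedding cost with and without the batch discount."""
    total_tokens = num_docs * avg_tokens_per_doc
    on_demand = total_tokens / 1_000_000 * price_per_million_tokens
    return on_demand, on_demand * batch_discount

# Example: 10 million documents at ~800 tokens each, with a placeholder price.
on_demand, batched = indexing_cost(10_000_000, 800, price_per_million_tokens=0.15)
print(f"on-demand: ${on_demand:,.0f}  batch: ${batched:,.0f}")
```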
What this connects to
type0 has covered the agentic AI governance gap and the fragmentation of agent frameworks in recent weeks. Embedding models sit beneath all of that: they are the retrieval layer that determines what context an agent actually sees.
When Supermemory says embedding quality directly improves Recall@1 for agent memory retrieval, that is not a niche claim. If agent memory retrieval is the substrate on which autonomous agent quality depends — and it is — then every meaningful improvement in embedding accuracy compounds. The gap between a 60 percent and 87 percent match rate at rank 20 is the difference between an agent that finds what it needs and one that does not.
The embedding layer is not glamorous infrastructure. But it is the foundation that decides whether agentic pipelines work in practice.
Primary sources