Mastra Observational Memory benchmark LongMemEval token cost reduction

reported by Mycroft · 5 min read · published March 25, 2026

When AI agents run in production, they accumulate context the way hoarder apartments accumulate newspapers — except the rent is paid in tokens. Every tool call, every browsing session, every file scan gets appended to the context window. Costs climb. Latency follows. At some point the agent starts forgetting what it was doing three turns ago because the middle of the conversation has become a novel's worth of noisy tool output.

This is the problem Mastra, an open-source AI agent framework, has been building toward solving for the past year. The answer they've landed on is called Observational Memory — and the benchmark numbers are genuinely unusual.

According to results published to Mastra's research page, Observational Memory scores 94.87 percent on LongMemEval, a benchmark presented at ICLR 2025 that tests whether AI systems can correctly answer questions about conversations from earlier sessions. That score, achieved with GPT-5-mini as the primary model, is by Mastra's account the highest recorded on that benchmark by any system with any model. With GPT-4o, the standard comparison model for LongMemEval, Observational Memory scores 84.23 percent, beating the "oracle" baseline of 82.40 percent, a configuration that is given only the specific conversations containing the answer.

The technical architecture is worth unpacking, because it's not what you'd expect from a memory system in 2026.

Two agents, no vector DB

Most agent memory systems reach for a vector database at some point. You embed past conversations, store them in a retrieval index, and query relevant chunks on each turn. This is retrieval-augmented generation — familiar, flexible, and slow in the way that retrieval always is, because you're doing a search before you can reason.

Observational Memory skips the index. Instead, it uses two background agents — an Observer and a Reflector — that watch the main agent's conversation and maintain a dense, append-only log of observations. The context window is split into two blocks: the observation log at the start, and raw messages that haven't been compressed yet at the end. When the raw message block hits roughly 30,000 tokens (the default, configurable), the Observer compresses everything since the last observation into a new batch of observations. Those observations are appended to the log, and the original messages are dropped.

When the observation log itself hits around 40,000 tokens, the Reflector runs garbage collection — deciding what's still useful and what isn't.
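
The two-block layout and its thresholds can be sketched in a few lines of Python. This is an illustration of the flow described above, not Mastra's implementation: the names `ObservationalContext`, `observe`, and `reflect` are hypothetical, the "compression" is a placeholder for what would really be an LLM call, and tokens are approximated by counting words.

```python
from dataclasses import dataclass, field

OBSERVE_AT = 30_000  # raw-message threshold quoted in the article
REFLECT_AT = 40_000  # observation-log threshold quoted in the article


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def observe(messages: list[str]) -> str:
    """Hypothetical Observer. A real one would ask an LLM to summarize;
    this placeholder keeps only the first 10 words of each message."""
    return "\n".join(" ".join(m.split()[:10]) for m in messages)


def reflect(observations: list[str]) -> list[str]:
    """Hypothetical Reflector. A real one would ask an LLM what is still
    useful; this placeholder simply keeps the newer half of the log."""
    return observations[len(observations) // 2:]


@dataclass
class ObservationalContext:
    observations: list[str] = field(default_factory=list)  # block 1: the log
    raw_messages: list[str] = field(default_factory=list)  # block 2: uncompressed tail

    def add_message(self, message: str) -> None:
        self.raw_messages.append(message)
        if sum(map(count_tokens, self.raw_messages)) >= OBSERVE_AT:
            # Compress everything since the last observation, then drop the originals.
            self.observations.append(observe(self.raw_messages))
            self.raw_messages.clear()
        if sum(map(count_tokens, self.observations)) >= REFLECT_AT:
            # Garbage-collect the log once it grows past its own threshold.
            self.observations = reflect(self.observations)

    def prompt(self) -> str:
        # Observation log first, raw messages last.
        return "\n".join(self.observations + self.raw_messages)
```

Feeding this sketch thirty 1,000-word messages triggers exactly one Observer pass: the raw block empties and the log gains a single, much smaller entry.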

The whole thing is text-based. No vectors, no graph database, no embedding model running on every retrieval. Tyler Barnes, a founding engineer at Mastra who focuses on agent memory, tools, and model routing, and previously a staff software engineer at Gatsby and Netlify, summed up the approach in a blog post as "text is the universal interface," a nod to a running argument in the AI infrastructure community about whether structured knowledge graphs are worth their complexity.

The Observer and Reflector both run on gemini-2.5-flash by default — a cheap, fast model doing compression work so the primary agent doesn't have to.

The cache trick

The part that matters for production economics is what this architecture enables downstream.

Because the observation log is append-only and only changes at defined token thresholds, the prompt prefix is stable across many turns. The first block — the observations — stays consistent. Only the raw message block at the end grows, and when it triggers compression, the new observations are appended to the existing block rather than replacing it. The prefix stays intact.

That prefix stability is the key to prompt caching. Most model providers — Anthropic, OpenAI, and others — offer cached prompt tokens at a significant discount because recomputing attention for identical prefixes is wasteful. If your agent's context window changes every turn through dynamic retrieval, the cache hit rate is near zero. If your observation log is append-only and the prefix is reproducible across turns, you get full cache hits on every turn except the occasional reflection pass, which is infrequent.
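
A toy comparison makes the difference concrete. The sketch below measures how much of one turn's prompt is an exact prefix of the next, first for an append-only layout and then for a layout that puts a freshly retrieved chunk at the front. The prompts and the helper `shared_prefix_len` are invented for illustration, and character-level matching stands in for the token-level prefix matching that provider caches perform.

```python
import os


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


# Append-only layout: stable observation log first, growing tail last.
observations = "obs: user prefers dark mode\nobs: deploy target is staging\n"
turn_1 = observations + "user: run the tests\n"
turn_2 = turn_1 + "assistant: 3 tests failed\n"

# Dynamic-retrieval layout: a different chunk leads the prompt each turn.
rag_1 = "chunk: notes on dark mode\n" + "user: run the tests\n"
rag_2 = "chunk: notes on staging\n" + "user: run the tests\nassistant: 3 tests failed\n"

append_only = shared_prefix_len(turn_1, turn_2)  # all of turn 1 is reusable
retrieval = shared_prefix_len(rag_1, rag_2)      # diverges inside the first chunk

print(f"append-only: {append_only}/{len(turn_1)} characters reusable")
print(f"retrieval:   {retrieval}/{len(rag_1)} characters reusable")
```

In the append-only case the entire previous prompt is a prefix of the next one. In the retrieval case the prompts diverge inside the first chunk, so almost nothing after that point could be served from a prefix cache.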

Mastra's own testing shows 3-6x text compression for conversation-heavy workloads and 5-40x for tool-heavy workloads. The noisier the tool output, the higher the ratio: a Playwright screenshot or a file scan produces a lot of tokens, but few of them matter to future reasoning, so they collapse into a handful of observations.

The token math matters here. For a deep research agent that runs 500 tool calls across a browsing session, compressing that output into observations and then caching the prefix means you pay full price for the compression pass once, then steeply discounted prices for every subsequent turn that reuses the cached prefix. Providers that offer prompt caching discount cached input tokens by roughly 4-10x, and the stable append-only prefix is what unlocks those hits.
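
A back-of-the-envelope model shows how that discount plays out over a session. Every number below is an assumption chosen for illustration (100 turns, a 20,000-token stable prefix, a 2,000-token uncached tail, $1 per million input tokens, a 10x cached-token discount); none comes from Mastra or any provider's price list. The function `input_cost` is hypothetical and ignores output tokens and any cache-write surcharge a provider might apply.

```python
def input_cost(turns: int, prefix_tokens: int, tail_tokens: int,
               price: float, cache_discount: float = 1.0) -> float:
    """Input-token cost of `turns` requests that share a stable prefix.

    The first request pays full price for the prefix (a cache miss). Later
    requests pay price / cache_discount for it. The tail is never cached.
    """
    first = (prefix_tokens + tail_tokens) * price
    later = (turns - 1) * (prefix_tokens * price / cache_discount
                           + tail_tokens * price)
    return first + later


PRICE = 1e-6  # assumed: $1 per million input tokens

no_cache = input_cost(100, 20_000, 2_000, PRICE)
cached = input_cost(100, 20_000, 2_000, PRICE, cache_discount=10)

print(f"without caching: ${no_cache:.2f}")   # about $2.20
print(f"with caching:    ${cached:.2f}")     # about $0.42
print(f"overall saving:  {no_cache / cached:.1f}x")
```

Under these assumptions the session-level saving comes out a little over 5x rather than the full 10x, because the uncached tail and the first cache miss are still billed at full price. The larger the stable prefix is relative to the tail, the closer the saving gets to the provider's headline discount.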

Where it sits on the benchmark landscape

LongMemEval is worth understanding because it is not an easy benchmark. The dataset covers roughly 57 million tokens of conversation data across about 50 sessions per test question, with 500 questions total. Systems must retrieve and use information from sessions that happened long before the question is asked, which is exactly the failure mode that plagues agents in production.

On the leaderboard, Observational Memory's 94.87 percent sits above Hindsight (91.40 percent with Gemini-3 Pro Preview), Supermemory (85.20 percent with Gemini-3 Pro Preview), and Zep (71.20 percent with GPT-4o). The full-context baseline, which essentially gives the model everything and hopes, scores 60.20 percent, which tells you how badly context window overflow degrades performance without any memory system at all.

Mastra RAG, the same framework's retrieval-augmented generation approach, scores 80.05 percent with GPT-4o. Observational Memory's 84.23 percent with the same model represents a meaningful gap — and it's achieved without the per-turn retrieval latency that RAG introduces.

The one asterisk worth noting: EmergenceMem shows 86.00 percent with GPT-4o, but that score is for an internal configuration that isn't publicly reproducible. The highest publicly reproducible score on the official benchmark model is Observational Memory's 84.23 percent.

The production question

Benchmarks are a controlled environment. Production is not. The synchronous version of Observational Memory — where the Observer runs in the conversation thread and blocks while compressing — is a real constraint for agents that need sub-second response times. Mastra's blog post acknowledges this and notes that an async background buffering mode was shipping the week of the announcement.

The codebase is open. The benchmark runner is on GitHub under the Mastra organization. The implementation is in packages/memory/src/processors/observational-memory. This is infrastructure you can read, not just a landing page to trust.

For teams building long-running agents — coding assistants, deep research tools, browser agents that accumulate multi-hour sessions — the token cost problem is real and getting more acute as agentic workloads scale. The question isn't whether memory compression matters. It's whether the append-only, cache-friendly approach proves durable in production environments with messy, non-standardized tool outputs.

Observational Memory is the most concrete attempt so far to make that approach work at benchmark scale. Whether it holds up when production users stop running curated test sessions is the next question worth watching.