Your AI Agents Are Flying Blind Without This

Solo.io has announced agentevals, an open-source evaluation tool for AI agents that scores behavior directly from existing OpenTelemetry traces rather than re-executing test scenarios. Unveiled at KubeCon Europe 2026, the tool targets the evaluation gap in agentic infrastructure by leveraging production instrumentation data already collected through OTel. It supports the LangChain, Strands, and Google ADK frameworks, installs via pip, and includes LLM-based judges, though its effectiveness depends on the completeness of the underlying traces.
- Agentevals evaluates agents from production OTel traces instead of re-running test scenarios, avoiding token costs and enabling eval against real-world production data.
- Framework support includes LangChain, Strands, and Google ADK out of the box, though eval quality is constrained by whatever instrumentation gaps exist in those frameworks.
- The tool supports Jaeger JSON and OTLP trace formats, ships with built-in and custom evaluators, and exposes CLI, web UI, and MCP server interfaces.
Every agent framework ships with an evaluation tool. Most of them work the same way: run the agent, measure the output, score the result. It's the benchmark equivalent of a driving test where you rebuild the car after every lap.
Solo.io's new open-source tool, agentevals, takes a different approach. Announced at KubeCon + CloudNativeCon Europe 2026 in Amsterdam on March 25, it scores AI agent behavior directly from OpenTelemetry (OTel) traces that you already collected in production — no re-execution, no token burn, no rebuilding the car. GitHub: agentevals
"Evaluation is the biggest unsolved problem in agentic infrastructure today," said Idit Levine, Solo.io's founder and CEO, in the company's announcement. "Organizations have frameworks for building agents, gateways for connecting them, and registries for governing them, but no consistent way to know whether an agent is actually reliable enough to trust in production." GlobeNewswire
The pitch is technically differentiated from the incumbents. LangSmith, DeepEval, and LangChain's own agentevals all require re-running agents through test scenarios to generate evaluation data. Agentevals instead assumes you've already instrumented your agents with OpenTelemetry — a reasonable assumption, given that OTel is the observability standard for distributed systems — and scores behavior from whatever traces you already have. From there, you can run arbitrary eval suites against the same recorded trace data without touching the agent again.
It's a bet on existing infrastructure rather than a new testing harness, and it's the kind of thing that only works if the traces are actually rich enough to evaluate from. The tool supports Jaeger JSON and OTLP trace formats, ships with built-in evaluators and custom evaluator support, and includes LLM-based judges for scoring. It installs via pip (pip install agentevals-cli) and exposes a CLI, web UI, and MCP server. GitHub: agentevals
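The trace-scoring idea is easy to sketch in plain Python. To be clear, this is not agentevals' actual API: the `load_spans` and `tool_error_rate` functions, the `operationName` prefix convention, and the evaluator logic are all illustrative assumptions, loosely modeled on the Jaeger JSON shape the tool accepts. The point is the workflow: a recorded trace is scored as data, with no agent re-execution and no new tokens spent.

```python
import json

def load_spans(jaeger_json: str):
    """Flatten a Jaeger-style JSON export into a flat list of span dicts."""
    doc = json.loads(jaeger_json)
    return [span for trace in doc.get("data", []) for span in trace.get("spans", [])]

def tool_error_rate(spans):
    """Example evaluator: fraction of tool-call spans that recorded an error tag."""
    tool_spans = [s for s in spans if s.get("operationName", "").startswith("tool.")]
    if not tool_spans:
        return None  # trace too sparse to score: the gap the article warns about
    errors = sum(
        1 for s in tool_spans
        if any(t.get("key") == "error" and t.get("value") for t in s.get("tags", []))
    )
    return errors / len(tool_spans)

# A hypothetical recorded trace with two tool calls, one of which errored.
sample = json.dumps({"data": [{"spans": [
    {"operationName": "tool.search", "tags": [{"key": "error", "value": False}]},
    {"operationName": "tool.fetch",  "tags": [{"key": "error", "value": True}]},
    {"operationName": "llm.call",    "tags": []},
]}]})
print(tool_error_rate(load_spans(sample)))  # 0.5
```

Because the evaluator is a pure function over recorded spans, you can add or revise eval suites later and re-score the same traces, which is the architectural difference from re-execution tools.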
The framework compatibility list is notable: LangChain, Strands, and Google ADK are explicitly supported. That covers a significant chunk of the agent framework landscape, but it also means agentevals inherits whatever instrumentation gaps exist in those frameworks. If an agent doesn't produce complete OTel spans — missing tool call parameters, unrecorded tool responses, gaps in the trace chain — the eval will be working from partial data.
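A completeness audit along those lines is straightforward to sketch. The span keys below (`spanID`, `parentID`, `kind`, `attributes`) and the required attribute names are hypothetical, not a real OTel schema; instrumentation conventions vary by framework. The idea is that a trace-based evaluator should flag traces too sparse to score rather than silently grading partial data.

```python
def audit_trace(spans):
    """Flag instrumentation gaps that would undermine a trace-based eval.

    `spans` is a list of dicts; the keys used here are illustrative only.
    """
    problems = []
    ids = {s["spanID"] for s in spans}
    for s in spans:
        # Broken parent chain: span references a parent that was never exported.
        parent = s.get("parentID")
        if parent and parent not in ids:
            problems.append(f"{s['spanID']}: orphaned (parent {parent} missing)")
        # Tool-call spans missing the data an evaluator needs to judge them.
        if s.get("kind") == "tool":
            attrs = s.get("attributes", {})
            for required in ("tool.parameters", "tool.response"):
                if required not in attrs:
                    problems.append(f"{s['spanID']}: missing {required}")
    return problems

spans = [
    {"spanID": "a", "parentID": None, "kind": "agent", "attributes": {}},
    {"spanID": "b", "parentID": "a", "kind": "tool",
     "attributes": {"tool.parameters": "{}"}},  # response never recorded
    {"spanID": "c", "parentID": "zzz", "kind": "llm", "attributes": {}},
]
for p in audit_trace(spans):
    print(p)
```

An agent framework that drops tool responses or breaks the span chain would trip both checks, and any score computed over such a trace would be measuring the instrumentation as much as the agent.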
Solo.io is simultaneously contributing agentregistry, a registry and discovery tool for AI agents, MCP tools, and agent skills, to the Cloud Native Computing Foundation (CNCF). The project was originally introduced in November 2025 and is now entering the CNCF donation process alongside the agentevals launch. Cloud Native Now The registry integrates with Kubernetes, AWS AgentCore, and Google Vertex AI for deployment, and includes runtime discovery to detect agents running outside governed workflows — what the company calls shadow inventory. Cloud Native Now
This is the fourth layer in what Solo.io is positioning as a coherent agent infrastructure stack. Kagent, a framework for building and running AI agents natively in Kubernetes, was accepted into CNCF Sandbox on May 22, 2025 and has grown to 3,414 contributors, 1,119 stars, and 658 releases — growth the company frames as evidence of adoption, though year-over-year percentage comparisons in CNCF project announcements tend to be chosen for effect. CNCF Agentgateway, Solo.io's AI gateway with full MCP and A2A protocol support, is housed under the Linux Foundation. Agentregistry is in CNCF donation. Agentevals is the new piece that connects the registry to evaluation.
The four-layer framing is marketing architecture, not technical debt — but the underlying bet is real. If OpenTelemetry becomes the universal observability substrate for agentic systems, the tooling built on top of it inherits enormous leverage. Solo.io is positioning for that inflection point by making OTel traces do double duty: they're already there for debugging, and now they're also the input to quality evaluation.
There is a naming collision worth flagging. LangChain maintains a separate project called agentevals at github.com/langchain-ai/agentevals that uses a re-execution model — different architecture, same name. Both projects address agent evaluation, but they solve different parts of the problem and require different integration paths. GitHub: langchain-ai/agentevals
The harder question agentevals doesn't fully answer is what the Kang et al. research raised about benchmark validity: evaluation tools measure what you can measure, not whether the measurement correlates with real-world agent reliability. Scoring from traces is genuinely novel. Whether those traces contain the signal that actually predicts production quality is still an open problem — and one that better tooling, by itself, cannot solve.
Sources
- github.com — agentevals GitHub Repository
- globenewswire.com — GlobeNewswire
- cloudnativenow.com — Cloud Native Now
- cncf.io — CNCF
- github.com — GitHub: langchain-ai/agentevals