NVIDIA AI-Q Hits #1 on DeepResearch Bench I and II
NVIDIA's AI-Q deep research agent has claimed the top spot on both DeepResearch Bench I and DeepResearch Bench II, according to the official leaderboards for both benchmarks.
The system scored 55.95 on Bench I and 54.50 on Bench II, making it the first open blueprint to lead both benchmarks simultaneously. The results were added to the Bench I repository on March 8, 2026.
What AI-Q is: According to NVIDIA's technical documentation, AI-Q is an open blueprint for building AI agents that reason over enterprise and web data to deliver well-cited responses. The deep researcher component — the part that scored on these benchmarks — uses a multi-agent architecture with three core components: an orchestrator that coordinates the research loop, a planner that maps the information landscape, and a researcher that dispatches parallel specialists to gather and synthesize evidence.
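The plan-then-parallel-research loop described above can be sketched in plain Python. This is an illustrative mock, not NVIDIA's implementation: the `plan`, `research`, and `orchestrate` functions and the `Finding` type are hypothetical names, and the specialist work is stubbed out.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Finding:
    topic: str
    evidence: str

def plan(question: str) -> list[str]:
    # Planner: map the information landscape into sub-topics (stubbed).
    return [f"{question} -- aspect {i}" for i in range(3)]

def research(topic: str) -> Finding:
    # Specialist: gather and summarize evidence for one sub-topic (stubbed).
    return Finding(topic, f"evidence for {topic!r}")

def orchestrate(question: str) -> str:
    # Orchestrator: coordinate the plan -> parallel research -> synthesize loop.
    topics = plan(question)
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(research, topics))
    # Synthesize the gathered findings into a single report.
    return "\n".join(f"- {f.topic}: {f.evidence}" for f in findings)

report = orchestrate("What is AI-Q?")
```

The key structural point is that the planner decomposes the question once, the specialists run concurrently, and the orchestrator owns synthesis; the real system layers model calls, tool use, and citation tracking onto this skeleton.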
The stack runs on NVIDIA NeMo Agent Toolkit and LangChain DeepAgents, powered by fine-tuned Nemotron 3 Super models. Enterprises can inspect, customize, and configure the system per their use case.
Why it matters: DeepResearch Bench I evaluates report quality against reference reports across comprehensiveness, depth of insight, instruction-following, and readability. DeepResearch Bench II uses over 70 fine-grained rubrics per task to check information retrieval, synthesis, and presentation. Leading on both suggests AI-Q produces both polished narratives and granular factual correctness.
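Rubric-based evaluation of the kind Bench II uses can be illustrated with a minimal weighted-checklist scorer. This is a sketch of the general technique only; the actual benchmark's rubrics and scoring code are not public in this article, and `score_report` and its phrase-matching check are assumptions.

```python
def score_report(report: str, rubrics: list[tuple[str, float]]) -> float:
    """Score a report against weighted rubric checks, on a 0-100 scale.

    Each rubric is (required_phrase, weight); a rubric is satisfied when
    the phrase appears in the report (a stand-in for real rubric judging).
    """
    total = sum(weight for _, weight in rubrics)
    earned = sum(
        weight
        for phrase, weight in rubrics
        if phrase.lower() in report.lower()
    )
    return 100.0 * earned / total

# Hypothetical rubrics checking retrieval, synthesis, and presentation.
rubrics = [
    ("citations", 1.0),
    ("methodology", 1.0),
]
score = score_report("The report includes citations.", rubrics)
```

A real evaluator would replace the phrase check with an LLM or human judgment per rubric, but the aggregation, many fine-grained checks averaged into one score, is the same shape.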
NVIDIA frames this as "a meaningful step for open, portable deep research," arguing that developer-accessible models and tooling can power state-of-the-art agentic research.
The open question: How these benchmark scores translate to real-world enterprise research performance has not been independently validated. Benchmarks measure what systems can do in controlled conditions; what they do in messy enterprise environments is a separate question that leaderboard data alone cannot answer.
This article synthesizes NVIDIA's technical documentation and Hugging Face reporting with direct verification against the live DeepResearch Bench I and II leaderboards. The benchmark scores were confirmed against the official repositories.
Sources
- agentresearchlab.org — DeepResearch Bench II Official Leaderboard
- huggingface.co — Hugging Face Blog
- github.com — DeepResearch Bench I GitHub Repository
- github.com — DeepResearch Bench II GitHub Repository