Microsoft's Copilot Now Has Its Rivals Review Each Other's Work
Microsoft's new Copilot doesn't want to pick a favorite AI model.

Microsoft unveiled a 'Critique' feature in Copilot Researcher that runs a two-model loop: GPT drafts responses and Claude reviews them for accuracy and completeness, marking a notable collaboration between AI rivals within a single product. The multi-model approach delivered a 13.8 percent improvement on the DRACO benchmark, outperforming standalone deep research tools from Perplexity, Claude Opus, Gemini, and OpenAI. This strategic shift appears designed to address Microsoft's eroding market position in paid AI assistants, where its market share contracted 39 percent over six months as users increasingly opt for direct model access over platform intermediaries.
- GPT-Claude critique loop achieved 13.8% improvement on DRACO benchmark, outperforming dedicated research tools from Perplexity, Claude Opus, Gemini, and OpenAI
- Microsoft's paid AI chatbot market share dropped from 18.8% to 11.5% in six months (a 39% contraction), signaling users prefer accessing models directly
- Copilot's 6M DAU lags far behind ChatGPT's 440M, despite Microsoft's massive enterprise distribution through 450M M365 seats
Microsoft's new Copilot doesn't want to pick a favorite AI model. That might be the most revealing thing about it.
The company announced this week that Microsoft 365 Copilot Researcher, its most advanced AI assistant tier, is getting a "Critique" feature that runs a two-model loop: GPT drafts a response, then Claude reviews it for accuracy, completeness, and citation quality before delivery. The setup puts OpenAI's text generator and Anthropic's reasoning model in the same workflow — rivals operating as collaborators inside a Microsoft product. GeekWire reported the rollout this week.
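The draft-then-review pattern described above can be sketched in a few lines of Python. This is an illustration only, not Microsoft's implementation: the names `critique_loop`, `Review`, `stub_drafter`, and `stub_reviewer` are hypothetical, the two models are replaced by local stand-in functions rather than real API calls, and the idea that the drafter revises in response to reviewer notes (and for how many rounds) is an assumption, since the announcement describes only a draft step followed by a review step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Review:
    """Outcome of one review pass (hypothetical structure)."""
    approved: bool
    notes: List[str] = field(default_factory=list)


def critique_loop(
    question: str,
    draft_fn: Callable[[str, List[str]], str],
    review_fn: Callable[[str, str], Review],
    max_rounds: int = 2,
) -> str:
    """Draft with one model, review with another, revise until approved.

    draft_fn stands in for the drafting model and review_fn for the
    reviewing model. The revision rounds are an assumption.
    """
    draft = draft_fn(question, [])
    for _ in range(max_rounds):
        review = review_fn(question, draft)
        if review.approved:
            break
        # Feed the reviewer's notes back to the drafter for a revision.
        draft = draft_fn(question, review.notes)
    return draft


# Stand-in functions so the sketch runs without any external service.
def stub_drafter(question: str, notes: List[str]) -> str:
    text = f"Answer to: {question}"
    if notes:
        text += " [revised: " + "; ".join(notes) + "]"
    return text


def stub_reviewer(question: str, draft: str) -> Review:
    # Approve only drafts that have already been revised once.
    if "revised" in draft:
        return Review(approved=True)
    return Review(approved=False, notes=["add citations"])


result = critique_loop("What is DRACO?", stub_drafter, stub_reviewer)
print(result)
```

The point of the structure is that the reviewer never writes the answer itself; it only gates delivery and hands notes back, which is what makes it possible to put two different vendors' models in the two roles.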
Microsoft says the multi-model approach produced a 13.8 percent improvement on the DRACO benchmark, a deep research evaluation covering 10 domains, 40 countries, and real-world queries with an average of 20.5 factual accuracy criteria per task. The company claims Researcher with Critique outperforms Perplexity Deep Research, Claude Opus, Gemini Deep Research, and OpenAI's own Deep Research on the same measure. The AI Economy first flagged the DRACO numbers and benchmark positioning.
That's the product story. Here is the business story underneath.
As of Q2 FY2026 earnings in January, Microsoft reported 15 million paid Copilot seats across its commercial M365 base of approximately 450 million users. SAMexpert calculated a 3.3 percent penetration rate. The raw seat count sounds large until you put it next to ChatGPT, which reported 440 million daily active users in March, or Anthropic's Claude, which hit 9 million DAU the same month, per Sensor Tower data cited by CNBC. Copilot's daily active user count: 6 million in February. The gap is not narrowing quickly.
More concerning for Microsoft: its share of the paid AI chatbot subscriber market fell from 18.8 percent in July 2025 to 11.5 percent in January 2026, a drop of 7.3 percentage points and a 39 percent relative contraction in six months, SAMexpert found. That's not a product problem yet. It's a distribution and habit problem. People who pay for AI assistants are increasingly choosing the source model directly rather than going through Microsoft's layer.
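For readers who want to check the arithmetic behind the two headline percentages, here is a minimal Python sketch using only the figures quoted above. The helper name `relative_change` is ours, introduced for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Paid Copilot seats as a share of the commercial M365 base.
penetration = 15_000_000 / 450_000_000

# Paid AI chatbot market share, July 2025 -> January 2026 (in percent).
share_change = relative_change(18.8, 11.5)

print(f"Copilot penetration: {penetration:.1%}")    # 3.3%
print(f"Market share change: {share_change:.1%}")   # -38.8%
```

The second figure rounds to the 39 percent contraction cited in the text. It is a relative decline; in absolute terms the share fell by 7.3 percentage points.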
The Critique feature is an answer to that. Rather than betting on a single frontier model, Microsoft is betting on being the platform where models work together. Mustafa Suleyman, who runs Microsoft's AI unit, put the logic plainly in a March interview with CNBC: "Most of the future value is going to accrue to the model layer, and my job is to create highly COGS-optimized, highly efficient enterprise-specific model lineages for Microsoft over the next three to five years. That is singularly the objective, precisely because the model is the product." The hedge is visible in that sentence — Microsoft needs the model layer to generate value, which means it needs to own more of it.
Which brings us to MAI. Microsoft's AI Superintelligence team has been releasing models that are starting to compete with the labs it still depends on. MAI-Image-2 reached third place on the Arena leaderboard, trailing only Google Gemini 3.1 Flash and OpenAI's GPT-Image 1.5, WinBuzzer reported. More pointed: MAI-DxO paired with OpenAI o3 correctly solved 85.5 percent of NEJM benchmark cases — a result Microsoft published in a post titled "The Path to Medical Superintelligence." The company is building models that match and sometimes exceed the models it resells.
The 13.8 percent DRACO improvement is worth flagging before it becomes a number that circulates without context: Microsoft measured and reported it itself, on a benchmark created by Perplexity. In benchmarking terms that is a conflict of interest, because the company that built the evaluated system is also the one that picked the test and graded the results. Independent replication doesn't exist yet. DRACO itself is a legitimate benchmark; the self-evaluation is the caveat.
The broader reorg adds texture to the strategy. Rajesh Jha, Executive Vice President of Experiences + Devices, announced his retirement March 12 after 35 years at Microsoft, stepping down July 1. Jacob Andreou, a former Snap executive, is taking over oversight of both consumer and commercial Copilot experience and will report directly to Satya Nadella, CNBC reported. The org chart change puts AI experience leadership at the CEO level for the first time — a structural acknowledgment that Copilot is now core, not experimental.
Microsoft's IP agreement with OpenAI runs through 2032, CNBC confirmed, so the partnership isn't ending tomorrow. But the direction of travel is clear. Microsoft is simultaneously the world's largest AI model distributor and the most motivated builder of alternatives. The Critique feature is clever product engineering in the interim. MAI is the longer bet.
For enterprise buyers, the immediate question is whether a multi-model loop produces meaningfully better research output than a single frontier model. The DRACO number hints that it might, but it is self-reported, unidirectional, and in need of outside verification, so it is not enough to answer that question yet. For Microsoft, the question is simpler: can it build enough loyalty inside its 450-million-user base before the model layer decides it doesn't need an aggregator in the middle?
Editorial Timeline
- Sonny, Mar 30, 1:26 PM
Story entered the newsroom
- Sky, Mar 30, 1:26 PM
Research completed — 0 sources registered. Critique feature: GPT drafts inside Copilot Researcher, Claude reviews for accuracy/completeness/citation quality. 13.8% DRACO improvement (self-repor
- Sky, Mar 30, 1:43 PM
Draft (794 words)
- Sky, Mar 30, 1:43 PM
Reporter revised draft (820 words)
- Giskard, Mar 30, 2:14 PM
- Rachel, Mar 30, 2:18 PM
Approved for publication
- Mar 30, 2:19 PM
Headline selected: Microsoft's Copilot Now Has Its Rivals Review Each Other's Work
Published (801 words)
Sources
- geekwire.com
- theaieconomy.substack.com
- samexpert.com
- cnbc.com
- winbuzzer.com
- microsoft.ai

