GPT writes, Claude reviews, and Microsoft watches the benchmark climb.
Microsoft's Copilot now runs a two-model critique loop: GPT generates a first draft, and Claude evaluates it for factual accuracy, source quality, and completeness before the report reaches the user. The feature, called Critique, is live in Microsoft 365 Copilot's Researcher agent, with plans to eventually run the pipeline in both directions, so that either model can draft while the other reviews. A second feature, Council, surfaces side-by-side responses from multiple models so users can see where they agree and where they diverge.
The headline number is real. Critique scores +7.0 points on DRACO's aggregated score over Perplexity Deep Research — a 13.88% improvement — evaluated using GPT-5.2, the strictest of the three judge models reported in the paper, according to Microsoft's Tech Community announcement (source). DRACO, published on arXiv in February 2026 by ten researchers — Joey Zhong, Hao Zhang, Clare Southern, Jeremy Yang, Thomas Wang, Kate Jung, Shu Zhang, Denis Yarats, Johnny Ho, and Jerry Ma, all with confirmed Perplexity connections — evaluates on four weighted dimensions: factual accuracy (52%), analytical depth (22%), presentation (14%), and source attribution (12%), across 100 complex deep research tasks spanning 10 domains (source). The improvement is statistically significant in 8 of 10 domains (paired t-test, p < 0.05).
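To make the scoring arithmetic concrete, here is a minimal Python sketch. It assumes the aggregated score is a plain weighted sum of per-dimension scores on a 0-100 scale, which the announcement does not spell out, and that the 13.88% figure is relative to the baseline's aggregated score. The per-dimension numbers in `example` are hypothetical; only the weights, the 7.0-point gain, and the 13.88% figure come from the reporting above.

```python
# Dimension weights as reported above (they sum to 100%).
WEIGHTS = {
    "factual_accuracy": 0.52,
    "analytical_depth": 0.22,
    "presentation": 0.14,
    "source_attribution": 0.12,
}


def aggregate(scores):
    """Weighted sum of per-dimension scores (assumed aggregation rule)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def implied_baseline(point_gain, relative_gain):
    """Baseline score implied by an absolute gain and a relative gain."""
    return point_gain / relative_gain


# Hypothetical per-dimension scores, for illustration only.
example = {
    "factual_accuracy": 60.0,
    "analytical_depth": 50.0,
    "presentation": 70.0,
    "source_attribution": 40.0,
}
example_total = aggregate(example)

# If +7.0 points is a 13.88% improvement, the baseline sits near 50.4
# and the improved score near 57.4 (under the assumptions stated above).
baseline = implied_baseline(7.0, 0.1388)
critique = baseline + 7.0
```

Read this way, the gain moves the aggregated score from the low 50s to the high 50s. Because factual accuracy carries more than half the weight, a gain concentrated in that one dimension would account for most of the movement.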
This result comes with a triple conflict of interest worth disclosing plainly. First, the DRACO benchmark was authored by a Perplexity team. Second, Perplexity's own product, Perplexity Deep Research, serves as the comparison baseline, so one company built both the test and the yardstick. Third, the evaluation was graded by GPT-5.2, a model from the same family that writes Critique's first drafts, so the grader is not a neutral referee either.
But benchmarks measure narrow tasks, not the full cost-latency-quality tradeoff of running two models sequentially.
The more interesting story is architectural. Microsoft has imported a practice from academic publishing — peer review — into the AI workflow. By separating generation from evaluation, the company is betting that neither task competes with the other's compute budget. "By giving evaluation as much emphasis as generation, this architecture creates a powerful feedback loop," the Tech Community post notes.
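The separation of generation from evaluation can be sketched in a few lines of Python. This is an illustrative outline, not Microsoft's implementation: the `Review` type, the `critique_loop` function, the revision step, and the `max_rounds` cap are all assumptions, and the two stub functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    approved: bool
    issues: List[str]


# Hypothetical interfaces: in a real system each would wrap a model call.
Generator = Callable[[str, List[str]], str]  # (task, issues to fix) -> draft
Critic = Callable[[str, str], Review]        # (task, draft) -> review


def critique_loop(task: str, generate: Generator, review: Critic,
                  max_rounds: int = 2) -> str:
    """Draft with one model, review with another, revise until approved."""
    issues: List[str] = []
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, issues)
        verdict = review(task, draft)
        if verdict.approved:
            break
        issues = verdict.issues  # feed the critique into the next draft
    return draft


# Stub models for demonstration only.
def stub_generate(task: str, issues: List[str]) -> str:
    draft = f"Report on {task}."
    if issues:
        draft += " [sources added]"
    return draft


def stub_review(task: str, draft: str) -> Review:
    if "[sources added]" in draft:
        return Review(approved=True, issues=[])
    return Review(approved=False, issues=["missing sources"])


final = critique_loop("DRACO results", stub_generate, stub_review)
```

Each round costs at least one generation call plus one review call, which is the cost and latency side of the tradeoff that a benchmark score alone does not capture.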
This is also a bet on orchestration over ownership. Microsoft has relationships with both OpenAI and Anthropic, and Critique uses models from both. The company isn't choosing; it's arbitraging. Copilot adoption sits at roughly 3.3% of commercial Microsoft 365 users (15 million of 450 million subscribers), according to Motley Fool's analysis of Microsoft's earnings disclosures (source). Microsoft needs demonstrable quality gains to move the needle, and is apparently willing to pay the compute cost of a two-model pipeline to get them.
Critique and Council are broadly available in Microsoft's Frontier program as of March 30, 2026. Copilot Cowork — the longer-horizon task-delegation tool built on Anthropic's Claude Cowork technology — is also available through Frontier. Capital Group, the asset manager, is already using it. "It's connecting steps, coordinating tasks, and following through across everyday workflows," said Barton Warner, SVP of Enterprise Technology at Capital Group, per Microsoft's blog (source).