GPT and Claude Now Work Together in Microsoft Copilot
GPT writes, Claude reviews, and Microsoft watches the benchmark climb. The two-model critique loop is real, but the 13.88% DRACO gain measures a narrow task, not the full cost of running both.

Microsoft has introduced a two-model critique loop in Copilot's Researcher agent: GPT generates drafts, and Claude evaluates them for factual accuracy, source quality, and completeness. Microsoft claims a +7.0 point (13.88%) improvement on the DRACO benchmark over Perplexity Deep Research. The benchmark carries a triple conflict of interest: it was authored by Perplexity-affiliated researchers, uses Perplexity's own product as the comparison baseline, and employs GPT-5.2, a model from one of the companies being measured, as the judge. Architecturally, the approach separates generation from evaluation into distinct compute budgets, letting Microsoft leverage its relationships with both OpenAI and Anthropic rather than committing to a single model provider.
- Multi-model orchestration, where separate models handle generation and evaluation, is becoming a production pattern in enterprise AI, with Microsoft betting that neither task competes with the other's compute budget.
- The reported 13.88% benchmark improvement comes with significant conflicts of interest: the DRACO benchmark was authored by Perplexity-affiliated researchers, uses Perplexity Deep Research as the baseline, and grades results using GPT-5.2, the strictest judge and another party being measured.
- Microsoft is positioning itself as a neutral arbiter between AI providers by using models from both OpenAI (GPT) and Anthropic (Claude), essentially arbitraging relationships rather than committing to a single vendor.
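The generate-then-critique pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Microsoft's implementation: Copilot Researcher's internals are not public, and the `generate` and `critique` functions below are hypothetical stand-ins for the GPT drafting model and the Claude evaluator.

```python
def generate(prompt, feedback=None):
    """Stand-in for the drafting model (GPT in Copilot's loop).

    Produces a draft, or revises one when evaluator feedback is given.
    """
    draft = f"Draft answering: {prompt}"
    if feedback:
        draft += f" [revised per: {feedback}]"
    return draft


def critique(draft):
    """Stand-in for the evaluator model (Claude in Copilot's loop).

    Checks the draft (here, trivially) and returns (approved, feedback).
    """
    if "[revised per:" in draft:
        return True, None
    return False, "cite sources for key claims"


def critique_loop(prompt, max_rounds=3):
    """Run generation and evaluation as separate calls, so each model
    works within its own compute budget, until the evaluator approves
    or the round limit is hit.
    """
    feedback = None
    for _ in range(max_rounds):
        draft = generate(prompt, feedback)
        approved, feedback = critique(draft)
        if approved:
            return draft
    return draft  # best effort after max_rounds


print(critique_loop("What changed in Copilot?"))
```

The key design choice mirrored here is that the evaluator only returns a verdict and feedback; the generator alone produces text, which keeps the two roles (and their costs) cleanly separated.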

