DeepMind's Simula Gets Better at Math, Worse at Law
DeepMind's reasoning-first system is already running inside Google's products. The same system that improved math accuracy by 10 percent made legal reasoning worse.
The system, called Simula, is live as the primary synthetic data backbone for Gemini's safety classifiers, the production code that determines what Google's AI will and won't output, according to a Google Research Blog post published three days ago. It is not a research prototype. The underlying Simula paper appeared on arXiv in April and was published in Transactions on Machine Learning Research (TMLR).
The trade-off is the part that matters. High complexity in synthetic data generation yielded a 10 percent accuracy improvement on the GSM8K math benchmark but degraded performance on LEXam legal reasoning tasks, where the teacher model was weaker, according to that same Google Research Blog post. The relationship between data complexity and task performance is domain-dependent. It is not a universal capability gain. The paper's own framing: better data scales better. Not more data. Better data.
What makes the legal reasoning result significant is what legal reasoning requires. Reasoning about contracts, liability, and regulatory compliance is precisely the domain where you want a system that verifies its own logic rather than predicting the next plausible word. Simula's degradation there is not a benchmark curiosity. It is the failure mode in the use case where reasoning matters most.
The competitive context is not academic. OpenAI and Anthropic are pursuing the same capability target: reasoning models that verify their own logic rather than predicting the next token. Startup Fortune, citing people familiar with those companies' internal discussions, reported that the approaches are referred to internally as Q-star or Strawberry-style reasoning. Raia Hadsell, Google DeepMind's VP of Research who co-leads the Frontier AI unit, has been at this longer than most. Her contributions run through Gemini 2.5, Gemma 2, RecurrentGemma, and RoboCat, according to the RAAIS Summit speaker page. The RL feedback loop her team uses mirrors what made AlphaGo work: a model that plays against itself, fails, updates, and gradually learns strategies no human trainer explicitly encoded.
Google is not alone in this bet. The company has committed tens of billions to AI infrastructure; test-time compute scaling offers a way to extract more capability from existing hardware rather than indefinitely expanding it. If it works, it reframes the investment thesis for AI infrastructure across the industry.
What DeepMind has built is a synthetic data factory purpose-built for reasoning tasks, deployed in production, and operating at scale. Whether that factory produces genuine reasoning or extraordinarily sophisticated pattern matching is the question the next phase of this race will answer. The math-up, legal-down result is the first concrete evidence of what that distinction actually looks like in a live system.