Shopify's CTO says nearly every employee now uses an AI tool every day. One consequence he has observed internally: AI-generated code is producing more bugs in production, even though the code models write is cleaner on average than what humans produce. Mikhail Parakhin, Shopify's chief technology officer, made that observation in a wide-ranging interview with the Latent Space AI Engineer Podcast published April 22. The rest of the conversation explains how Shopify got there and what it built to handle the consequences.
Shopify has been tracking internal AI tool adoption since models became usable, and Parakhin shared the charts. The green line representing total daily active AI tool users now approaches [100], he said, pointing to a slide where the figure appears. "It's hard not to do your job now without interacting deeply with at least one tool." The inflection point he identifies is December 2025. "Small improvements accumulated into this big change," he said. The company isn't treating this as a management success. It reads more like a tipping point nobody quite predicted: the models crossed some threshold and adoption exploded regardless of policy.
Shopify's policy response was to remove the cost ceiling and set a quality floor. The company funds unlimited tokens for all employees. But rather than capping spending, it sets a minimum: don't use anything less capable than Claude Opus 4.6. Some employees are running GPT-5.4 Extra High. The leaderboard of top token users has Parakhin himself near the top, as First Round Capital reported in an interview with Farhan Thawar, Shopify's vice president of engineering. A separate Bessemer Venture Partners interview with Thawar corroborates the adoption picture: roughly 20 percent productivity gains, unlimited tool spending, and non-engineers using Cursor across the company.
One counterintuitive finding: CLI-based coding tools are growing faster than IDE-based ones. Claude Code, Codex, Pi, and Shopify's internal River agent are accelerating. GitHub Copilot and Cursor are still growing, but more slowly. "IDE kind of tools, they're not experiencing as fast of a growth," Parakhin said. The bottleneck in AI coding is no longer getting the model to write something reasonable. The bottleneck is review, CI/CD, and deployment stability.
This is where the bugs observation lands. If a model writes ten functions and nine are excellent and one is quietly wrong, the overall code-quality statistic looks great. Production doesn't care about averages. One wrong function ships. Parakhin has observed that models writing cleaner code on average than humans can still increase production bugs. He didn't provide specific metrics on the bug rate itself. The mechanism he describes is coherent: better average quality spread over much more code means more good code, and, in absolute terms, more edge-case failures from code that would never have been written at all.
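The arithmetic behind that mechanism is easy to make concrete. The numbers below are hypothetical, chosen only to illustrate the shape of the effect, not drawn from Shopify's data: a lower per-function defect rate multiplied by a much larger volume of shipped code can still yield more total bugs.

```python
# Hypothetical numbers, chosen only to illustrate the mechanism.
human_functions_per_week = 100
human_defect_rate = 0.05        # 5% of human-written functions are buggy

ai_functions_per_week = 400     # models ship far more code
ai_defect_rate = 0.02           # cleaner on average than the human code

human_bugs = human_functions_per_week * human_defect_rate   # 5 buggy functions
ai_bugs = ai_functions_per_week * ai_defect_rate            # 8 buggy functions

assert ai_defect_rate < human_defect_rate   # better average quality...
assert ai_bugs > human_bugs                 # ...and still more bugs shipped
```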
The other finding that should make vendors nervous: running many agents in parallel is nearly useless. "That's almost useless, compared to just fewer agents, and burns tokens very efficiently," Parakhin said. The pattern that actually works is a critique loop between two agents using different models. One writes; one critiques. Ideally the second model is different from the first, to avoid shared blind spots.
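A minimal sketch of that critique loop, with toy stand-ins for the two models. In real use the writer and critic would be two different LLM backends; nothing here reflects Shopify's actual implementation.

```python
# Sketch of a two-agent critique loop. writer/critic are callables
# (prompt -> text); in practice they would be two different LLM backends.
def critique_loop(task, writer, critic, rounds=3):
    draft = writer(f"Write: {task}")
    for _ in range(rounds):
        review = critic(draft)
        if review == "OK":                      # critic found nothing to fix
            break
        draft = writer(f"Revise per review '{review}': {draft}")
    return draft

# Toy demo: the critic rejects drafts until they mention error handling.
def toy_writer(prompt):
    # Adds error handling only when asked to revise.
    return prompt + (" [with error handling]" if "Revise" in prompt else " [code]")

def toy_critic(draft):
    return "OK" if "error handling" in draft else "add error handling"

result = critique_loop("parse CSV", toy_writer, toy_critic)
assert "error handling" in result               # converged after one revision
```

The key design choice mirrors the one Parakhin describes: the second agent reviews the first's output rather than generating in parallel with it.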
This is the opposite of what most agent-based product pitches look like in 2026, which tend to emphasize swarms, parallelism, and scale as the selling point. At Shopify's scale — 40 million LLM calls and approximately 16 billion tokens per day, according to a Shopify engineering post — the answer is fewer agents communicating better.
To get from where the company was to running agentic workflows at production scale, Shopify had to build infrastructure that doesn't exist off the shelf. Three systems are doing most of the work: Tangle, Tangent, and SimGym.
Tangle is the oldest. Shopify open-sourced it in December 2025 with a technical writeup by Alexey Volkov, a staff engineer in Shopify's search product and tooling organization. The problem Tangle solves is reproducibility in ML and data workflows. Every dataset transformation, model training step, and data pipeline output gets a content-addressed hash. If the inputs haven't changed, Tangle reuses the cached output instead of recomputing.
That sounds simple. The results aren't. "Since using it, we have racked up more than a year of compute time savings," the engineering post says. Content-addressed caching creates network effects inside an organization: the more teams use it, the more likely a new computation hits an existing cache entry from a different team's prior work. Tangle is language-neutral, models workflows as a directed acyclic graph, and integrates with existing CI/CD tooling. The comparison Parakhin draws is to Apache Airflow, a widely used workflow orchestration tool, but Tangle is designed for ML reproducibility rather than general data pipeline scheduling.
Tangent is less documented externally. Parakhin describes it as an auto-research system: an agent that runs optimization loops automatically against whatever objective you give it. The current applications include prompt compression, search ranking, storefront theme optimization, and storage tuning. A product manager defines the objective. Tangent explores the solution space. Tangle handles the reproducibility and caching so Tangent's explorations don't re-run expensive computations.
Because Tangle caches all intermediate results, Tangent can run thousands of variations cheaply. Parakhin says this is making AutoML feel genuinely useful in a way it never did before large language models. The earlier wave of AutoML tools required structured, clean datasets and tightly defined search spaces. Tangent can work from a prose objective description and reason about the search space itself.
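Tangent's internals aren't public. As a rough sketch under stated assumptions, an auto-research loop of this shape reduces to propose, evaluate, keep the best, with the expensive evaluation memoized so revisited candidates cost nothing (the role Tangle plays). The one-parameter objective here is a toy, not anything Shopify described:

```python
import functools
import random

@functools.lru_cache(maxsize=None)     # stands in for Tangle-style caching
def evaluate(threshold):
    # Toy objective peaking at threshold = 0.6. A real objective might be a
    # simulated conversion rate or a search-ranking metric.
    return 1.0 - abs(threshold - 0.6)

def auto_search(trials=1000, seed=0):
    """Random search over one parameter; real systems explore richer
    spaces, but the propose-evaluate-keep-best loop shape is the same."""
    rng = random.Random(seed)
    candidates = (round(rng.random(), 2) for _ in range(trials))
    return max(candidates, key=evaluate)

best = auto_search()
# Revisited candidates hit the cache instead of re-running the evaluation.
```

Rounding candidates to two decimals makes repeats common, so most of the thousand evaluations are cache hits, which is the property that lets a Tangent-style system run thousands of variations cheaply.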
The limitation he acknowledges: auto-research still falls short on problems where the evaluation function is hard to define or where ground truth requires human judgment.
SimGym has the most public documentation of the three systems. Shopify opened it to all eligible merchants in AI research preview in March 2026. A detailed engineering post published in February 2026 describes the infrastructure.
The concept is simulated customers. SimGym runs AI agents that behave like real Shopify shoppers, each using Shopify's historical customer behavior data to model how buyers from specific categories actually move through storefronts. The goal is to let merchants test changes on their stores before shipping them to real customers, rather than running A/B tests that require live traffic.
The infrastructure behind that is substantial. Up to 2,000 concurrent Chromium browser sessions run through Browserbase, a cloud browser platform. Each simulated shopping session accumulates 89,000 to 127,000 tokens. The model powering the simulation is gpt-oss-120b, a 120-billion-parameter mixture-of-experts model. Running this at scale required Shopify to benchmark GPU hardware: the Nvidia H200 delivers 11,000 tokens per second per card; the Nvidia B200 delivers 57,000. That's a 5.2x throughput advantage. Shopify chose B200s for SimGym. A 20 percent reduction in average LLM latency cut cost per merchant simulation run by roughly 10 percent and increased daily throughput by 12 percent.
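The hardware figures can be sanity-checked with back-of-envelope arithmetic. This ignores batching, concurrency, and network overhead, so it bounds only raw token-generation time per session:

```python
# Figures from the article; everything derived is simple arithmetic.
h200_tok_per_s = 11_000
b200_tok_per_s = 57_000
session_tokens_high = 127_000          # upper end of the per-session range

speedup = b200_tok_per_s / h200_tok_per_s       # ~5.18, i.e. the cited 5.2x
print(f"B200 vs H200 throughput: {speedup:.1f}x")

# Pure generation time for the heaviest session on one card, no batching:
worst_case_s = session_tokens_high / b200_tok_per_s
print(f"Heaviest session on one B200: {worst_case_s:.1f} s of generation")
```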
Shopify contributed the performance improvements upstream. Two pull requests to vLLM and one to FlashInfer came out of the SimGym work. The moat Parakhin claims for SimGym is the data. Anyone can simulate a customer. Only Shopify has decades of real transaction history across millions of merchants to ground what simulated customers actually do.
The three systems are more interesting together than separately. Tangle caches SimGym's expensive simulation runs so they don't recompute when parameters haven't changed. Tangent can run optimization loops against SimGym's output, automatically searching for storefront configurations that convert best under simulated conditions. The combination gives Shopify a feedback loop: simulate behavior, cache the results, automatically search for improvements, simulate again.
That loop is what Parakhin means when he says the three systems become much more powerful when combined. Each system is individually useful. Together they create something that requires either building all three or going without: a self-improving system that uses historical data to simulate customer behavior, caches expensive computations, and automatically searches for optimizations. No off-the-shelf product does all three in an integrated way.
On the same day the podcast published, Liquid AI announced a multi-year partnership with Shopify to deploy its liquid foundation models in production. The first deployment is a text model for search that runs in under 20 milliseconds. Liquid AI, a Cambridge, Massachusetts-based AI company, builds models on a non-transformer architecture it calls liquid foundation models. The company claims its models achieve comparable benchmark performance to transformer-based alternatives with roughly 50 percent fewer parameters, running two to ten times faster. Those claims come from Liquid AI's own press release and have not been independently verified.
Parakhin describes Liquid's architecture as the first genuinely competitive non-transformer architecture he has used in practice. The coordination between the Liquid AI announcement and the Latent Space podcast is obvious: both went public April 22. This is standard PR sequencing, not independent validation. The production deployment and the benchmark claims warrant skepticism until independent testing appears.
At 40 million LLM calls a day, Shopify has more operational data on what works in production agent infrastructure than most companies will ever accumulate. The conclusions Parakhin draws from that data run against several common assumptions in the AI tooling market. IDE-based coding tools are not how serious engineering teams are organizing around AI at scale. Parallel agent architectures waste tokens. Better average code quality doesn't translate to fewer production bugs. The real work is the review layer, not the generation layer.
The systems Shopify built to address those problems are now public. Tangle is open source. SimGym is available to merchants. Tangent is described in enough detail to understand the architecture. Whether those systems remain a durable competitive advantage depends on how long it takes others to replicate them, and whether the moat Parakhin describes in SimGym — the historical transaction data — actually provides the behavioral grounding that differentiates its simulations from what anyone else could build with the same models.
That part he can't prove in a podcast. It would require running SimGym against real merchant conversion rates and seeing whether the simulations predicted what actually happened. That data, if it exists, hasn't been published.