The brief was ten lines long. Build a React and Tailwind web app that simulates a startup founder's month-by-month decisions, with stateful metrics, AI-generated monthly reports, and a win-or-lose ending. Hand it to two flagship AI coding agents released in 2026 and watch what each one ships. That is what Chinese tech outlet Leiphone did in a hands-on review this month, and the result is a cleaner window into the state of AI coding than any benchmark chart.
MiniMax's M3 finished the build in roughly 11 minutes. Anthropic's Sonnet 4.6 took about 19. The eight-minute gap is less interesting than what Sonnet 4.6 did with the extra time.
M3 stayed inside the brief. The app booted, the monthly loop ran, the AI-generated reports were coherent, and the win-or-lose state resolved. Nothing in the output tried to surprise the prompter. Sonnet 4.6 did the same job and added a layer of its own: random events that could wipe out a month's progress, a difficulty curve the brief never asked for, and richer game balance intended to make the simulation feel like a real product. Both deliverables were working software. They were different jobs.
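To make the difference concrete, here is a minimal sketch of the kind of monthly loop the brief describes, with the unrequested random-event layer as an optional switch. It is written in Python for brevity and shows the game logic only; it is not the React code either model produced. Every name and number in it (`Company`, `DECISIONS`, the 1,000-user win threshold, the starting cash) is a hypothetical chosen for illustration, not a detail from the Leiphone review.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Company:
    """Stateful metrics carried from month to month (hypothetical fields)."""
    month: int = 0
    cash: int = 100_000
    users: int = 0


# Hypothetical decisions: name -> (cash spent, users gained).
DECISIONS = {
    "hire": (30_000, 400),
    "market": (20_000, 250),
    "save": (5_000, 50),
}


def outcome(state):
    """Win-or-lose check: running out of cash loses, 1,000 users wins."""
    if state.cash <= 0:
        return "lost"
    if state.users >= 1_000:
        return "won"
    return "running"


def step(state, decision, rng=None, event_chance=0.0):
    """Advance one month. With rng=None this is the brief-only version."""
    cost, growth = DECISIONS[decision]
    nxt = replace(
        state,
        month=state.month + 1,
        cash=state.cash - cost,
        users=state.users + growth,
    )
    # The extra layer: an occasional shock that wipes out the month's
    # user growth while the money stays spent.
    if rng is not None and rng.random() < event_chance:
        nxt = replace(nxt, users=state.users)
    return nxt


def run(decisions, rng=None, event_chance=0.0):
    """Play the decisions in order until the game resolves or they run out."""
    state = Company()
    for decision in decisions:
        if outcome(state) != "running":
            break
        state = step(state, decision, rng, event_chance)
    return state


# Brief-only run versus a run where every month is hit by a random event.
brief_only = run(["hire", "hire", "market"])
shocked = run(["hire"] * 4, rng=random.Random(0), event_chance=1.0)
```

In this sketch the brief-only version is a pure function of the player's decisions, while the extended version needs a random source and a tuning parameter. That is the trade in miniature: the second design is a richer game, and it is also more surface area that nobody asked for.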
This is not a quality ranking. Both models produced code that ran. The Leiphone test is a single build by a single reviewer, and it cannot settle which model is "better" at coding any more than one diner's review of two restaurants can settle which kitchen cooks better. What it does show is something the benchmark wars have obscured: the next phase of AI coding competition is not about who scores highest on a leaderboard. It is about which philosophy of AI collaboration matches how a given engineer or team actually briefs work.
MiniMax's positioning of M3 is execution-first. The model is the company's flagship for coding and agent work, and the architecture is built around long, autonomous tasks rather than quick chat exchanges. The headline technical bet is the 1M-token context window, paired with MSA (MiniMax Sparse Attention), an attention method documented in an arXiv preprint and accompanied by open-source CUDA kernels. In plain English: the model can hold roughly 750,000 English words of working memory (at the common rule of thumb of about 0.75 words per token) and attend to it selectively, so it does not have to re-read the entire context every time it reasons. That makes day-long autonomous work tractable in a way that shorter-context models struggle to match.
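For readers who want to see what "paying attention selectively" means mechanically, the sketch below contrasts ordinary dense attention with a generic top-k sparse variant in plain Python. This is not MSA: the preprint's actual selection mechanism is not described in this article, so top-k selection is swapped in here as the simplest illustration of the general idea. The function names and the toy vectors are assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_attention(query, keys, values):
    """Standard attention: every position in the context contributes."""
    weights = softmax([dot(query, key) for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def top_k_sparse_attention(query, keys, values, k):
    """Generic top-k sparse attention (illustrative, not MSA).

    Only the k best-matching positions are kept; the rest are dropped
    before the softmax and never touch the output. This toy still scores
    every key to pick the top k, a step that a practical method would
    need to make cheaper.
    """
    scores = [dot(query, key) for key in keys]
    kept = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in kept])
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, kept)) for d in range(dim)
    ]
    return output, sorted(kept)


# Toy context of four positions; the query matches positions 0 and 2 best.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]

sparse_out, kept_positions = top_k_sparse_attention(query, keys, values, k=2)
dense_out = dense_attention(query, keys, values)
```

With `k=2` only positions 0 and 2 feed the output, and with `k` equal to the context length the sparse version reduces to dense attention. The appeal at a 1M-token scale is that the expensive weighted sum runs over a small selected subset rather than the whole context.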
MiniMax showed two official demos to make the long-task case concrete. In one, M3 reportedly reproduced the implementation of a paper from ICLR, a top machine-learning research conference, in about 12 hours without human intervention. In another, it optimized a CUDA kernel, the kind of low-level GPU code that normally takes human engineers days to tune, across 147 iterations in roughly 24 hours. (VentureBeat reports these as official demos; TechTimes flags the wider benchmark claims as unverified.) Both demos are press-release evidence, not third-party reproductions. Treat them as the company telling you what the architecture is for, not as proof it works that way on every workload.
Sonnet 4.6's positioning is collaborator-first. The flagship from Anthropic is built around the assumption that a senior engineer is in the loop and wants a model that improvises, asks, and proposes improvements. The 11-versus-19-minute result is what that philosophy looks like under a tight brief: it costs time, and it produces code the prompt did not ask for. For some teams, that is the entire point. For others, it is a reason not to use the tool.
There is a third leg to the M3 release that almost got buried under the benchmark headlines, and Leiphone flags it as the most underrated part. M3 has image-to-code: feed it a screenshot or a wireframe and it produces runnable code, not just a description of what the screenshot depicts. In a year when most coding-agent demos still start from a text prompt, that is a meaningful shift in the unit of work a model can accept. It pairs naturally with the long-context story: a designer hands the model a Figma export, and the model holds the image and the surrounding code in working memory while it builds.
The price point matters if M3 is going to displace incumbents. MiniMax positions M3 at roughly 5 to 10 percent of the per-token cost of GPT-5.5 and Gemini 3.1 Pro on comparable tasks, per VentureBeat's reporting on the launch, and MiniMax's public pay-as-you-go pricing page lists rates well below the frontier incumbents. TechTimes explicitly flags the benchmark parity claims as unverified; the price gap is the part that has held up under more independent scrutiny, though it shifts with rate-card changes.
What to watch next is whether the obedient-executor style travels. M3's long-context, sparse-attention architecture is a real bet that a large share of AI coding work will be autonomous, multi-day, and judged on whether the deliverable matches the brief. If that bet pays off, the M3-versus-Sonnet-4.6 split is not a quirk of two competing products. It is a fork in the road for what AI coding agents are for.