A startup says it broke through a math bottleneck in LLMs. Independent tests partly back the claim.


Subquadratic walked out of stealth last month with a pitch that landed somewhere between moonshot and mirage. The Miami-based company said it had cracked a mathematical bottleneck that has constrained large language models (LLMs) for almost a decade. The pitch came with limited evidence, and the response split predictably: either this was another Theranos, or it was a real breakthrough.

One month later, Subquadratic has produced something its critics asked for: independent benchmark results. MIT Technology Review reports that Appen, a third-party AI evaluator that runs benchmarks for rival AI companies, has now tested SubQ, the startup's new model. The findings land in an uncomfortable middle: they partially validate the architecture and partially fall short of the original marketing.

That gap is the news. It is also a pattern worth learning to read.

The original claim, as reported by MIT Technology Review, was specific. SubQ processes up to 12 times as much text at once as most competing models, the company said. It runs faster, costs less, and uses substantially less energy. On key coding benchmarks, it roughly matches Google DeepMind, OpenAI, and Anthropic. The bottleneck, Subquadratic said, is mathematical and roughly a decade old in LLM research. In plain terms, the company is targeting the underlying machinery of how transformer-based language models handle long inputs: the attention mechanism that determines how much context a model can hold at once and how much compute that costs.
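For readers who want the arithmetic behind that cost: in a standard transformer, full self-attention compares every token with every other token, so the number of pairwise scores grows with the square of the input length. The sketch below only illustrates that textbook scaling. The function names and the 100,000-token baseline are illustrative assumptions, and nothing here models how SubQ itself works.

```python
def attention_score_count(context_tokens: int) -> int:
    """Pairwise query-key scores computed by full self-attention
    for one head in one layer (every token attends to every token)."""
    return context_tokens * context_tokens


def cost_ratio(base_tokens: int, scale: int) -> float:
    """How many times more scores are needed when the context grows by `scale`."""
    return attention_score_count(base_tokens * scale) / attention_score_count(base_tokens)


# Hypothetical baseline context length, chosen only for illustration.
BASE_TOKENS = 100_000

for scale in (1, 2, 12):
    ratio = cost_ratio(BASE_TOKENS, scale)
    print(f"{scale:>2}x the text -> {ratio:.0f}x the attention scores")
```

Under this assumption, a model reading 12 times as much text does 144 times as much attention work, which is why approaches that grow more slowly than the square of the input length attract so much interest.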

The first round of skepticism was about disclosure. Subquadratic shared self-published test scores and little else. SubQ's weights are not public, and the company has not released a hands-on demo. Without independent replication, an extraordinary claim is just a claim.

Appen's evaluation does not close that gap. It narrows it. According to MIT Technology Review, the independent results suggest the architecture merits attention, and Appen's director described seeing the numbers as "really exciting," saying it "validated their architecture." That reaction, attributed to Appen rather than the company, is the most credible signal in the story so far. It is not the same as peer review. It is not the same as a production system serving real users at scale. It is, however, more than the company had a month ago.

What the benchmarks do and do not show matters. On the dimensions Appen tested, SubQ appears to hold up against Subquadratic's original framing of speed and efficiency. On other dimensions, the marketing exceeded what the independent tests could confirm. For a reader, the practical lesson is that an LLM announcement can be partly true and still not shippable. Partial validation is closer to "interesting" than to "ready."

The wider context is that Subquadratic is not the only group chasing cheaper, longer-context attention. Frontier labs and academic teams have spent years attacking the same scaling problem. If Subquadratic's approach works, the question is not just whether one Miami startup ships a faster model, but whether the underlying math holds up under pressure that no benchmark can simulate: real users, real latency budgets, real adversarial inputs, real cost at scale.

Subquadratic's chief technology officer has acknowledged the rollout mistake, telling MIT Technology Review that benchmarks should have shipped first. That admission is useful color, and a useful tell. The order in which a company presents evidence is itself evidence. A company that has the third-party numbers tends to lead with them. Subquadratic, by its own telling, did not.

For now, the honest read is that SubQ is a real architecture with real benchmark support from a real evaluator, and a long way from the kind of trust that comes from public weights, public demos, and replicated third-party tests. Theranos or breakthrough is the wrong binary. The right framing is closer to: what would it take to know which one this is, and is the company willing to provide it?

That is the portable test. Extraordinary claims in AI deserve independent benchmarks. Independent benchmarks are not the same as production trust. Production trust is not the same as peer review. A reader who carries that ladder into the next SubQ-shaped announcement will be better equipped than one who only remembers this one.

