A Yardstick for Tokenisers, With a Certificate Attached

A Yardstick for Tokenisers, With a Certificate Attached — type0 | type0

PREVIEWA Yardstick for Tokenisers, With a Certificate Attached · MD

Tokenisation has been the silent scaffolding under every major language model, built for years on greedy shortcuts. A new preprint from researchers at ETH Zurich and Kensho Technologies asks a different question: instead of trying a slightly better shortcut, why not solve the whole problem as a convex optimisation?

The paper, Tokenisation via Convex Relaxations (arXiv:2605.22821), introduces ConvexTok, a vocabulary built by formulating tokeniser construction as a linear program (LP) relaxation of a flow-conservation integer program. The underlying combinatorial problem is provably NP-hard, and the authors solve the relaxed version with off-the-shelf convex optimisation. The result is a vocabulary chosen under a principled objective rather than a series of local greedy merges.

The methodological contribution is real, but the more useful deliverable is something the field has never had: a certifiable lower bound on compression. Solving the LP yields a provable floor below which no tokeniser on a given dataset can fall. At the vocabulary sizes typically used in production, the authors report that both BPE and ConvexTok land within 1% of the LP's lower bound, not within 1% of the true NP-hard optimum, which remains intractable to compute. The point is a measurement, not a victory. It means greedy tokenisation was closer to optimal than the field had any way to know, and it gives practitioners, for the first time, a yardstick to grade future vocabulary work against.

The empirical pattern in the paper is honest, and it is the part of the story most likely to be flattened by the usual "new algorithm beats BPE" framing. On intrinsic metrics (compression, vocabulary utilisation, Rényi entropy) ConvexTok improves over BPE consistently. On bits-per-byte (BpB), the language-modelling loss proxy that ties tokenisation to actual model behaviour, the gains are also consistent, and the deterministic rounding scheme is the strongest performer.

Downstream task results are a different story. The authors describe gains on the CORE benchmark as "less consistent" and "often (although not always)" present. A reader who treats this as a swap-your-tokeniser moment will over-promise. A reader who treats it as a measurement of where the current method falls short of a principled bound will read the paper the way the authors wrote it.

That gap, consistent gains on the metrics that describe tokenisation and mixed results on the metrics that describe tasks, is the most useful thing the paper puts on the table. It says something true about the field: optimising the tokeniser solves more of the problem than the consensus assumed, but it does not solve all of it. The rest lives in the model, the data, and the objective the model is trained on.

The mechanism is worth a paragraph. The paper models tokenisation as a directed acyclic graph over byte-strings, with free and priced edges, and a flow-conservation integer program that decides which edges survive into the vocabulary. The LP relaxation is solvable in polynomial time; a separate rounding step converts the fractional solution into a usable vocabulary. The authors ship three rounding schemes and find that they trade off differently. Bias rounding is strongest on intrinsic metrics. Deterministic rounding is strongest on BpB. The choice is not free, and the paper is clear about that.

There are real caveats. The work is an arXiv preprint, v1 only, submitted 21 May 2026, and it has not been peer reviewed. The authors themselves find that BPE is more stable across training-data choices than ConvexTok, which is a non-trivial concern for anyone considering a swap. There is no evidence yet of adoption in GPT, Claude, Llama, Qwen, or any major open-source tokeniser library. The accompanying Moonlight review restates the technical contribution but is not an independent empirical replication.

So what is this paper, taken seriously? It is the first credible attempt to take tokenisation off the list of problems the field treats as solved by convention. The lower bound is the part that will outlast any single benchmark number. Once the community can measure how close a vocabulary is to optimal, every future tokenizer paper can be graded against that bar, and the conversation can move from "does this beat BPE on CORE" to "does this beat the LP's floor, and by how much."

The next moves to watch are predictable. Will follow-up work tighten the gap between the LP bound and the achieved vocabulary, or expand the set of objectives the LP can encode? Will the major open-source tokeniser libraries adopt a lower-bound check as a sanity test before shipping a vocabulary? And will the field eventually treat the "intrinsic up, downstream flat" pattern as evidence that tokenisation was never the binding constraint on task performance in the first place?

For now, the most honest read is the smallest one. Tokenisation has a new tool. It is not a finished replacement for BPE. The yardstick is the contribution.

A Yardstick for Tokenisers, With a Certificate Attached

Sources