When 'Intractable' Isn't: Solving Optimal Tokenization with ILP

When 'Intractable' Isn't: Solving Optimal Tokenization with ILP — type0 | type0

PREVIEWWhen 'Intractable' Isn't: Solving Optimal Tokenization with ILP · MD

For decades, operations researchers have used integer linear programming and cutting-plane methods to chip away at combinatorial problems that look impossibly hard, most famously the Traveling Salesman Problem. The same family of techniques has now been turned on tokenizers, the components that break text into pieces large language models can consume. A. Nichol's Finding Optimal Tokenizers post shows that an optimal tokenizer can be computed in practice, even though the underlying optimization is theoretically intractable.

The catch, and the reason this matters, is that "optimal" turns out to mean less than the word suggests.

The mechanism Nichol uses mirrors TSP-style cutting-plane methods: solve a relaxed version of the problem, find the constraint the relaxation is violating, add it, and repeat. For tokenizers, the analog is treating vocabulary assignment as an integer program whose objective is to compress the training corpus as much as possible given a vocabulary size. The result is a tokenizer that provably maximizes training-data compression for that vocabulary. It is a real solution to a real problem, and it generalizes a technique that has had a productive history with TSP, vehicle routing, and scheduling.

The surprise is in the numbers. Nichol's own measurements put state-of-the-art tokenizers within roughly 1% of the computed optimum on training data. That is not a gap worth chasing with a swap. If anything, it suggests that the heuristic BPE, WordPiece, and SentencePiece tokenizers that have shipped in production for years were already operating near a ceiling, at least on the training-compression axis. The question of whether a tokenizer that is optimal on training data generalizes better to held-out data, the only question that would actually move production systems, remains open. Nichol flags it explicitly: a tokenizer that compresses training data best may not be the one a model prefers at inference.

There is also a cheaper fix available. Inefficient tokenization can be largely absorbed by enlarging the vocabulary, which adds parameters and memory but does not require a new algorithm. The result is that the practical value of "optimal" tokenization looks small, even when the methodological result is real.

The interesting thread is what this opens up. Nichol describes the work as a hobby project, prompted by a community discussion and worked on for fun, and frames it explicitly as an invitation. If a cutting-plane formulation can settle tokenization in practice, the same template can be revisited on other LLM-infrastructure subproblems that the field has written off as intractable: prompt compression, batch construction, attention pattern selection, even some forms of data ordering. The methodological payoff is not a new tokenizer. It is a class of problems now worth re-examining with tools the field largely ignored.

There is also a more pointed research problem sitting in the open. Nichol's method finds a tokenizer optimal for training-data compression, but the LLM objective is held-out generalization. A tokenizer that minimizes cross-entropy on a held-out evaluation set, rather than compression on a training set, would be a different optimization, and probably a different answer. The author names this as the next step. Until that step is taken, "optimal" remains qualified: a real mathematical optimum on a goal the model itself does not share.

The result is a useful kind of paper: one that mostly validates the status quo, opens a methodological door, and names a sharper problem than the one it solved. Production teams do not need to swap tokenizers. Researchers do have a new tool, and a more concrete question to ask.

When 'Intractable' Isn't: Solving Optimal Tokenization with ILP

Sources