AI is getting more efficient. It is also getting more memory-hungry.

No GPU generation has ever shipped with less onboard memory than the last. Every byte saved by software has, within a year or two, reappeared as longer context windows, larger batch sizes, and agent workloads that burn millions of tokens each. That pattern is the simplest reason the next wall in AI is not compute. It is memory.

The memory in question is HBM, short for High Bandwidth Memory. It is the stacked DRAM bolted directly to AI accelerators so that a model can read and write weights and intermediate activations at the speed the silicon demands. Unlike regular DRAM or storage, HBM is wired to the accelerator through a wide parallel interface, and its contents cannot be paged out to cheaper memory without inference latency ballooning. That is why HBM demand and accelerator demand are essentially the same problem in two different units.

One independent analyst has been stress-testing this constraint to its limit. Markos, who runs the AAIG HBM Research Engine, published Entry 188 on July 2, 2026 (the full thread). The model assumes five of six supply constraints bind at once, ranging from leading-edge yield and advanced packaging to substrate supply, cleanroom capacity, and customer allocations. Under that simultaneous-failure reading, the model projects an HBM shortage every year through 2028. The headline figures from the source thread: 2026 demand around 4.8 exabytes against roughly 3.6 exabytes of shipped supply; 2027 demand around 6.7 exabytes against 5.6 exabytes of supply; and 2028 demand around 9.8 exabytes against 7.2 exabytes of supply, with associated dollar pools of roughly 54 billion, 114 billion, and 164 billion respectively.
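The size of each year's gap follows directly from the quoted figures. The sketch below only subtracts supply from demand and expresses the result as a share of demand. The dictionary and function names are illustrative, and the inputs are the article's rounded numbers, not the source model's internals.

```python
# Figures as quoted in the article, in exabytes (EB): year -> (demand, supply).
QUOTED_EB = {
    2026: (4.8, 3.6),
    2027: (6.7, 5.6),
    2028: (9.8, 7.2),
}


def shortfall(demand_eb: float, supply_eb: float) -> tuple[float, float]:
    """Return (gap in EB, gap as a fraction of demand)."""
    gap = demand_eb - supply_eb
    return gap, gap / demand_eb


for year, (demand, supply) in QUOTED_EB.items():
    gap, frac = shortfall(demand, supply)
    print(f"{year}: short {gap:.1f} EB ({frac:.0%} of demand)")
```

On these rounded inputs the absolute gap is about 1.2 EB in 2026, 1.1 EB in 2027, and 2.6 EB in 2028, so the shortfall dips slightly in 2027 before more than doubling in 2028.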

A third-party Grok summary circulating alongside the thread (the Grok restatement) restates the same figures in plain language. That summary is a translation, not an independent check.

The gap widens even as every new model release cuts memory per token. AI token traffic is on track to multiply four to seven times per year; Google publicly confirmed a 7x trajectory at I/O, and other large labs report similar figures. The same analysis bakes in efficiency gains of roughly 30 percent per year from techniques like Multi-head Latent Attention (MLA), FP8 and FP4 quantization, and sparse attention patterns. MLA alone trims the KV-cache, the working memory a model keeps mid-generation, by roughly 3x to 5x. The saved bytes, the source argues, are immediately reabsorbed by longer prompts, larger batches, and agent workflows (see the source's compilation of the efficiency claims).
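To make the KV-cache figure concrete, here is a minimal sizing sketch. The formula is the usual one for standard attention: two vectors (key and value) per token, per layer, per KV head. The model shape, batch size, and context length are hypothetical values chosen for illustration. Dividing by 3 or 5 simply applies the range quoted above; it does not model how MLA actually compresses the cache.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: int) -> int:
    """Size of a standard attention KV-cache.

    Each token stores one key and one value vector (the factor of 2)
    per layer and per KV head, for every sequence in the batch.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical model and workload shape, chosen only for illustration.
SHAPE = dict(layers=80, kv_heads=8, head_dim=128, batch=8,
             bytes_per_value=2)  # 2 bytes per value = FP16
CONTEXT = 128_000  # tokens per sequence (assumed)

baseline = kv_cache_bytes(seq_len=CONTEXT, **SHAPE)
print(f"baseline KV-cache: {baseline / 1e9:.1f} GB")

for reduction in (3, 5):  # the 3x to 5x range quoted in the article
    print(f"{reduction}x smaller: {baseline / reduction / 1e9:.1f} GB, "
          f"or {CONTEXT * reduction:,} tokens of context in the original budget")
```

The last line of output is the reabsorption argument in miniature: a 3x to 5x saving can be banked as freed memory, or spent on 3x to 5x more context within the same footprint.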

That is the Jevons paradox at the memory layer. The source frames it this way: efficiency lowers the cost per token, which lowers the cost per useful task, which expands the set of tasks worth running, which adds tokens back into the system. Every major efficiency gain in AI has historically coincided with higher total memory consumption, not lower. To flip 2027 and 2028 to surplus, per the source, incremental efficiency on top of the 30 percent baseline would have to land at roughly 25 percent a year. The measured incremental rate, by the same source, is closer to 5 percent a year and is decelerating. The constraint holds.
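The direction of that argument can be checked with a toy calculation. The sketch below assumes total memory demand scales linearly with token traffic and that an efficiency gain cuts bytes per token by the stated fraction. That is a simplifying assumption made here, not the source's model, and it will not reproduce the source's exabyte figures or its 25 percent threshold. Both function names are illustrative.

```python
def net_memory_multiplier(traffic_growth: float, efficiency_gain: float) -> float:
    """Year-over-year multiplier on total memory demand in the toy model:
    tokens grow by `traffic_growth`, bytes per token shrink by `efficiency_gain`."""
    return traffic_growth * (1.0 - efficiency_gain)


def breakeven_efficiency(traffic_growth: float) -> float:
    """Per-token efficiency gain needed to hold total memory demand flat."""
    return 1.0 - 1.0 / traffic_growth


BASELINE_EFFICIENCY = 0.30  # per-year figure quoted in the article

for growth in (4.0, 7.0):  # the 4x to 7x traffic range quoted in the article
    net = net_memory_multiplier(growth, BASELINE_EFFICIENCY)
    need = breakeven_efficiency(growth)
    print(f"traffic {growth:.0f}x, efficiency 30%: demand x{net:.1f}; "
          f"flat demand would need {need:.0%} per year")
```

Under these assumptions a 30 percent annual efficiency gain leaves demand growing whenever traffic more than roughly 1.43x per year, which is why the efficiency gains and the rising memory footprint can coexist.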

The other reason it holds is physical. A model mid-generation cannot keep its hot working set anywhere but the accelerator's local HBM. The same framework puts the share of the KV-cache that must physically live on HBM at a floor of about 65 to 70 percent for usable inference. Below that floor, decoding latency spikes and throughput collapses. As models scale, that floor rises, not falls. Several measures help widen the runway before the next memory generation has to absorb new demand: memory tiering, CXL-style pooling of memory across machines, and cheaper DRAM on certain accelerator variants. That last group includes NVIDIA's planned Rubin CPX, which uses GDDR7 for the latency-tolerant prefill half of an inference call. None of them replaces the hot set.
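The floor also caps how much any tiering scheme can help. A minimal sketch, assuming the floor applies as a simple fraction of the total KV-cache: if at least 65 to 70 percent must sit in HBM, the largest servable cache is the free HBM divided by that fraction. The accelerator capacity and weight footprint below are hypothetical placeholders, not figures from the source.

```python
def max_total_kv_gb(hbm_free_gb: float, hbm_floor: float) -> float:
    """Largest total KV-cache (GB) that can be served when at least
    `hbm_floor` of it must reside in HBM and the rest is offloaded."""
    if not 0.0 < hbm_floor <= 1.0:
        raise ValueError("hbm_floor must be in (0, 1]")
    return hbm_free_gb / hbm_floor


HBM_CAPACITY_GB = 192.0  # hypothetical accelerator capacity
WEIGHTS_GB = 70.0        # hypothetical model weight footprint
hbm_free = HBM_CAPACITY_GB - WEIGHTS_GB

for floor in (0.65, 0.70):  # the floor range quoted in the article
    total = max_total_kv_gb(hbm_free, floor)
    print(f"floor {floor:.0%}: up to {total:.1f} GB of KV-cache, "
          f"{total / hbm_free:.2f}x what HBM alone would hold")
```

With a 65 to 70 percent floor, offloading stretches the usable cache by at most about 1.4x to 1.5x, which is consistent with the article's point that tiering widens the runway without removing the constraint.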

The supply side has its own pressure points. Samsung is shipping HBM4 at 11.7 Gb/s into the Vera Rubin platform with yields around 60 percent, chasing 80 percent by year-end. SK Hynix is in mass production of HBM4 at 10.6 Gb/s with commercial shipments underway. Micron is positioned in the Rubin CPX inference tier. The next fault line to watch is the move to 16-high stacks, where the engineering and yield math gets harder (see the source's supply track).

Two things temper how much weight to put on those numbers. First, the projections come from one author's research engine, not from vendor guidance or hyperscaler capex disclosures. Any reader using the dollar pools for procurement should wait for memory-maker bit-shipment data and customer commentary to confirm direction. Second, the model's stress-test framing is intentionally adversarial. Five of six constraints binding at once is not the base case any operator is budgeting for. The base case is looser. The structural read is that even the looser cases keep memory on the hot path.

That is the part planners and procurement teams should not lose. Token traffic is still compounding at multi-x per year. Software keeps improving at measurable but decelerating rates. HBM capacity rises on a fixed generation cadence from three suppliers. None of those inputs is independently negotiable, and the gap between them is widening, not closing. The watch items over the next eighteen months: Samsung's HBM4 yield curve, the timing of any 16-high announcement, hyperscaler commentary on token traffic, and any tiering or pooling architecture that demonstrably shifts working set off HBM at production scale.

AI is getting more efficient. It is also getting more memory-hungry.

Sources