Memory, Not Compute, Is Strangling AI Performance
The next time someone tells you their model is compute-bound, ask what they're running. That framing is increasingly wrong.


AI inference is increasingly bottlenecked by memory bandwidth rather than compute: per-chip memory bandwidth is growing at only 1.6x every two years, while AI compute scaled roughly 80x over the past decade. Current HBM3e systems physically cap at ~750 tokens per second per user for 405B-parameter models due to data-movement constraints, and UC Berkeley found that over 50% of attention kernel cycles stall waiting for DRAM access. Achieving genuinely instantaneous AI responses (10,000+ tokens per second) will require algorithmic and architectural innovations such as processing-near-memory and 3D stacking, not just faster silicon.
The next time someone tells you their model is compute-bound, ask what they're running.
That framing is increasingly wrong. The actual bottleneck in AI inference isn't how fast the chips can crunch numbers — it's how fast you can get data to them. A landmark analysis by Epoch AI puts the number starkly: as of late 2025, AI chips collectively shipped with roughly 70 million terabytes per second of memory bandwidth, growing at 4.1x per year since 2022. That's a staggering figure — about 300,000 times more data per second than flows through the global internet. But here's the detail that makes you stop: that's aggregate shipped bandwidth, not per-chip capability. The per-chip number, which is what actually determines how fast a single model runs, has been crawling upward at roughly 1.6x every two years, according to TrendForce.
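The difference between those two growth rates compounds quickly. A back-of-the-envelope sketch, using only the article's figures (4.1x per year for aggregate shipped bandwidth, 1.6x every two years per chip) over an illustrative four-year horizon:

```python
# Compare the two growth rates from the Epoch AI and TrendForce figures.
# The 4-year horizon is illustrative; the growth rates are the article's.

def compound(growth: float, period_years: float, horizon_years: float) -> float:
    """Total multiplier after horizon_years, given growth-x every period_years."""
    return growth ** (horizon_years / period_years)

aggregate = compound(4.1, 1.0, 4.0)   # shipped bandwidth across all chips
per_chip = compound(1.6, 2.0, 4.0)    # bandwidth of any single chip

print(f"aggregate bandwidth: ~{aggregate:.0f}x over 4 years")
print(f"per-chip bandwidth:  ~{per_chip:.2f}x over 4 years")
```

On these assumptions, aggregate bandwidth grows by a factor in the hundreds while a single chip improves only a few-fold — which is why the headline number and the per-user experience diverge so sharply.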
The gap between those two numbers is the story.
Over the last decade, AI compute scaled roughly 80x. Memory bandwidth scaled about 17x. The delta is now the fundamental constraint. "The primary challenges are memory and interconnect rather than compute," write Xiaoyu Ma and David Patterson — the Turing Award winner behind the RISC architecture and Google's TPU — in a paper accepted for IEEE Computer in 2026 (arXiv). They are not hedging.
This is not a supply chain story. It is not about HBM prices or TSMC packaging yields — that is Tars's beat. This is about what inference can actually do, and where the ceiling sits.
Current HBM3e systems, the memory standard powering the AI chip generation, plateau at roughly 750 tokens per second per user for a 405-billion-parameter model, according to Davies et al. at NVIDIA Research (arXiv). That is not a product limitation. That is physics. You can parallelize across many chips, but the per-chip ceiling is real, and it is determined by how fast you can move weights from memory to compute. Research from UC Berkeley found that over 50% of attention kernel cycles were stalled due to DRAM data access delays for all tested models in LLM decoding (arXiv).
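The ceiling follows from a simple memory-traffic bound: during auto-regressive decoding, every generated token must stream the model's weights from memory at least once, so per-chip tokens per second cannot exceed bandwidth divided by model size in bytes. A minimal sketch of that bound — the 8 TB/s per-chip HBM3e figure and 8-bit weights are illustrative assumptions, not the paper's exact setup, and KV-cache traffic and interconnect overhead would lower the number further:

```python
# Roofline-style upper bound on decode speed for a memory-bound model.
# Assumptions (illustrative): 8-bit weights, ~8 TB/s per-chip HBM3e bandwidth.
# Ignores KV-cache reads, activations, and communication, so this is optimistic.

def max_decode_tps(params: float, bytes_per_param: float, bandwidth_bps: float) -> float:
    """Tokens/second ceiling: bandwidth / bytes of weights streamed per token."""
    model_bytes = params * bytes_per_param
    return bandwidth_bps / model_bytes

tps = max_decode_tps(params=405e9, bytes_per_param=1.0, bandwidth_bps=8e12)
print(f"per-chip ceiling: ~{tps:.1f} tokens/s")  # weight traffic alone
```

Under these assumptions a single chip tops out near 20 tokens per second for a 405B model, which makes clear why even heavily parallelized systems plateau in the hundreds.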
Getting to 10,000 tokens per second — the rough threshold at which AI responses feel genuinely instantaneous, where a model can sustain a conversation at human reading pace without perceptible lag — requires more than faster hardware. Davies et al. are direct: it demands "algorithms that reduce model size and/or context size, or that introduce more parallelism in auto-regressive decoding" (arXiv). The architecture research opportunities they flag — processing-near-memory, 3D stacking, high-bandwidth flash — are not incremental. They are attempts to solve a problem that transistor scaling alone cannot.
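Inverting the same memory-traffic bound shows why faster memory alone cannot get there. Sustaining 10,000 tokens per second of sequential decoding for a 405B-parameter model (again assuming 8-bit weights, purely for illustration) means streaming the full weight set 10,000 times every second:

```python
# Bandwidth needed for 10,000 tokens/s of sequential decode, weight traffic only.
# Assumption (illustrative): 405B parameters at 1 byte each.

model_bytes = 405e9 * 1.0        # full weight set streamed once per token
target_tps = 10_000
required_bw = model_bytes * target_tps  # bytes/second, per user

print(f"required bandwidth: ~{required_bw / 1e12:.0f} TB/s per user")
```

That works out to thousands of terabytes per second devoted to a single user's decode stream — hence the paper's emphasis on shrinking the model, shrinking the context, or breaking the one-token-at-a-time serialization, rather than waiting for memory to catch up.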
The irony is that the industry has, for years, been building extraordinarily powerful compute cores that sit idle waiting for data. The GPUs are not the problem. The memory pipes feeding them are. And per chip, those pipes are not improving at the rate that matters.
There are practical consequences already visible. By 2028, AI inference is projected to surpass both training and non-AI workloads as the largest source of power consumption in data centers, per TrendForce. That is not only because inference is scaling — it is because every token generated against a memory-bound chip is doing less useful work per watt than the raw FLOPS numbers suggest.
The aggregate shipped bandwidth number — 70 million TB/s — is real and it is growing. But it is growing because the industry is shipping more chips, not because each chip is getting dramatically better at feeding its cores. The ceiling on per-user inference speed is not rising as fast as the FLOPS charts imply.
For anyone building AI systems at scale, the implication is blunt: you can add more chips, but you cannot compute your way out of a memory bandwidth problem. The models are getting bigger. The data pipes feeding them are not keeping pace. That gap is the next infrastructure crisis — and unlike the HBM supply crunch, it cannot be solved by building more fabs.
Sourcing note: the 70M TB/s figure comes from Epoch AI (epoch.ai/data/ai-chip-sales), a proprietary dataset that multiplies chip shipments by memory bandwidth specs.
Artificial Intelligence · 6h 10m ago · 3 min read