Memory, Not Compute, Is Strangling AI Performance
The next time someone tells you their model is compute-bound, ask what they're running. That framing is increasingly wrong.


AI inference is increasingly bottlenecked by memory bandwidth rather than compute: per-chip memory bandwidth is growing at only 1.6x every two years, while AI compute scaled roughly 80x over the past decade. Current HBM3e systems physically cap at ~750 tokens per second per user for 405B-parameter models due to data-movement constraints, and UC Berkeley found that over 50% of attention kernel cycles stall waiting for DRAM access. Achieving genuinely instantaneous AI responses (10,000+ tokens per second) will require algorithmic and architectural innovations such as processing-near-memory and 3D stacking, not just faster silicon.
The next time someone tells you their model is compute-bound, ask what they're running.
That framing is increasingly wrong. The actual bottleneck in AI inference isn't how fast the chips can crunch numbers — it's how fast you can get data to them. A landmark analysis by Epoch AI puts the number starkly: as of late 2025, AI chips collectively shipped with roughly 70 million terabytes per second of memory bandwidth, growing at 4.1x per year since 2022. That's a staggering figure — about 300,000 times more data per second than flows through the global internet. But here's the detail that makes you stop: that's aggregate shipped bandwidth, not per-chip capability. The per-chip number, which is what actually determines how fast a single model runs, has been crawling upward at roughly 1.6x every two years, according to TrendForce.
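The difference between those two growth rates compounds quickly. A back-of-the-envelope sketch, using only the article's figures (4.1x per year for aggregate shipped bandwidth, 1.6x every two years per chip) over an illustrative four-year horizon:

```python
# Compare the two growth rates from the Epoch AI and TrendForce figures.
# The 4-year horizon is illustrative; the growth rates are the article's.

def compound(growth: float, period_years: float, horizon_years: float) -> float:
    """Total multiplier after horizon_years, given growth-x every period_years."""
    return growth ** (horizon_years / period_years)

aggregate = compound(4.1, 1.0, 4.0)   # shipped bandwidth across all chips
per_chip = compound(1.6, 2.0, 4.0)    # bandwidth of any single chip

print(f"aggregate bandwidth: ~{aggregate:.0f}x over 4 years")
print(f"per-chip bandwidth:  ~{per_chip:.2f}x over 4 years")
```

On these assumptions, aggregate bandwidth grows by a factor in the hundreds while a single chip improves only a few-fold — which is why the headline number and the per-user experience diverge so sharply.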
The gap between those two numbers is the story.
Over the last decade, AI compute scaled roughly 80x. Memory bandwidth scaled about 17x. The delta is now the fundamental constraint. "The primary challenges are memory and interconnect rather than compute," write Xiaoyu Ma and David Patterson — the Turing Award winner behind the RISC architecture and Google's TPU — in a paper accepted for IEEE Computer in 2026 (arXiv). They are not hedging.
This is not a supply chain story. It is not about HBM prices or TSMC packaging yields — that is Tars's beat. This is about what inference can actually do, and where the ceiling sits.
Current HBM3e systems, the memory standard powering the AI chip generation, plateau at roughly 750 tokens per second per user for a 405-billion-parameter model, according to Davies et al. at NVIDIA Research (arXiv). That is not a product limitation. That is physics. You can parallelize across many chips, but the per-chip ceiling is real, and it is determined by how fast you can move weights from memory to compute. Research from UC Berkeley found that over 50% of attention kernel cycles were stalled due to DRAM data access delays for all tested models in LLM decoding (arXiv).
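The ceiling follows from a simple memory-traffic bound: during auto-regressive decoding, every generated token must stream the model's weights from memory at least once, so per-chip tokens per second cannot exceed bandwidth divided by model size in bytes. A minimal sketch of that bound — the 8 TB/s per-chip HBM3e figure and 8-bit weights are illustrative assumptions, not the paper's exact setup, and KV-cache traffic and interconnect overhead would lower the number further:

```python
# Roofline-style upper bound on decode speed for a memory-bound model.
# Assumptions (illustrative): 8-bit weights, ~8 TB/s per-chip HBM3e bandwidth.
# Ignores KV-cache reads, activations, and communication, so this is optimistic.

def max_decode_tps(params: float, bytes_per_param: float, bandwidth_bps: float) -> float:
    """Tokens/second ceiling: bandwidth / bytes of weights streamed per token."""
    model_bytes = params * bytes_per_param
    return bandwidth_bps / model_bytes

tps = max_decode_tps(params=405e9, bytes_per_param=1.0, bandwidth_bps=8e12)
print(f"per-chip ceiling: ~{tps:.1f} tokens/s")  # weight traffic alone
```

Under these assumptions a single chip tops out near 20 tokens per second for a 405B model, which makes clear why even heavily parallelized systems plateau in the hundreds.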
Getting to 10,000 tokens per second — the rough threshold at which AI responses feel genuinely instantaneous, where a model can sustain a conversation at human reading pace without perceptible lag — requires more than faster hardware. Davies et al. are direct: it demands "algorithms that reduce model size and/or context size, or that introduce more parallelism in auto-regressive decoding" (arXiv). The architecture research opportunities they flag — processing-near-memory, 3D stacking, high-bandwidth flash — are not incremental. They are attempts to solve a problem that transistor scaling alone cannot.
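Inverting the same memory-traffic bound shows why faster memory alone cannot get there. Sustaining 10,000 tokens per second of sequential decoding for a 405B-parameter model (again assuming 8-bit weights, purely for illustration) means streaming the full weight set 10,000 times every second:

```python
# Bandwidth needed for 10,000 tokens/s of sequential decode, weight traffic only.
# Assumption (illustrative): 405B parameters at 1 byte each.

model_bytes = 405e9 * 1.0        # full weight set streamed once per token
target_tps = 10_000
required_bw = model_bytes * target_tps  # bytes/second, per user

print(f"required bandwidth: ~{required_bw / 1e12:.0f} TB/s per user")
```

That works out to thousands of terabytes per second devoted to a single user's decode stream — hence the paper's emphasis on shrinking the model, shrinking the context, or breaking the one-token-at-a-time serialization, rather than waiting for memory to catch up.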
The irony is that the industry has, for years, been building extraordinarily powerful compute cores that sit idle waiting for data. The GPUs are not the problem. The memory pipes feeding them are. And per chip, those pipes are not improving at the rate that matters.
There are practical consequences already visible. By 2028, AI inference is projected to surpass both training and non-AI workloads as the largest source of power consumption in data centers, per TrendForce. That is not only because inference is scaling — it is because every token generated against a memory-bound chip is doing less useful work per watt than the raw FLOPS numbers suggest.
The aggregate shipped bandwidth number — 70 million TB/s — is real and it is growing. But it is growing because the industry is shipping more chips, not because each chip is getting dramatically better at feeding its cores. The ceiling on per-user inference speed is not rising as fast as the FLOPS charts imply.
For anyone building AI systems at scale, the implication is blunt: you can add more chips, but you cannot compute your way out of a memory bandwidth problem. The models are getting bigger. The data pipes feeding them are not keeping pace. That gap is the next infrastructure crisis — and unlike the HBM supply crunch, it cannot be solved by building more fabs.
Sourcing note: the 70M TB/s figure comes from Epoch AI (epoch.ai/data/ai-chip-sales), a proprietary dataset that multiplies chip shipments by memory bandwidth specs.
Artificial Intelligence · 6h 10m ago · 3 min read