SRAM Hit the Wall: Why AI Chips Run at 20% Utilization
Your GPU is lying to you about how busy it is.
When an AI accelerator reports high utilization, the assumption is that it is grinding through matrix multiplications at full throttle. The reality, according to Eliyan CEO Ramin Farjadrad, is that processors in many cases run at 20% utilization or less, not because they cannot compute, but because they are waiting on memory. Memory bandwidth has grown by less than 100X over the same period in which compute throughput scaled by five orders of magnitude. The gap is over 1,000X.
This is not a new problem. The memory wall has been building for years. But the AI surge is making a chronic condition acute at exactly the wrong moment.
The SRAM Cliff
SRAM — the on-chip memory that holds instructions and working data for processors — is built around a 6-transistor (6T) bitcell designed in the 1980s for density. It has a structural flaw that is now fatal: the access transistor fights with the storage transistor during read and write operations, and process variation at small geometries makes that fight impossible to balance cleanly. From 65nm to 5nm, each node delivered 50% to 100% density improvements in SRAM. At 2nm and below, that number is less than 15% per node — a cliff, not a curve. Synopsys principal product manager for embedded memory IP Daryl Seitzer put it plainly: the SRAM bitcell was invented to be dense, and it has an inherent flaw of conflicting read and write requirements that gets harder to balance at every new process node.
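One way to see why a drop from 50%-100% to under 15% per node reads as a cliff is to count how many node transitions it takes to double SRAM density at each rate. The sketch below is illustrative arithmetic only: the helper name `nodes_to_double` is hypothetical, and it assumes the quoted gain holds constant from node to node.

```python
import math

def nodes_to_double(gain_per_node: float) -> float:
    """Node transitions needed to double density at a fixed fractional gain per node."""
    return math.log(2) / math.log(1 + gain_per_node)

# Rates quoted in the text: 100% and 50% (65nm to 5nm era), 15% (2nm and below).
for gain in (1.00, 0.50, 0.15):
    print(f"{gain:.0%} per node: {nodes_to_double(gain):.1f} nodes to double density")
```

Under that assumption, a doubling that once took one or two nodes now takes roughly five.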
The timing is brutal. A widely cited paper from researchers at UC Berkeley, AI and Memory Wall (arXiv 2403.14123), documents the divergence in no uncertain terms. Peak hardware compute scaled 3X every two years over the past 20 years. Memory bandwidth grew 1.6X every two years over the same period, and interconnect bandwidth 1.4X. Over a shorter window, Nvidia GPU 64-bit FLOPS rose 80X from 2012 to 2022, while bandwidth grew 17X. The consequence is not complicated: many newer AI models, LLMs in particular, have lower arithmetic intensity than the models that came before, meaning they fetch more data per FLOP. LLMs are memory-bandwidth-bound, not compute-bound. That is why GPU utilization can sit at 20% even when peak FLOPS look fine on paper.
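The link between arithmetic intensity and utilization can be made concrete with the roofline model, in which attainable throughput is the lesser of peak compute and arithmetic intensity times memory bandwidth. The sketch below is a minimal illustration, not a measurement: the peak FLOPS, bandwidth, and intensity values are hypothetical placeholders, and `attainable_flops` and `utilization` are names introduced here.

```python
def attainable_flops(peak_flops: float, mem_bw_bytes_per_s: float, intensity: float) -> float:
    """Roofline bound: min(peak compute, arithmetic intensity * memory bandwidth).

    intensity is in FLOPs performed per byte moved to or from memory.
    """
    return min(peak_flops, intensity * mem_bw_bytes_per_s)

def utilization(peak_flops: float, mem_bw_bytes_per_s: float, intensity: float) -> float:
    """Fraction of peak compute the roofline bound allows."""
    return attainable_flops(peak_flops, mem_bw_bytes_per_s, intensity) / peak_flops

# Hypothetical accelerator (assumed values, not a real part).
PEAK = 1.0e15   # 1 PFLOP/s peak compute
BW = 2.0e12     # 2 TB/s memory bandwidth
RIDGE = PEAK / BW  # intensity at which the chip stops being bandwidth-bound

for intensity in (2, 100, 500, 1000):
    print(f"{intensity:>5} FLOPs/byte: {utilization(PEAK, BW, intensity):.1%} of peak")
```

With these placeholder numbers, a workload needs 500 FLOPs per byte to saturate the compute units, and one running at 100 FLOPs per byte tops out at 20% of peak no matter how well the kernels are tuned.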
The conventional response to a memory bottleneck is to add more SRAM cache on-chip. That is exactly what is failing. SRAM density scaling has stalled. At 3nm, Synopsys managed to match the SRAM density that Intel and TSMC achieved at 2nm — 0.021 µm² bit cells delivering 38.1 megabits per square millimeter — but the Synopsys SRAM maxed out at 2.3 GHz, compared to 4.2 GHz for TSMC and 5.6 GHz for Intel at the same density. You can have density or speed. Taking both requires a new architecture.
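The two density figures quoted above measure different things: bitcell area is the footprint of a single cell, while megabits per square millimeter describes a finished macro that also carries peripheral circuitry such as decoders and sense amplifiers. The sketch below relates them, assuming "megabit" means 10^6 bits; the function names are hypothetical, and the resulting efficiency is derived here rather than reported by any vendor.

```python
UM2_PER_MM2 = 1_000_000  # 1 mm = 1,000 um, so 1 mm^2 = 1e6 um^2

def raw_bitcell_density_mb_per_mm2(bitcell_area_um2: float) -> float:
    """Density if the area were tiled with bitcells only, in decimal megabits per mm^2."""
    return UM2_PER_MM2 / bitcell_area_um2 / 1e6

def array_efficiency(macro_density_mb_per_mm2: float, bitcell_area_um2: float) -> float:
    """Fraction of macro area implied to be occupied by bitcells."""
    return macro_density_mb_per_mm2 / raw_bitcell_density_mb_per_mm2(bitcell_area_um2)

raw = raw_bitcell_density_mb_per_mm2(0.021)
eff = array_efficiency(38.1, 0.021)
print(f"bitcell-only ceiling: {raw:.1f} Mb/mm^2, implied array efficiency: {eff:.0%}")
```

On these figures, roughly a fifth of the macro area goes to something other than storage cells, which is one reason bitcell shrinks do not translate one-to-one into usable cache capacity.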
What the Industry Is Actually Building
Intel 18A and TSMC N2 both use gate-all-around (GAA) nanosheet transistor architecture, which gives more flexibility in tuning transistor width than the FinFET designs they replace. Both showed SRAM bitcells around 0.021 µm² at ISSCC 2025 — Intel 23% denser than the prior generation, TSMC 12% denser. Incremental gains. Neither company is pretending otherwise.
The more ambitious fix is restructuring the memory hierarchy itself. Die-to-die links, which connect separate chips inside a single package, offer a path to bandwidth that a monolithic die cannot. Eliyan taped out its NuLink PHY IP on TSMC N3, achieving 64 Gbps per bump with standard packaging, which translates to 4.55 Tbps per millimeter of interface width. Eliyan CEO Ramin Farjadrad: "In many cases, we see 20% utilization of the processor for most functions, if not less. It is mainly limited by the memory and memory bandwidth."
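The per-bump and per-millimeter figures above can be cross-checked with a single division. The sketch below only derives the effective number of data lanes per millimeter that the two quoted numbers imply; it says nothing about bump pitch or how many rows of bumps are used, which the text does not state, and the function name is hypothetical.

```python
def implied_lanes_per_mm(edge_bandwidth_tbps_per_mm: float, per_bump_gbps: float) -> float:
    """Effective data lanes per mm of die edge implied by the two quoted figures."""
    return edge_bandwidth_tbps_per_mm * 1000 / per_bump_gbps  # 1 Tbps = 1,000 Gbps

lanes = implied_lanes_per_mm(4.55, 64)
print(f"about {lanes:.0f} effective data lanes per mm of interface width")
```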
But die-to-die chiplet architectures are expensive. As Renesas principal product marketing manager Kavita Char put it, at some point SRAM becomes non-scalable, and then it starts to occupy a larger percentage of the total die size. Chip designers have to decide what can live on-chip and when they have to reach for external memory.
The HBM Supercycle
High Bandwidth Memory (HBM3e today, with HBM4 expected in mass production in 2026) is the industry's primary workaround. HBM stacks DRAM dies vertically and connects them to the processor via a 1,024-bit to 2,048-bit interface, delivering bandwidth that planar DRAM cannot. HBM4 is targeting 2 terabytes per second of bandwidth with a 2,048-bit interface.
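Interface width and bandwidth are tied together by the per-pin data rate: bandwidth in bytes per second equals interface width in bits times per-pin rate, divided by eight. The sketch below works backward from the 2 TB/s target, assuming decimal terabytes and that the figure applies to one 2,048-bit interface; the function names are hypothetical and the per-pin rate is derived here, not quoted from a specification.

```python
def interface_bandwidth_bytes_per_s(interface_bits: int, pin_rate_gbps: float) -> float:
    """Aggregate bandwidth of a parallel interface, in bytes per second."""
    return interface_bits * pin_rate_gbps * 1e9 / 8

def required_pin_rate_gbps(target_bytes_per_s: float, interface_bits: int) -> float:
    """Per-pin data rate needed to reach a target bandwidth, in Gbps."""
    return target_bytes_per_s * 8 / interface_bits / 1e9

TARGET = 2e12  # 2 TB/s, the HBM4 target quoted in the text

for width in (1024, 2048):
    rate = required_pin_rate_gbps(TARGET, width)
    print(f"{width}-bit interface: {rate:.4f} Gbps per pin for 2 TB/s")
```

Doubling the interface width halves the signaling rate each pin has to sustain for the same bandwidth, which is the basic trade that a wider interface buys.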
The demand is explosive. HBM consumption grew more than 130% year-over-year in 2025 and is expected to grow more than 70% YoY in 2026, driven by next-generation AI accelerator platforms from all major chipmakers and by Google TPU and AWS Trainium adoption. SK Hynix currently leads HBM supply. Samsung has struggled with yields. The result is a supply crunch that kept HBM3e priced at a 4X to 5X premium over server DDR5. TrendForce projects that the premium will compress to 1X to 2X by the end of 2026 as production scales.
China is investing heavily to close the gap. ChangXin Memory Technologies (CXMT), the country's leading DRAM maker, is targeting HBM3 production by the end of 2026, backed by a $4.2 billion Shanghai IPO and sustained state-aligned financing. Bloomberg reported proposals in late 2025 for a new subsidy and financing package in the range of 200 billion to 500 billion yuan (roughly $28 billion to $70 billion) for the broader chip sector, on top of a decade of existing state-backed investment. Whether that translates to competitive HBM before the next node transition is an open question: yields and DRAM process maturity take time to build, and US export controls restrict China's access to the most advanced semiconductor equipment needed for leading-edge HBM production.
Hanmi Semiconductor holds a dominant position in the thermal compression (TC) bonders used to attach DRAM dies in 3D stacking, creating a tooling chokepoint that is separate from the chip itself.
What This Means for AI Builders
The memory wall changes the calculus for anyone building AI systems. Peak FLOPS are a misleading spec when the actual constraint is memory bandwidth. A chip that reports 60,000X higher peak compute than a 2003 server may deliver less than 100X the actual work per second for LLM inference.
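The 60,000X figure is what the growth rates cited earlier produce when compounded. The sketch below assumes the rates (3X for compute, 1.6X for memory bandwidth, 1.4X for interconnect) are per two years and held for about 20 years, a horizon chosen here to match the 2003 baseline; `compound_growth` is a hypothetical helper, and the rounded rates give only approximate totals.

```python
def compound_growth(rate_per_period: float, years: float, period_years: float = 2.0) -> float:
    """Total growth factor after compounding a per-period rate over a span of years."""
    return rate_per_period ** (years / period_years)

YEARS = 20  # assumed horizon, roughly 2003 to the early 2020s

for label, rate in (("peak compute", 3.0), ("memory bandwidth", 1.6), ("interconnect", 1.4)):
    print(f"{label}: about {compound_growth(rate, YEARS):,.0f}X over {YEARS} years")
```

Compute compounds to roughly 59,000X while memory bandwidth lands on the order of 100X, so a small difference in per-period rates opens a gap of several hundred times over two decades.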
AI developers are beginning to rewrite inference stacks for memory locality, optimizing data movement, not just FLOPs. This is a systems problem, not a circuit problem. As Arteris senior manager of product management Andre Bonnardot notes, when memory density growth slows, simply adding more cache becomes economically inefficient. The alternatives are not free either: 3D stacking, die-to-die PHY, and advanced packaging all add cost and thermal complexity. For now, these are solutions for premium AI accelerators: the B300s, GB300s, and custom silicon inside hyperscaler data centers. Mass-market chips will live with the memory wall for longer.
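A common back-of-envelope estimate shows why data movement dominates LLM inference: during token-by-token decoding, every generated token requires streaming the model weights from memory once, so bandwidth sets a ceiling on tokens per second. The sketch below is a simplified bound under stated assumptions: it ignores KV-cache traffic and compute limits, the 70-billion-parameter model with 16-bit weights is a hypothetical example, and 2 TB/s is the HBM4 target quoted earlier.

```python
def max_decode_tokens_per_s(
    n_params: float,
    bytes_per_param: float,
    mem_bw_bytes_per_s: float,
    batch_size: int = 1,
) -> float:
    """Bandwidth-bound ceiling on decode throughput.

    Each decode step streams all weights once; the sequences in a batch
    share that single pass, so the ceiling scales with batch size.
    """
    weight_bytes = n_params * bytes_per_param
    steps_per_s = mem_bw_bytes_per_s / weight_bytes
    return steps_per_s * batch_size

# Hypothetical model: 70e9 parameters, 2 bytes each, served from 2 TB/s of memory.
for batch in (1, 8, 64):
    rate = max_decode_tokens_per_s(70e9, 2, 2e12, batch)
    print(f"batch {batch:>2}: at most {rate:,.0f} tokens/s")
```

Batching is the simplest locality optimization in this picture: each weight fetched from memory is reused across every sequence in the batch, which raises arithmetic intensity without touching the hardware.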
This is not a problem the next process node fixes. SRAM scaling is not recovering. The memory wall is a physical constraint that will define AI infrastructure decisions for the next five years. Every hyperscaler, every AI startup sizing a cluster, every fab planning capacity is now making bets on memory — not just logic. The chip industry built a trillion-dollar AI boom on the assumption that compute would scale to meet demand. The memory wall suggests that assumption was always incomplete.