For five decades, the semiconductor industry answered the demand for more computing power with a single trick: make the transistor smaller. That playbook is running out of room. Artificial intelligence workloads now stress compute, memory bandwidth, interconnect capacity, and power delivery all at once, and solving one of those bottlenecks exposes the next. The industry's response, visible in Google's latest accelerator and the rhetoric of a leading European chip-research consortium, is to stop thinking of the chip as a single square of silicon and start treating the whole package as one orchestrated computer that combines stacked dies, vertical wires, memory stacks, photonic links, and cooling.
That architectural pivot is the bet behind imec's "heterogeneous large-scale integration" (HLSI) thesis, articulated by the Belgian research center's vice president of R&D, Julien Ryckaert, in a recent EE Times essay. "Moore's Law does not need to be broken, only re-targeted toward more compute per unit area," Ryckaert argued. In his reading, the next wave of gains comes not from any one transistor breakthrough but from composing specialized silicon dies, memory stacks, power-delivery circuits, and photonic interconnects into a bonded tower that the package treats as a single system. The package becomes the new chip.
The shift is already visible in hyperscaler silicon. Google's TPU v7, codenamed "Ironwood" and described in the company's official documentation, is the first chip in Google's AI accelerator line explicitly tuned for inference at scale. Google's own engineering writeup describes it as part of a "codesigned AI stack" in which accelerator, memory, interconnect, and inference software are designed together rather than as separate artifacts. The Next Platform reported that Ironwood pushes the accelerator design down to the floor of the data center rack, and The Next Web read the move as Google splitting its next TPU in two. The implication is that the chip itself is no longer the unit of design. The stack is.
Microsoft is making the same case from a different angle. Its silicon-to-service framing treats custom AI accelerators not as standalone silicon but as one component of a vertically integrated response to AI demand. The company's message is that the question is no longer whether a single chip is faster, but whether the whole path from wafer to served inference improves. Meta's MTIA program, which has produced four custom AI chips in roughly two years, each tightly tied to Meta's recommendation and inference workloads, shows the same pattern at a smaller scale. Hyperscaler silicon now lives in fast iteration loops tied to specific model families, not on long monolithic roadmaps.
None of this is consensus. imec's HLSI framing is one research consortium's read of where the industry's bottlenecks are converging, not a settled paradigm, and the 3D-stacking and hybrid-bonding techniques it relies on are still climbing their own yield, cost, and thermal curves. Hyperscaler blogs are engineering narratives, not independent benchmarks: when Google calls Ironwood the first inference-tuned TPU, or when Microsoft pitches "silicon to service," the companies are describing their own stacks in their own terms. And not every AI workload actually needs a tower of bonded dies. Many still run well on conventional GPUs.
What to watch next is whether the architectural shift lowers cost per AI operation for buyers outside the hyperscalers, or whether stacked-die economics stay confined to companies that can afford custom packaging lines. Google's first large-scale Ironwood deployments will be the cleanest public test of whether systems-orchestrated inference can outperform dense GPU clusters on real workloads at real prices. The next twelve months of TPU and MTIA disclosures, more than any transistor-shrink milestone, will show whether the package really has become the new chip.