AI scaling is running into a rack full of copper
AI infrastructure stories usually get told as chip stories. Reiner Pope makes a fresher point: a lot of the pain now lives one level up, in the cabinet full of wire holding the chips together.
In a new Dwarkesh Patel interview, Pope, the chief executive of chip startup MatX and a former Google TPU architect, argues that once large AI models are spread across enough accelerators, speed and cost depend heavily on how much traffic can stay inside one tightly connected rack instead of crossing slower links between racks. That is his analysis, not a settled industry verdict. It is also a useful one, because the hardware facts underneath it are real.
NVIDIA says its GB200 NVL72 rack-scale system uses more than 5,000 high-performance copper cables. The company says each rack weighs one-and-a-half tons, contains more than 600,000 parts, and packs in two miles of wire. Those are not decorative details. They are a reminder that frontier AI systems now run into connector density, cooling, power delivery, and plain old weight.
Pope’s argument is most useful as a way to explain why that physicality shows up now. Large mixture-of-experts models, which route each token through only some specialist sub-networks instead of the whole model, save compute. But they still have to move weights, cached context, and token traffic around the machine at high speed. In the transcript, Pope uses DeepSeek V3 as a worked example and says it has about 37 billion active parameters out of roughly 700 billion total (DeepSeek's own published figures are 37 billion activated and 671 billion total). He says practical frontier inference batches land around a few thousand token positions, and that a batch of about 2,000 on a 20 millisecond cadence works out to roughly 128,000 tokens per second for one system. Those are Pope's rough interview figures, not audited disclosures from frontier labs.
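The arithmetic behind that throughput figure is simple enough to write down. The sketch below just divides batch size by step time; the inputs are the rounded interview figures, not measurements, and running them shows how rounded they are: 2,000 positions every 20 milliseconds is 100,000 tokens per second, while Pope's 128,000 figure implies a batch closer to 2,560 at the same cadence.

```python
# A minimal sketch of the throughput arithmetic, using Pope's rounded
# interview figures rather than any measured numbers.

def decode_throughput(batch_positions: int, step_seconds: float) -> float:
    """Tokens per second when each batched position emits one token per step."""
    return batch_positions / step_seconds

print(decode_throughput(2_000, 0.020))  # 100,000 tokens/s at the quoted figures
print(decode_throughput(2_560, 0.020))  # 128,000 tokens/s, matching Pope's rounder number
```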
The narrower point is the one that matters. Once enough users are batched together, the limiting pressure can shift away from raw arithmetic and toward memory bandwidth and communication locality, meaning how much of the model's traffic stays inside one fast local fabric instead of crossing slower connections.
That is why bigger local interconnect domains matter. In a technical post on GB200 NVL72, NVIDIA says the system ties 72 Blackwell GPUs into one NVLink domain with 30 TB of unified memory over a 130 TB/s compute fabric, plus 1.8 TB/s of bidirectional NVLink bandwidth per GPU. Keeping more accelerators inside one fast fabric means more of the model traffic stays local, and less of it has to cross the slower network that links one rack to another.
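How wide is the gap between those two kinds of links? The sketch below puts rough numbers on it. The in-rack figure is NVIDIA's published 1.8 TB/s of NVLink bandwidth per GPU; the cross-rack figure, a 400 Gb/s network port per GPU, is an assumption chosen for illustration rather than a number from the sources, though it is in the range of what gets deployed today.

```python
# Hedged comparison: moving the same payload inside the NVLink domain vs. across racks.
# 1.8 TB/s per GPU is NVIDIA's published NVLink figure for GB200 NVL72.
# The 400 Gb/s inter-rack port per GPU is an illustrative assumption.

NVLINK_BYTES_PER_S = 1.8e12          # 1.8 TB/s per GPU inside the rack
INTER_RACK_BYTES_PER_S = 400e9 / 8   # assumed 400 Gb/s port, about 50 GB/s

def transfer_ms(payload_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time in milliseconds, ignoring latency and protocol overhead."""
    return payload_bytes / bandwidth_bytes_per_s * 1e3

payload = 1e9  # one gigabyte of activations or expert traffic
print(f"inside the rack: {transfer_ms(payload, NVLINK_BYTES_PER_S):.2f} ms")      # ~0.56 ms
print(f"across racks:    {transfer_ms(payload, INTER_RACK_BYTES_PER_S):.1f} ms")  # ~20 ms
```

Latency and protocol overhead make the real gap worse than that roughly 36-fold difference in raw bandwidth, but even the idealized version shows why architects fight to keep traffic inside the domain.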
That tradeoff gets sharper in routed models. The 2022 paper "Unified Scaling Laws for Routed Language Models" studied architectures that use only a subset of their parameters for a given input. In practice, that means tokens are routed between specialist components spread across accelerators. Inside one rack with fast local links, that traffic is manageable. Spread the same pattern across multiple racks and the system starts paying a bandwidth and latency tax.
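How much traffic does the routing itself generate? A rough way to size it is to count the bytes each token has to shuttle: its activation goes out to the experts it is routed to, and the expert outputs come back. The dimensions below are round numbers in the neighborhood of DeepSeek-V3's published configuration, and the batch figure reuses Pope's interview example; all of it is illustrative rather than measured.

```python
# A rough sizing of expert-routing traffic, using round numbers near
# DeepSeek-V3's published configuration. Illustrative, not measured.

HIDDEN_SIZE = 7_168        # activation width per token (assumed)
EXPERTS_PER_TOKEN = 8      # routed experts each token visits (assumed)
BYTES_PER_VALUE = 2        # bf16 activations (assumed)
BATCH_POSITIONS = 2_000    # token positions per decoding step (Pope's rough figure)

# Dispatch to the experts and combine the results back: count the bytes both ways.
bytes_per_token = HIDDEN_SIZE * BYTES_PER_VALUE * EXPERTS_PER_TOKEN * 2
bytes_per_step = bytes_per_token * BATCH_POSITIONS

print(f"{bytes_per_token / 1e3:.0f} KB of routing traffic per token")  # ~229 KB
print(f"{bytes_per_step / 1e6:.0f} MB of routing traffic per step")    # ~459 MB
```

Inside one 130 TB/s fabric, moving that half-gigabyte costs a few microseconds of bus time per step. Pushed over inter-rack links that are dozens of times slower per GPU, the same bytes start eating a visible slice of a 20 millisecond step budget, and that is before counting per-hop latency or the other flows, like tensor-parallel reductions, that share those links.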
There is a second constraint underneath this: memory. Long-context models rely on a key-value cache, the stored internal state that lets a model refer back to earlier text without recomputing everything from scratch. On the DeepSeek-V2-Lite model page, the architecture summary says DeepSeek-V2 uses Multi-head Latent Attention, a method that compresses that cache into a latent vector for more efficient inference. That is a useful tell. Model builders are spending real design effort on memory compression because memory is expensive.
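A sizing sketch shows why. The layer count, head dimensions, and latent width below are round illustrative numbers in the neighborhood of DeepSeek's published configurations, not exact specs, and non-MLA models often shrink the cache by other means, such as grouped-query attention; the point is the order of magnitude, not the decimals.

```python
# Hedged sketch of why builders compress the key-value cache.
# All dimensions are round illustrative numbers, not exact published specs.

LAYERS = 60
HEADS = 128
HEAD_DIM = 128
LATENT_DIM = 576          # compressed per-layer latent, MLA-style (assumed)
BYTES_PER_VALUE = 2       # bf16
CONTEXT_TOKENS = 128_000  # one long-context sequence

# Full multi-head cache: keys and values for every head in every layer.
full_cache_per_token = 2 * LAYERS * HEADS * HEAD_DIM * BYTES_PER_VALUE
# MLA-style cache: one small latent per layer instead.
latent_cache_per_token = LAYERS * LATENT_DIM * BYTES_PER_VALUE

for name, per_token in [("full KV cache     ", full_cache_per_token),
                        ("latent (MLA-style)", latent_cache_per_token)]:
    per_seq_gb = per_token * CONTEXT_TOKENS / 1e9
    print(f"{name}: {per_token / 1e3:,.0f} KB/token, ~{per_seq_gb:.1f} GB per 128K-token sequence")
```

Against the 30 TB of unified memory in one NVL72 rack, the uncompressed version leaves room for only a few dozen long-context sequences; the compressed one changes that math by more than an order of magnitude, which is where the design effort is going.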
None of this proves that rack design has replaced chip design as the single bottleneck on AI, and the sources assembled here do not support a claim that strong. What they do support is a narrower one: Pope's rack-physics argument rests on real hardware constraints, and NVIDIA's own system descriptions make those constraints unusually concrete.
The frontier AI story is still partly about better chips. It is also about who can build denser, cooler, better-connected racks and keep more model traffic inside them. Past a certain point, the future of AI depends on whether you can route 5,000 copper cables through a one-and-a-half-ton box without the economics getting stupid.