The data center network was designed for a world where GPUs compute and storage sits somewhere else. That world is ending.
AI inference workloads don't burst; they flow. A large language model processing a long document or maintaining a multi-turn conversation generates sustained, long-lived network flows that persist on the same paths for minutes or hours at a time. Training, by contrast, is event-driven: a checkpoint write, a gradient sync, then quiet. Networks built for that rhythm, with deep buffering and Equal-Cost Multi-Path routing that hashes flows across the available paths, worked fine for training. They were not built for inference's sustained pressure.
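The ECMP problem can be seen in a few lines. This is an illustrative sketch, not any vendor's implementation: real switches hash more header fields with hardware hash functions, but the principle is the same, and it shows why one long-lived inference flow stays pinned to a single path for its entire lifetime.

```python
import zlib

def ecmp_path(src: str, dst: str, src_port: int, dst_port: int, n_paths: int) -> int:
    """Hash the flow's 4-tuple so every packet of that flow takes the same path."""
    key = f"{src}|{dst}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_paths

# Many short training flows spread across paths statistically; one long-lived
# inference flow hashes to the same path on every packet, for hours.
path = ecmp_path("10.0.0.1", "10.0.0.2", 51111, 443, 8)
assert all(ecmp_path("10.0.0.1", "10.0.0.2", 51111, 443, 8) == path
           for _ in range(100))
```

Statistical load balancing works when flows are numerous and short-lived; a handful of heavy, persistent flows can saturate one path while others sit idle.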
As Semiconductor Engineering reported, this traffic pattern shift is forcing a structural rethink of where storage fits in the data center hierarchy. The traditional model placed storage as an external service, accessed occasionally through north-south paths. Inference changes that math entirely.
"Storage is no longer an external service accessed occasionally through North-South paths. It becomes part of the high-performance network-based memory fabric itself," the Semiconductor Engineering analysis noted. The implication: if inference is going to keep reaching back into context stored somewhere other than GPU high-bandwidth memory, that access has to happen at network line rates — not at the speed of a storage array answering a query.
NVIDIA's answer is the Inference Context Memory Storage Platform (ICMSP), announced alongside the Rubin platform and BlueField-4 data processing unit. The core idea is a new G3.5 memory tier — an Ethernet-attached flash layer purpose-built for KV cache data, sitting between GPU HBM and conventional shared storage. BlueField-4 acts as the bridge, exposing remote NVMe over Fabrics as direct memory-mapped addresses the GPU can reach without CPU intervention. The result is KV cache offloading that doesn't feel like going to disk.
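The tiering idea can be sketched in miniature. This is a conceptual illustration, not NVIDIA's API: the class names, capacities, and eviction policy below are hypothetical, chosen only to mirror the hierarchy the article describes (GPU HBM, then a flash-backed context tier, then full recompute on a miss).

```python
from dataclasses import dataclass, field

@dataclass
class TieredKVCache:
    """Toy two-tier KV cache: hot HBM tier backed by a slower context tier."""
    hbm: dict = field(default_factory=dict)           # hottest tier, capacity-limited
    context_tier: dict = field(default_factory=dict)  # stand-in for the flash-backed tier
    hbm_capacity: int = 2                             # tiny, for demonstration only

    def put(self, seq_id: str, kv_block: bytes) -> None:
        """Insert into HBM, evicting the oldest block to the context tier when full."""
        if len(self.hbm) >= self.hbm_capacity:
            victim, block = next(iter(self.hbm.items()))
            del self.hbm[victim]
            self.context_tier[victim] = block  # offload instead of discarding
        self.hbm[seq_id] = kv_block

    def get(self, seq_id: str):
        """Return (tier, block); a miss in both tiers means full recompute."""
        if seq_id in self.hbm:
            return "hbm", self.hbm[seq_id]
        if seq_id in self.context_tier:
            block = self.context_tier.pop(seq_id)
            self.put(seq_id, block)  # promote back to HBM on reuse
            return "context_tier", block
        return "recompute", None
```

The economic argument hinges on the last line: without the middle tier, an evicted context is gone, and serving the next turn of that conversation means recomputing every token's keys and values from scratch.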
The performance claims are substantial. NVIDIA said ICMSP delivers up to five times higher tokens-per-second and five times greater power efficiency compared with traditional storage in inference workloads. VAST Data, a storage company participating in the ICMSP ecosystem, reported a 10x improvement in inference capability from a single GPU server by offloading KV cache in combination with NVIDIA Dynamo. These are early-stack numbers, not peer-reviewed benchmarks, but they come from companies with shipped products and existing customer deployments, not slides.
The partner list for BlueField-4 ICMSP integration reads like a storage industry roll call: AIC, Cloudian, DDN, Dell Technologies, Hewlett Packard Enterprise, Hitachi Vantara, IBM, Nutanix, Pure Storage, Supermicro, VAST Data, and WEKA. That breadth matters. ICMSP only works if the storage layer participates — NVMe over Fabrics has to be the protocol, not NFS or iSCSI.
WEKA's vice president made a point worth dwelling on: traditional storage protocols and data services introduce overhead — metadata paths, small-IO amplification, durability and replication defaults, multi-tenant controls applied in the wrong place — that can turn fast context into slow storage. The KV cache, by design, is small, hot, and latency-sensitive. Legacy storage was built for the opposite workload profile. The G3.5 tier exists precisely because existing protocols are the wrong abstraction for what inference actually needs.
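The arithmetic behind that workload profile is worth making concrete. The sizing formula below is standard for transformer KV caches; the model dimensions are assumptions typical of a 70B-class model with grouped-query attention, not a published spec. Each access touches a small, hot block, yet the aggregate grows into gigabytes per long context.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: keys and values (factor of 2)
    stored at every layer, in fp16 (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Assumed dimensions for a 70B-class model: 80 layers, 8 KV heads, head_dim 128.
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(per_token)                   # 327680 bytes, i.e. 320 KiB per token
print(per_token * 32_768 / 2**30)  # 10.0 GiB for a single 32k-token context
```

Ten gigabytes per long-running conversation, multiplied across concurrent sessions, is what overflows HBM and forces the choice between evicting context (and recomputing it) or parking it on a tier fast enough to read back at network line rates.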
NVLink 6 underpins this with 3.6 terabytes per second of GPU-to-GPU bandwidth within an NVL72 rack, enabling full all-to-all fabric connectivity across the system. That's the intra-rack picture. ICMSP extends the memory domain beyond a single rack's GPU pool to a shared flash tier accessible over standard Ethernet — a meaningful departure from the tight coupling that NVLink demands.
Microsoft has already committed. It became the first hyperscale cloud provider to power on NVIDIA's Vera Rubin NVL72 systems and will deploy them to liquid-cooled Azure Fairwater datacenters in Wisconsin and Atlanta over the coming months, according to an announcement at GTC 2026 on March 16. Hyperscalers are not waiting for the ecosystem to mature before committing.
What NVIDIA is essentially doing is rebuilding the memory hierarchy for a world where inference, not training, is the dominant workload. The Rubin platform promises 10x higher inference throughput and 10x lower cost per token versus the prior generation. Whether those numbers hold in production across diverse model architectures is the question every infrastructure buyer is racing to answer before committing to a rack redesign.
The broader implication: data center networking and storage are converging. Networks are becoming memory buses. Storage is becoming a compute tier. The wall between "network" and "storage," already porous with NVMe over Fabrics, is dissolving under the weight of inference's memory appetite. Anyone building AI infrastructure at scale is going to have to make architectural decisions that blur that boundary deliberately, or watch their utilization economics collapse under context they can't serve fast enough.