AWS replaces the datacenter 'org chart' with a flat, random-mesh network


For most of the last decade, engineering teams that tried to replace the datacenter's hierarchical network "org chart" with a flat, random-mesh topology hit the same wall: the math worked on paper, but in a real building full of switches and cables, the design collapsed under its own routing complexity. AWS says it has finally broken through that wall and has been running the result inside its own facilities for months.

The system, called Resilient Network Graphs, is now in production inside AWS datacenters, according to Matt Rehder, the company's vice president of global network engineering. Speaking to The Register, Rehder framed the change as collapsing the "org chart" of switches that defines a conventional datacenter. In a traditional design, every packet that needs to reach a peer has to climb up to a parent switch and back down again, the way a memo travels through layers of management. Resilient Network Graphs lets switches talk more directly, removing the middle layers.
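To make the "up and back down" cost concrete, here is a toy comparison in Python. It is a sketch under stated assumptions, not AWS's design: the function names (`build_tree`, `build_random_mesh`, `mean_hops`), the fan-out of 4, and the 4 ports per switch are all invented for illustration. The random mesh uses a simplified form of the random pairing idea from the academic literature, with no claim that Resilient Network Graphs is built this way.

```python
import random
from collections import deque


def build_tree(fanout, levels):
    """Hierarchical topology: one root switch, each switch has `fanout` children."""
    adj = {0: set()}
    frontier, next_id = [0], 1
    for _ in range(levels):
        new_frontier = []
        for parent in frontier:
            for _ in range(fanout):
                adj[next_id] = {parent}
                adj[parent].add(next_id)
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return adj, frontier  # frontier now holds the leaf switches


def build_random_mesh(n, ports, seed=0):
    """Flat topology: keep linking random pairs of switches that still have
    a free port and are not already neighbours (hypothetical, simplified)."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    while True:
        open_nodes = [v for v in adj if len(adj[v]) < ports]
        candidates = [(a, b) for i, a in enumerate(open_nodes)
                      for b in open_nodes[i + 1:] if b not in adj[a]]
        if not candidates:
            return adj  # a few ports may stay unused
        a, b = rng.choice(candidates)
        adj[a].add(b)
        adj[b].add(a)


def hops_from(adj, src):
    """Breadth-first search: hop count from src to every reachable switch."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def mean_hops(adj, endpoints):
    """Average hop count over all reachable ordered pairs of distinct endpoints."""
    total = pairs = 0
    for s in endpoints:
        dist = hops_from(adj, s)
        for t in endpoints:
            if t != s and t in dist:
                total += dist[t]
                pairs += 1
    return total / pairs


tree, leaves = build_tree(fanout=4, levels=2)
mesh = build_random_mesh(n=len(leaves), ports=4, seed=1)
print("tree leaf-to-leaf mean hops:", mean_hops(tree, leaves))
print("random mesh mean hops:      ", mean_hops(mesh, list(mesh)))
```

In the tree, each of the 16 leaf switches reaches its three siblings in 2 hops and the other twelve leaves in 4, so the mean works out to 3.6 hops. The mesh figure depends on the random seed, and the comparison ignores everything that made real deployments hard, such as routing-table size and cabling.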

The gains, as AWS describes them, are substantial. The company claims the new design is up to a third faster for the east-west traffic that dominates AI and analytics workloads, and up to 40 percent more energy efficient than the hierarchical designs it replaces. Those are AWS-supplied figures from an executive interview, not independent benchmarks, and they are the load-bearing claims of the announcement.

The idea itself is not new. Researchers proposed a random-graph datacenter topology called Jellyfish in 2012. Jellyfish proved the math and the resilience properties, but its routing rules and cabling were too tangled to operate at hyperscale. The Register notes that AWS does not fully explain how it solved those constraints, including the limited memory of individual switches, which is the kind of constraint that tends to reappear at production size.

The timing is what matters for everyone else. AI training clusters have pushed east-west traffic to a level where the hierarchical tree, designed for north-south client requests, has become a measurable bottleneck. If a hyperscaler can run a flat random mesh in production and make its network up to 40 percent more energy efficient, as AWS claims, the rest of the industry will have to follow or absorb the cost. Operators who already run on AWS may see the savings indirectly, in price and capacity. The architectural playbook is what travels, not the brand.

What to watch: an independent benchmark, an AWS engineering paper, or a customer confirming the latency and power figures outside a vendor interview. Until then, the production milestone is the news. The mechanism is plausible, and the predecessor attempts explain exactly how hard the problem was.

