SwarmIO emulator lets researchers test SSDs that outperform today's fastest drives
KAIST's SwarmIO Gives Researchers a Way to Test Drives That Don't Exist Yet
There's a gap opening up in AI infrastructure, and it's getting harder to ignore. GPUs can now generate tens of millions of storage requests per second. The SSDs meant to service those requests cannot keep up. Kioxia's new GP Series drive, announced at NVIDIA GTC in March, targets 10 million IOPS this year with a roadmap to 100 million IOPS by 2027. Evaluation samples will be available to select customers in Q3 2026. But that hardware is still months away, and the system-level thinking about how to use it needs to start now.
That's the problem SwarmIO is built to solve.
A research group at KAIST led by Professor Minsoo Rhu, who was inducted into the HPCA Hall of Fame in 2021, the MICRO Hall of Fame in 2022, and the ISCA Hall of Fame in 2024, has published an open-source SSD emulator called SwarmIO that models IOPS-optimized storage at scales no physical drive currently sustains. The paper, posted to arXiv on April 8, describes a system that achieves 303.9 times the throughput of the best existing SSD emulator under GPU-initiated I/O workloads. It also demonstrates the emulator's utility in a vector search case study, where moving from 2.5 million IOPS to 40 million IOPS yielded a 9.7-times end-to-end speedup.
The core challenge is architectural. Traditional SSD emulators were designed for host-CPU I/O, where a relatively small number of threads generate requests. GPU-initiated I/O is fundamentally different: thousands of GPU threads can each submit independent storage requests simultaneously, producing request streams that dwarf what CPU-side systems were built to handle. The three bottlenecks the KAIST team identified are worth understanding in detail, because they're not obvious.
First, frontend scalability. Conventional emulators ingest requests through a shared software frontend that wasn't designed for massive parallelism. Under GPU workloads, queues build up immediately and throughput collapses. SwarmIO distributes request ingestion across a parallelism-aware architecture that scales with thread count rather than choking on it.
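To make the contrast with a shared frontend concrete, here is a minimal Python sketch of sharded request ingestion. It is an illustration only, not SwarmIO's actual design: the `ShardedFrontend` class, its method names, and the modulo mapping from submitter to shard are all hypothetical. The idea it shows is simply that each submitter contends only with others on the same shard, rather than with every other thread in the system.

```python
from collections import deque
from threading import Lock


class ShardedFrontend:
    """Hypothetical request frontend with one queue and one lock per shard."""

    def __init__(self, num_shards: int):
        self.queues = [deque() for _ in range(num_shards)]
        self.locks = [Lock() for _ in range(num_shards)]

    def submit(self, submitter_id: int, request) -> int:
        # Map each submitter to a shard by id (an assumed policy), so two
        # submitters contend for a lock only when they share a shard.
        shard = submitter_id % len(self.queues)
        with self.locks[shard]:
            self.queues[shard].append(request)
        return shard

    def drain(self, shard: int) -> list:
        # A worker empties only its own shard and never touches the others.
        with self.locks[shard]:
            items = list(self.queues[shard])
            self.queues[shard].clear()
        return items


frontend = ShardedFrontend(num_shards=4)
for tid in range(1024):  # stand-in for 1,024 submitting GPU threads
    frontend.submit(tid, ("read", tid))

per_shard = [len(frontend.drain(s)) for s in range(4)]
print(per_shard)  # [256, 256, 256, 256]
```

With a single shared queue, all 1,024 submitters would serialize on one lock. Here the load splits evenly, and adding shards adds ingestion capacity.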
Second, control path overhead. GPU-initiated I/O follows different data and control paths than CPU-generated I/O. In existing emulators, moving data between CPU-side storage structures and GPU-resident I/O buffers requires software-mediated copies that add latency at every step. SwarmIO offloads these copy operations to the Intel Data Streaming Accelerator (DSA), a hardware offload engine built into recent Intel Xeon server CPUs, and includes a co-designed kernel-level API to maximize DSA utilization in a multi-threaded environment.
Third, timing model maintenance. At very high IOPS rates, updating per-request state in the emulator's timing model becomes a bottleneck in itself. SwarmIO aggregates timing updates across batches of requests, amortizing the bookkeeping overhead without meaningfully degrading timing fidelity.
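The amortization idea can be sketched with a deliberately simple timing model. The code below assumes a single-server device with a constant 25 ns service time (1 second divided by 40 million IOPS); the function names and the model itself are assumptions for illustration, not SwarmIO's real timing model. The per-request version updates the shared `busy_until` state once per request, while the batched version updates it once per batch.

```python
SERVICE_NS = 25  # assumed constant service time: 1e9 ns / 40e6 IOPS


def per_request(arrivals_ns, busy_until=0):
    """Update the shared device state once for every request."""
    done = []
    for t in arrivals_ns:
        busy_until = max(t, busy_until) + SERVICE_NS
        done.append(busy_until)
    return done, busy_until


def batched(batch_arrival_ns, n, busy_until=0):
    """Update the shared device state once for a batch of n requests."""
    start = max(batch_arrival_ns, busy_until)
    busy_until = start + n * SERVICE_NS  # the only shared-state update
    # Completion times are derived locally from the batch start time.
    done = [start + (i + 1) * SERVICE_NS for i in range(n)]
    return done, busy_until


# Eight requests that all arrive at t = 1000 ns.
done_a, busy_a = per_request([1000] * 8)
done_b, busy_b = batched(1000, 8)
print(done_a == done_b, busy_a, busy_b)  # True 1200 1200
```

In this toy model the two approaches agree exactly when every request in a batch shares one arrival time. When arrivals are spread within a batch, treating them as simultaneous introduces a small timing error, which is the kind of fidelity trade-off any batching scheme has to keep bounded.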
The result is an emulator that KAIST validated against an enterprise SSD rated at 2.5 million IOPS to confirm its fidelity, and then scaled to 40 million IOPS on real hardware. The 100 million IOPS target remains aspirational pending faster evaluation hardware, but the architectural work is done. The code is on GitHub.
This matters because the storage industry is genuinely moving toward IOPS-first SSD designs, not just bandwidth-first ones. Kioxia's GP Series uses XL-FLASH storage-class memory, an SLC-based medium that sacrifices raw density for lower latency and higher IOPS, specifically to handle fine-grained 512-byte accesses. Current enterprise NVMe drives typically operate at a minimum access granularity of 4 KB. For GPU-initiated workloads serving KV cache lookups or attention head activations, the difference is significant: smaller accesses keep the PCIe bus better utilized and reduce wasted data movement.
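A quick back-of-the-envelope calculation shows why granularity matters at these request rates. The sketch below takes the 10 million IOPS figure from Kioxia's stated target and compares 512-byte accesses with 4 KB accesses (taken here as 4,096 bytes). The helper name `bandwidth_gb_s` is made up for this example, and the calculation assumes the workload needs only 512 bytes per lookup.

```python
def bandwidth_gb_s(iops: float, access_bytes: int) -> float:
    """Data moved per second, in GB/s (1 GB = 1e9 bytes)."""
    return iops * access_bytes / 1e9


IOPS = 10_000_000  # Kioxia's stated near-term target

fine = bandwidth_gb_s(IOPS, 512)     # 512-byte accesses
coarse = bandwidth_gb_s(IOPS, 4096)  # 4 KB accesses

print(f"512 B accesses: {fine:.2f} GB/s")    # 5.12 GB/s
print(f"4 KB accesses:  {coarse:.2f} GB/s")  # 40.96 GB/s
print(f"overhead factor: {coarse / fine:.0f}x")  # 8x
```

If each lookup only needs 512 bytes, fetching it in 4 KB units moves eight times as much data across the bus for the same useful work.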
NVIDIA's Storage-Next initiative, which includes Kioxia and InnoGrit among others, is explicitly designing this future. The architectural bet is that as AI models grow toward trillion-parameter scales with multi-million token context windows, HBM capacity will consistently fall short, and storage will have to function as an active, ephemeral memory tier rather than a passive capacity layer. The memory-storage hierarchy that has held for decades is being replaced by something flatter and more demanding.
SwarmIO is, in one sense, a timing-modeling contribution. But the deeper story is the chicken-and-egg problem it solves. System architects can't evaluate what hardware will do in their workloads until the hardware exists. An emulator that's fast enough to run full AI inference pipelines — not just microbenchmarks — lets them design ahead of the silicon. The 9.7-times vector search speedup isn't a prediction; it's a measurement on real GPU hardware running a retrieval-augmented generation workload, with storage performance modeled faithfully at 40 million IOPS.
That's a different kind of result than the benchmark numbers SSD vendors typically publish. It's also more useful. For anyone building or specifying AI inference systems in the next two to three years, the question isn't whether ultra-high IOPS storage will arrive. It will. The question is whether your software stack will be ready to use it when it does.