The Search That Beats the Optimization
IBM Research has published an empirical study showing that general-purpose coding agents, with no hardware-specific training, can meaningfully optimize hardware designs described in high-level code. The paper, arXiv:2603.25719, posted March 26, 2026, introduces an "agent factory": a two-stage pipeline that decomposes a high-level synthesis (HLS) design into sub-kernels, optimizes each independently, then uses a swarm of global exploration agents to find cross-function improvements that sub-kernel search misses. Scaling from 1 to 10 agents yields a mean 8.27x speedup over baseline, with streamcluster exceeding 20x and kmeans reaching approximately 10x. The paper uses Claude Code (Opus 4.5/4.6) with AMD Vitis HLS on a set of 12 kernels: six from HLS-Eval (AES, DES, KMP, NW, PRESENT, SHA256) and six from Rodinia-HLS (lavamd, kmeans, hotspot, leukocyte, cfd, and streamcluster).
The most counterintuitive finding is that the best final designs frequently did not originate from the top-ranked ILP candidates. Stage 1 generates sub-kernel optimization variants and uses an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. Stage 2 then launches N exploration agents from those top-ranked candidates, each applying full-design transformations (pragma recombination, loop fusion, memory restructuring) that sub-kernel decomposition cannot reach. The result is that global search finds improvements that the ILP ranking, working only from Stage 1 outputs, cannot predict.
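To make the Stage 1 selection step concrete, here is a minimal C++ sketch of picking one variant per sub-kernel under an area budget and ranking the feasible combinations by latency. It is not the paper's formulation: the article does not describe the actual ILP, so the additive latency and area model, the names `Variant`, `Config`, and `top_k_configs`, and the brute-force enumeration (a stand-in for a real ILP solver, workable only for small instances) are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical cost model: each sub-kernel variant has an estimated
// latency (cycles) and area (arbitrary units). Both are assumed additive.
struct Variant { int latency; int area; };

// One assembled design: the chosen variant index per sub-kernel plus totals.
struct Config {
    std::vector<std::size_t> choice;
    int latency;
    int area;
};

// Enumerate every one-variant-per-sub-kernel combination, keep those within
// the area budget, and return the k with the lowest total latency.
// Exhaustive enumeration replaces the ILP solver here purely for illustration.
std::vector<Config> top_k_configs(const std::vector<std::vector<Variant>>& kernels,
                                  int area_budget, std::size_t k) {
    std::vector<Config> feasible;
    for (const auto& variants : kernels) {
        if (variants.empty()) return feasible;  // no valid assembly exists
    }
    std::vector<std::size_t> idx(kernels.size(), 0);
    while (true) {
        int latency = 0;
        int area = 0;
        for (std::size_t j = 0; j < kernels.size(); ++j) {
            latency += kernels[j][idx[j]].latency;
            area += kernels[j][idx[j]].area;
        }
        if (area <= area_budget) {
            feasible.push_back(Config{idx, latency, area});
        }
        // Advance the index vector like an odometer; stop after the last combo.
        std::size_t j = 0;
        while (j < idx.size() && ++idx[j] == kernels[j].size()) {
            idx[j] = 0;
            ++j;
        }
        if (j == idx.size()) break;
    }
    // Rank by latency, breaking ties in favor of smaller area.
    std::sort(feasible.begin(), feasible.end(),
              [](const Config& a, const Config& b) {
                  return a.latency < b.latency ||
                         (a.latency == b.latency && a.area < b.area);
              });
    if (feasible.size() > k) feasible.resize(k);
    return feasible;
}
```

In this toy model, the returned configurations would play the role of the starting points handed to the Stage 2 exploration agents. The ranking only sees per-sub-kernel estimates, which is one way to picture why a whole-design transformation could overtake the top-ranked candidate.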
The approach recovered known HLS optimization patterns without domain-specific prompting. Agents consistently applied ARRAY_PARTITION to resolve memory bottlenecks and learned that PIPELINE pragmas are ineffective unless loop-carried dependencies are first addressed — patterns that match established hardware expertise. On lavamd, agents achieved approximately 8x speedup at 40-60K area, improving area-latency trade-offs against reference implementations.
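The PIPELINE observation can be illustrated with a small C++ sketch written in the style of Vitis HLS source. It is not taken from the paper or its benchmarks: the dot-product kernel, the names `dot_naive` and `dot_interleaved`, and the constants `N` and `LANES` are hypothetical, and the code has not been synthesized, so the comments describe the intended effect rather than measured results. A standard C++ compiler ignores the `#pragma HLS` lines, so both functions can be checked functionally on a CPU.

```cpp
#include <cstddef>

constexpr std::size_t N = 64;     // hypothetical kernel size
constexpr std::size_t LANES = 4;  // placeholder; ideally >= float adder latency in cycles

// Naive version: every iteration reads and writes the same accumulator.
// This loop-carried dependency on `sum` can keep the tool from reaching
// II=1 even though PIPELINE is requested.
float dot_naive(const float a[N], const float b[N]) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        sum += a[i] * b[i];
    }
    return sum;
}

// Restructured version: consecutive iterations update different partial sums,
// so the dependency distance grows from 1 iteration to LANES iterations.
float dot_interleaved(const float a[N], const float b[N]) {
    float partial[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
    // Request separate registers for the partial sums so that they do not
    // share the ports of a single memory.
#pragma HLS ARRAY_PARTITION variable=partial type=complete dim=1

    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        partial[i % LANES] += a[i] * b[i];
    }

    // Short final reduction over the partial sums.
    float sum = 0.0f;
    for (std::size_t l = 0; l < LANES; ++l) {
        sum += partial[l];
    }
    return sum;
}
```

Both functions compute the same dot product when the arithmetic is exact; with general floating-point inputs the interleaved version may differ slightly because it changes the order of additions. Whether the restructured loop actually reaches the requested initiation interval depends on the tool version, the target device, and the adder latency.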
The paper's framing is careful: this is a preliminary empirical study. The benchmark set is 12 kernels, all evaluated with a single model family and a single synthesis toolchain (Vitis HLS) targeting FPGA. Baselines are bounded exhaustive searches over restricted directive sets, not comparisons against state-of-the-art DSE frameworks like AutoDSE. Simpler kernels saturated early — additional agents increased area without proportional latency gains. The authors present this as a baseline for community extension: broader benchmarks, additional model families, diverse target architectures.
The authors are Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri, and Akash Srivastava, all of IBM Research. The anonymized implementation is linked in the paper.