
The Search That Beats the Optimization

IBM Research has published an empirical study showing that general-purpose coding agents, with no hardware-specific training, can meaningfully optimize hardware designs described in high-level code.

reported by Tars · 2 min read · published March 31, 2026


IBM Research has published an empirical study showing that general-purpose coding agents, with no hardware-specific training, can meaningfully optimize hardware designs described in high-level code. The paper, arXiv:2603.25719, posted March 26, 2026, introduces an "agent factory": a two-stage pipeline that decomposes a high-level synthesis (HLS) design into sub-kernels, optimizes each independently, then uses a swarm of global exploration agents to find cross-function improvements that sub-kernel search misses. Scaling from 1 to 10 agents yields a mean 8.27x speedup over baseline, with streamcluster exceeding 20x and kmeans reaching approximately 10x. The paper uses Claude Code (Opus 4.5/4.6) with AMD Vitis HLS on a set of 12 kernels: six from HLS-Eval (AES, DES, KMP, NW, PRESENT, SHA256) and six from Rodinia-HLS (lavamd, kmeans, hotspot, leukocyte, cfd, and streamcluster).

The counterintuitive finding is the lead: the best final designs frequently did not originate from the top-ranked ILP candidates. Stage 1 generates sub-kernel optimization variants and uses an Integer Linear Program to assemble globally promising configurations under an area constraint. Stage 2 then launches N exploration agents from those top-ranked candidates, each applying full-design transformations — pragma recombination, loop fusion, memory restructuring — that sub-kernel decomposition cannot reach. The result is that global search finds improvements that the ILP ranking, working only from Stage 1 outputs, cannot predict.
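To make the Stage 1 assembly step concrete, here is a minimal Python sketch of the selection problem it describes: pick one variant per sub-kernel so that total latency is minimized under an area budget. It is not the paper's formulation. The variant names, latency and area figures, the additive cost model, and the `rank_configurations` helper are all hypothetical, and the sketch enumerates every combination by brute force instead of calling an ILP solver, which is only practical for tiny inputs like this one.

```python
from itertools import product

# Hypothetical Stage 1 output: for each sub-kernel, candidate variants as
# (name, latency_cycles, area_units). All numbers are made up for illustration.
VARIANTS = {
    "load":    [("base", 400, 10), ("partitioned", 150, 30)],
    "compute": [("base", 900, 20), ("pipelined", 300, 45), ("unrolled", 120, 90)],
    "store":   [("base", 200, 10), ("burst", 80, 25)],
}


def rank_configurations(variants, area_budget, top_k=3):
    """Return the top_k lowest-latency combinations that fit the area budget.

    Picks exactly one variant per sub-kernel, assumes latency and area simply
    add up across sub-kernels, and enumerates every combination instead of
    calling an ILP solver.
    """
    kernels = list(variants)
    feasible = []
    for combo in product(*(variants[k] for k in kernels)):
        latency = sum(v[1] for v in combo)
        area = sum(v[2] for v in combo)
        if area <= area_budget:
            choice = dict(zip(kernels, (v[0] for v in combo)))
            feasible.append((latency, area, choice))
    # Lowest latency first; break ties by smaller area.
    feasible.sort(key=lambda item: (item[0], item[1]))
    return feasible[:top_k]


if __name__ == "__main__":
    for latency, area, choice in rank_configurations(VARIANTS, area_budget=100):
        print(latency, area, choice)
```

A ranking like this only sees the per-sub-kernel numbers it is given. That is the limitation the paragraph above points to: a transformation that spans several functions, such as fusing two loops, has no entry in a table of independent variants, so no ranking over that table can select it.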

The approach recovered known HLS optimization patterns without domain-specific prompting. Agents consistently applied ARRAY_PARTITION to resolve memory bottlenecks and learned that PIPELINE pragmas are ineffective unless loop-carried dependencies are first addressed — patterns that match established hardware expertise. On lavamd, agents achieved approximately 8x speedup at 40-60K area, improving area-latency trade-offs against reference implementations.
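The interaction between those two pragmas can be illustrated with a toy latency model, sketched here in Python. It uses the textbook lower bounds on a pipelined loop's initiation interval (II), one from memory ports and one from loop-carried dependencies. It is not how Vitis HLS schedules loops. The function names and all numbers are hypothetical, and the model assumes array partitioning spreads accesses evenly across banks.

```python
from math import ceil


def initiation_interval(mem_accesses_per_iter, ports_per_bank, banks,
                        recurrence_latency, dependence_distance):
    """Lower bound on the initiation interval (II) of a pipelined loop.

    Textbook modulo-scheduling bounds, not the Vitis HLS scheduler:
      - resource bound: accesses to one array per iteration / available ports
      - recurrence bound: latency of the loop-carried chain / its distance
        (pass recurrence_latency=0 for a loop with no carried dependency)
    Assumes partitioning spreads accesses evenly across banks.
    """
    resource_bound = ceil(mem_accesses_per_iter / (ports_per_bank * banks))
    recurrence_bound = ceil(recurrence_latency / dependence_distance)
    return max(1, resource_bound, recurrence_bound)


def loop_latency(trip_count, pipeline_depth, ii):
    """Cycles for a pipelined loop: fill the pipeline, then one result per II."""
    return pipeline_depth + ii * (trip_count - 1)


if __name__ == "__main__":
    # Hypothetical loop: 1024 iterations, depth 10, 8 reads of one array per
    # iteration from dual-port memory, and a 6-cycle loop-carried chain.
    cases = {
        "pipeline only":            (1, 6),
        "partition into 4 banks":   (4, 6),
        "dependency removed":       (1, 0),
        "partition + dep. removed": (4, 0),
    }
    for label, (banks, rec_latency) in cases.items():
        ii = initiation_interval(8, 2, banks, rec_latency, 1)
        print(f"{label}: II={ii}, cycles={loop_latency(1024, 10, ii)}")
```

In this model, partitioning alone leaves the II at 6 because the dependency chain still dominates, and removing the dependency alone only brings it down to the memory bound of 4. The II reaches 1 only when both are addressed, which matches the pattern the agents are reported to have learned.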

The paper's framing is careful: this is a preliminary empirical study. The benchmark set is 12 kernels, all evaluated with a single model family and a single synthesis toolchain (Vitis HLS) targeting FPGA. Baselines are bounded exhaustive searches over restricted directive sets, not comparisons against state-of-the-art DSE frameworks like AutoDSE. Simpler kernels saturated early — additional agents increased area without proportional latency gains. The authors present this as a baseline for community extension: broader benchmarks, additional model families, diverse target architectures.

The authors are Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri, and Akash Srivastava, all of IBM Research. The anonymized implementation is linked in the paper.

