Nvidia Research's HORIZON is a self-evolving agent framework for hardware design, and it has hit 100% completion across four major chip-design test suites: ChipBench, RTLLM, Verilog-Eval, and nine categories from the CVDP benchmark family. The result, reported in Semiconductor Engineering's coverage of the preprint, is real. So is the caveat the researchers put near the top of their arXiv preprint: these are controlled proxy benchmarks, not the broader engineering problem of designing chips that ship.
That gap, between the score and what the score does not prove, is the actual story.
HORIZON's authors, Cunxi Yu, Chenhui Deng, Nathaniel Pinckney, and Brucek Khailany, all Nvidia Research, describe a system that runs without a human in the loop. A Markdown harness is compiled into a "project pack" containing domain knowledge, an executable evaluator, an acceptance predicate, and a git/runtime policy. The hands-free agent then evolves an isolated git worktree, with each git commit serving as an acceptance checkpoint. The architecture is, in effect, a software-engineering loop: write a candidate, simulate it, check whether it passes, commit if it does, repeat.
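The write, simulate, check, commit loop described above can be sketched in a few lines of Python. This is a toy illustration, not HORIZON's implementation: the names `ProjectPack`, `Worktree`, `simulate`, and `evolve` are hypothetical, the "RTL modules" are plain Python functions, the candidate list stands in for whatever the agent would generate, and the acceptance rule (commit only when a candidate passes more test vectors than the last checkpoint) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

Module = Callable[[int, int], int]   # stand-in for an RTL module
Candidate = Tuple[str, Module]       # (label, implementation)

# Illustrative test vectors (a, b, expected) for a 1-bit XOR.
VECTORS = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]


@dataclass
class ProjectPack:
    """Hypothetical bundle mirroring the four parts named in the article."""
    domain_knowledge: str
    evaluator: Callable[[Module], int]   # executable evaluator
    accept: Callable[[int, int], bool]   # acceptance predicate (new, best)
    max_steps: int                       # stand-in for the runtime policy


@dataclass
class Worktree:
    """In-memory stand-in for an isolated git worktree."""
    commits: List[Tuple[str, int]] = field(default_factory=list)

    def commit(self, name: str, score: int) -> None:
        self.commits.append((name, score))


def simulate(module: Module) -> int:
    """Count how many test vectors the candidate passes."""
    return sum(1 for a, b, want in VECTORS if module(a, b) == want)


def evolve(pack: ProjectPack, proposals: Iterable[Candidate]) -> Worktree:
    """Evaluate each proposal; commit only those the predicate accepts."""
    tree = Worktree()
    best = 0
    for step, (name, module) in enumerate(proposals):
        if step >= pack.max_steps:
            break
        score = pack.evaluator(module)
        if pack.accept(score, best):
            tree.commit(name, score)   # each commit is a checkpoint
            best = score
        if best == len(VECTORS):       # every vector passes: stop
            break
    return tree


pack = ProjectPack(
    domain_knowledge="1-bit XOR: output is 1 when the inputs differ",
    evaluator=simulate,
    accept=lambda new, best: new > best,
    max_steps=10,
)

# Fixed proposals standing in for agent-generated candidates.
proposals = [
    ("and", lambda a, b: a & b),
    ("or", lambda a, b: a | b),
    ("const0", lambda a, b: 0),
    ("xor", lambda a, b: a ^ b),
]

tree = evolve(pack, proposals)
print(tree.commits)
```

In this sketch the `const0` candidate is evaluated but never committed, because it passes fewer vectors than the checkpoint before it. That is the property the commit-as-checkpoint design is meant to give: the history only ever records states that cleared the gate.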
The framing shift is from one-shot RTL (register-transfer level) prompt completion to a version-controlled, evolving workspace where generation, simulation, and repair are tied to executable acceptance gates. That distinction is what separates HORIZON from earlier attempts to apply large language models to RTL code, which describes a chip's behavior as data moving between registers and is typically written in a hardware description language such as Verilog.
The work extends a line of repository-scale self-evolution research that has previously targeted software systems. AlphaEvolve applied the pattern to algorithm kernels. SATLUTION applied it to SAT solvers. ABCEvo applied it to logic synthesis in ABC, a widely used open-source tool in the EDA (electronic design automation) toolkit. HORIZON's contribution is to push that pattern into the hardware designs engineers create, not just the tools they use to build them.
The benchmarks the system cleared are the standard academic proxies for the early stages of chip design. ChipBench and Verilog-Eval test functional correctness on small RTL modules. RTLLM probes LLM-generated hardware on more complex designs. CVDP, the Comprehensive Verilog Design Problems benchmark, contributes nine categories. All of them are "is the code right?" questions, not "will this chip work in a phone?" questions. The paper's own evaluation makes the distinction: once executability is solved, the residual bottlenecks are efficiency and verification quality, not pass-rate convergence.
That last point comes from Section 5 of the preprint, which the authors devote to limitations and open research challenges rather than further benchmark wins. The problems they flag sit in the gap between clearing these controlled suites and the larger job of taking a chip from verified RTL code to a manufacturable tape-out: scaling to bigger designs, integrating the agent loop with the human review and sign-off that production programs still require, and the efficiency and verification-quality issues that survive even once pass rate is solved.
For now, HORIZON is a research result on controlled suites, not a tool Nvidia is shipping, not a customer-facing product, and not a replacement for the engineering teams that take a chip from RTL to silicon. The 100% line is the headline. The interesting part is what comes after it.