Only 14% of Chips Work First Try. Here's the AI Fix.
Silicon Valley is obsessed with AI that writes code. Chipmakers quietly need AI that verifies it. UCAgent is built for the latter.

UCAgent, an LLM-based agent from the Institute of Computing Technology, automates hardware functional verification by converting RTL designs to Python, bypassing HDL entirely to exploit the 212x larger pool of Python training data relative to SystemVerilog. The system drives a 31-stage verification workflow through Picker, which converts RTL designs into Python packages, and Toffee, which supplies Python-native verification abstractions, achieving up to 98.5% code coverage and 100% functional coverage on UART, FPU, and integer divider modules. MCP integration lets Claude Code, Copilot, and other agents call UCAgent as a backend.
When hardware verification engineers say their job is thankless, they are not exaggerating. Functional verification consumes roughly 70 percent of total integrated circuit project time, according to industry studies cited in the UCAgent paper. In 2024, only 14 percent of IC projects achieved first-silicon success, with logic and functional defects remaining the leading cause of costly re-spins. The work is repetitive, precise, and chronically underfunded relative to the damage it prevents. A team at the Institute of Computing Technology, Chinese Academy of Sciences (ICT CAS), thinks an LLM agent can do most of it automatically — and their approach sidesteps the HDL problem entirely.
UCAgent, posted to arXiv on March 26, 2026 by authors Junyue Wang, Zhicheng Yao, Yan Pi, Xiaolong Li, Fangyuan Song, Jinru Wang, Yunlong Xie, Sa Wang, and Yungang Bao, is an end-to-end agent for block-level hardware functional verification. The architecture has three core components. First, a Python verification environment built on Picker and Toffee — Picker converts Register Transfer Level (RTL) designs into Python packages, and Toffee provides UVM-like verification abstractions, so the entire flow runs in Python without generating any SystemVerilog. Second, a configurable 31-stage verification workflow with automated per-stage checkers that handle requirement analysis, infrastructure construction, coverage interface setup, and testcase execution. Third, a Verification Consistency Labeling Mechanism (VCLM) that assigns hierarchical labels to LLM-generated artifacts, maintaining traceability from specification to coverage metrics to individual test cases.
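The article does not show Picker's or Toffee's actual APIs, but the Python-native idea can be sketched without them. Below is a minimal, hedged illustration: a pure-Python stand-in for the kind of package Picker would emit from an integer-divider RTL design, plus a Toffee-style functional-coverage check. Every class and function name here is hypothetical, not taken from either library.

```python
# Illustrative sketch only. A pure-Python stand-in for a Picker-converted
# integer-divider DUT; Picker/Toffee's real APIs are not shown in the
# article, so all names below are hypothetical.

class DividerDUT:
    """Stand-in for the Python package Picker would emit from RTL."""

    def step(self, dividend: int, divisor: int) -> tuple[int, int]:
        # A real DUT would be cycle-accurate; this models the contract only.
        if divisor == 0:
            return (0, dividend)  # assumed divide-by-zero convention
        return (dividend // divisor, dividend % divisor)


def check_divider(dut: DividerDUT, cases) -> dict:
    """Toffee-style functional coverage: count which bin each case hits."""
    bins = {"normal": 0, "div_by_zero": 0, "negative": 0}
    for dividend, divisor in cases:
        quotient, remainder = dut.step(dividend, divisor)
        if divisor == 0:
            bins["div_by_zero"] += 1
        elif dividend < 0 or divisor < 0:
            bins["negative"] += 1
        else:
            bins["normal"] += 1
            # Reference-model check: reconstruct the dividend.
            assert quotient * divisor + remainder == dividend
    return bins


coverage = check_divider(DividerDUT(), [(7, 2), (9, 3), (5, 0), (-4, 2)])
```

The point of the pattern is that both the DUT interface and the coverage bins live in ordinary Python, where the LLM's generation quality is highest.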
The architectural choice to bypass hardware description languages is not accidental — it is a direct response to the training data problem. In The Stack v2, the largest open code dataset, Python accounts for 178.44 GB of data. Verilog accounts for 9.348 GB. SystemVerilog accounts for 0.84 GB. That is a 212x disparity between Python and SystemVerilog, and it explains why LLMs generate poor HDL: they have almost nothing to learn from. UCAgent's solution is to work where the data is — converting RTL to Python, then using the LLM's Python fluency to generate Python testbenches. LangChain with a ReAct reasoning loop drives the agent, and the Model Context Protocol (MCP) integration means Claude Code, OpenHands, Copilot, Gemini-CLI, and Qwen-Code can all call it as a backend. The three-tier stage checking hierarchy — Python Checker, LLM Checker, and Human Checker — provides escalation gates before anything gets marked verified.
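The three-tier escalation idea is easy to picture in code. The paper's actual checker interfaces are not reproduced in the article, so the sketch below is an assumption-laden illustration of the pattern: run the cheap deterministic check first, escalate to an LLM review, and only then to a human, stopping at the first failure.

```python
# Hedged sketch of the three-tier stage-checking hierarchy (Python Checker,
# LLM Checker, Human Checker). All names and signatures are illustrative,
# not UCAgent's real interfaces.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    tier: str        # which checker produced the verdict
    detail: str = ""


def run_stage(artifact: str,
              python_check: Callable[[str], bool],
              llm_check: Callable[[str], bool],
              human_check: Callable[[str], bool]) -> Verdict:
    """Escalate through progressively costlier checkers; fail fast."""
    if not python_check(artifact):
        return Verdict(False, "python", "deterministic check failed")
    if not llm_check(artifact):
        return Verdict(False, "llm", "semantic review flagged the artifact")
    if not human_check(artifact):
        return Verdict(False, "human", "engineer rejected the artifact")
    return Verdict(True, "human", "all three tiers passed")


# Toy usage: tier 1 requires the generated testbench to at least compile
# as Python; the LLM and human tiers are stubbed as always-approving.
verdict = run_stage(
    "assert 7 // 2 == 3",
    python_check=lambda src: compile(src, "<tb>", "exec") is not None,
    llm_check=lambda src: True,
    human_check=lambda src: True,
)
```

The design choice worth noting is the ordering: deterministic checks are free to rerun, LLM checks cost tokens, and human checks cost engineer attention, so nothing reaches the expensive tiers until the cheap ones pass.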
Experimental results on UART, FPU, and integer divider modules show up to 98.5 percent code coverage and 100 percent functional coverage, according to the paper. More significantly, the authors report that UCAgent discovered previously unidentified design defects in realistic designs. The verification hackathon on the OpenVerify platform lists eight example modules, seven fully automated and one requiring human-machine collaboration (DualPort), suggesting the approach is not purely theoretical.
There is an obvious question: does this work in production? The evidence is suggestive but incomplete. The coverage numbers hold on three block-level modules; the defect discovery is real but limited in scope. The code is on GitHub under what appears to be an active project, and the MCP integration is a practical choice — being callable from Claude Code and similar tooling means engineers can supervise runs without abandoning their existing workflow. The 31-stage workflow is configurable, which matters: verification requirements vary significantly across block types, and a rigid pipeline would break on anything non-standard.
The training data disparity angle deserves attention beyond hardware. Python's 212x advantage over SystemVerilog in The Stack v2 is not unique to hardware description languages — it is a proxy for a broader pattern. Any domain where the target output language has sparse representation in training data is a candidate for the "work in the LLM's fluent domain, then translate" pattern. Hardware verification may be the clearest current demonstration, but the principle scales.
VCLM is worth watching specifically. Traceability from specification through coverage to test case is the load-bearing constraint in verification — it is what makes a failed coverage metric actionable rather than just a number. Most automated verification tooling optimizes coverage; UCAgent adds a labeling layer that ties coverage back to requirements. That is novel, and it is where the approach could either prove durable or become another layer that drifts out of sync under production pressure.
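A minimal data model makes the traceability claim concrete. The label structure below is assumed, not taken from the paper: each coverage bin carries a hierarchical label whose root is a spec requirement, and each test case records the labels it exercises, so a failed bin can be inverted back to the tests that should have covered it.

```python
# Illustrative VCLM-style traceability model; the label format
# "<requirement>/<coverage-group>/<bin>" is an assumption, not the
# paper's actual scheme.
from collections import defaultdict

testcase_labels = {
    "test_div_normal":   ["REQ-DIV-1/quotient/normal"],
    "test_div_by_zero":  ["REQ-DIV-2/exception/div_by_zero"],
    "test_div_negative": ["REQ-DIV-1/quotient/negative"],
}


def requirements_covered(labels_by_test):
    """Invert the mapping: which tests trace back to each requirement?"""
    by_req = defaultdict(set)
    for test, labels in labels_by_test.items():
        for label in labels:
            requirement = label.split("/")[0]
            by_req[requirement].add(test)
    return dict(by_req)


trace = requirements_covered(testcase_labels)
# A coverage hole under REQ-DIV-1 now points at a concrete set of tests
# rather than at a bare percentage, which is what makes it actionable.
```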
What UCAgent is not is a replacement for verification engineers. The workflow is automated; the supervision is not. Complex blocks — the DualPort module requiring human-machine collaboration in the hackathon set — still need a human in the loop. The question for adoption is whether the EDA ecosystem will integrate it, whether verification teams at chip companies will restructure workflows around it, and whether the coverage and defect-discovery numbers hold on a wider corpus of production designs. The architecture is real; the production deployment path is not yet clear.
The paper comes from ICT CAS and the Beijing Institute of Open Source Chip, a group that has been building toward open-source silicon tooling for some time. UCAgent fits that pattern — not a commercial product launch, but an open research contribution with implementation. That matters for reproducibility and community adoption, which is where the real test will play out.