Chiplets Broke The On-Chip Telemetry Plane. AI Is Now Paying For It.

AI's hunger for compute has pushed chipmakers toward chiplets, modular silicon tiles that replace one monolithic slab with several smaller dies wired into a single processor package. The trade has paid off in yield, in cost, and in the ability to mix specialized accelerators without rebuilding a chip from scratch. It has also, as a direct consequence of that design, broken the unified telemetry plane engineers used to take for granted inside a single die.

That matters because the same AI workloads that justify chiplets are now being asked to manage, debug, and optimize the chips that run them, and AI can only work with the data the silicon exposes. Inside a monolithic chip, on-die sensors and trace infrastructure gave engineers a continuous picture of traffic, latency, congestion, and fault behavior. Once that picture is split across several dies sharing a package, each die tends to ship its own isolated telemetry. What used to be a single stream is now a patchwork of partial views that don't naturally align, and according to a recent Semiconductor Engineering roundtable, this fragmentation is best understood as an architectural failure, not a debugging nuisance.

The piece is the latest "Experts At The Table" panel from Semiconductor Engineering, gathering nine EDA and silicon leaders around one question: where does observability actually belong in the AI-era chiplet stack? Their answers line up less around tooling and more around fabric. Observability, in the AI era, is not a debugging afterthought bolted on after tape-out. It is a missing architectural layer that has to be designed into the package fabric from the start, with cross-die correlation as the baseline rather than the goal.

The architectural response is already underway. Vendors are pulling telemetry collection closer to the sensors themselves, doing reduction in hardware so the package does not drown in raw traces. They are exposing programmable collection points so architects can decide what to capture, at what cadence, and at what fidelity, instead of being handed whatever the die happened to log. And they are moving toward software-accessible data models, standardized ways for downstream tools to consume what the silicon produces without reverse-engineering each die's private format. Each of those steps is a move toward treating observability as a first-class layer of the package rather than an artifact of a single die.
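To make the idea of near-sensor reduction concrete, here is a minimal software sketch of a programmable collection point. It is an illustration only: the names `CollectionConfig`, `Summary`, and `reduce_samples`, along with the window and peak-retention knobs, are hypothetical and do not come from any vendor's hardware or from the roundtable. The point is simply that cadence and fidelity become configuration, and that raw samples are folded into compact summaries before they leave the die.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class CollectionConfig:
    """Hypothetical knobs an architect might program into a collection point."""
    window: int            # cadence: raw samples folded into one summary
    keep_max: bool = True  # fidelity: whether to retain the per-window peak


@dataclass(frozen=True)
class Summary:
    die_id: str
    sensor: str
    start_tick: int        # index of the first raw sample in the window
    mean_value: float
    max_value: Optional[float]


def reduce_samples(die_id: str, sensor: str, samples: Iterable[float],
                   cfg: CollectionConfig) -> Iterator[Summary]:
    """Fold raw samples into one Summary per full window.

    A trailing partial window is dropped in this sketch.
    """
    window = []
    start_tick = 0
    for tick, value in enumerate(samples):
        if not window:
            start_tick = tick
        window.append(value)
        if len(window) == cfg.window:
            yield Summary(die_id, sensor, start_tick, mean(window),
                          max(window) if cfg.keep_max else None)
            window = []


# Five raw latency samples become two summaries at a window of two.
cfg = CollectionConfig(window=2)
summaries = list(reduce_samples("die0", "latency_ns", [10, 20, 30, 40, 50], cfg))
```

Real hardware would do this in fixed-function or microcoded logic next to the sensor; the sketch only shows the shape of the trade between raw trace volume and what the package actually ships upstream.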

In practical terms, that means an accelerator failure in one chiplet can be correlated against traffic patterns in a neighboring tile, thermal behavior in the package substrate, and firmware events from the host. Without that correlation, fleet operators are stuck rebooting black boxes and chasing symptoms. With it, the loop closes: silicon tells the AI what is happening, AI proposes a response, and humans see a model they can trust before they act on it. The faster that loop gets, the more leverage AI gets from fleet-scale telemetry, and the harder it is for any one die to hide a fault from the rest of the package.
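A rough sketch of what that correlation step involves, under stated assumptions: each source (die, substrate, host firmware) timestamps events in its own clock domain, and a per-source offset onto a shared package timeline is already known. The `Event` type, the `offsets` table, and the event names are all hypothetical, chosen only to illustrate the alignment problem.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    source: str    # e.g. "compute_die", "io_die", "substrate"
    local_ts: int  # timestamp in the source's own clock domain
    kind: str


def to_package_time(event: Event, offsets: Dict[str, int]) -> int:
    """Map a local timestamp onto the shared timeline (assumed known offsets)."""
    return event.local_ts + offsets[event.source]


def correlate(fault: Event, events: Iterable[Event],
              offsets: Dict[str, int], window: int) -> List[Tuple[int, Event]]:
    """Return (delta, event) pairs within +/- window of the fault, earliest first."""
    t0 = to_package_time(fault, offsets)
    nearby = []
    for e in events:
        if e == fault:
            continue
        delta = to_package_time(e, offsets) - t0
        if abs(delta) <= window:
            nearby.append((delta, e))
    return sorted(nearby, key=lambda pair: pair[0])


# Hypothetical offsets and events for illustration only.
offsets = {"compute_die": 0, "io_die": -50, "substrate": 20}
fault = Event("compute_die", 1000, "accelerator_fault")
congestion = Event("io_die", 1040, "link_congestion")  # package time 990
thermal = Event("substrate", 985, "thermal_spike")     # package time 1005
retrain = Event("io_die", 2000, "link_retrain")        # package time 1950
related = correlate(fault, [congestion, thermal, retrain], offsets, window=100)
```

Here `related` holds the congestion event ten ticks before the fault and the thermal spike five ticks after it, while the distant retrain is excluded. The hard part in practice is the piece this sketch assumes away: obtaining trustworthy offsets between clock domains that were never designed to share a timeline.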

The harder problem is upstream of any single vendor's roadmap. A multi-vendor chiplet ecosystem, where one company designs the compute die, another the I/O die, a third the memory tile, needs a shared, secure telemetry schema to make any of this work across package boundaries. Today, that standard does not exist, and the experts gathered by Semiconductor Engineering treat the gap as a real coordination failure, not a marketing grievance. Without it, fault localization across die, package, and interconnect stays bespoke, security boundaries around operational data stay ambiguous, and AI's ability to read a fleet of mixed-origin accelerators stays capped by what each supplier chooses to expose.
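To illustrate what a shared schema would buy, consider a minimal sketch in which two suppliers report die temperature in private formats and thin adapters normalize both into one common record. Everything here is an assumption made for illustration: the `TelemetryRecord` fields, the vendor names, and both raw formats are invented, and no such standard currently exists.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class TelemetryRecord:
    """Hypothetical vendor-neutral record; not an existing standard."""
    vendor: str
    die_id: str
    metric: str
    unit: str
    value: float
    timestamp_ns: int


def from_vendor_a(raw: Dict[str, Any]) -> TelemetryRecord:
    # Invented format: temperature in millidegrees Celsius, time in microseconds.
    return TelemetryRecord("vendor_a", raw["die"], "temperature", "celsius",
                           raw["temp_mC"] / 1000, raw["t_us"] * 1000)


def from_vendor_b(raw: Dict[str, Any]) -> TelemetryRecord:
    # Invented format: temperature in kelvin, time already in nanoseconds.
    return TelemetryRecord("vendor_b", raw["id"], "temperature", "celsius",
                           raw["kelvin"] - 273.15, raw["ns"])


# Two private formats, one comparable view for downstream tools.
rec_a = from_vendor_a({"die": "compute0", "temp_mC": 85500, "t_us": 12})
rec_b = from_vendor_b({"id": "io0", "kelvin": 358.65, "ns": 12500})
hottest = max([rec_a, rec_b], key=lambda r: r.value)
```

Without an agreed schema, every integrator writes and maintains adapters like these for every supplier, and anything a supplier chooses not to expose never reaches the common record at all. The sketch also leaves out the security question entirely: who is allowed to read which fields across die boundaries.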

A companion Semiconductor Engineering piece, "Observability Is Essential For Modern Silicon," frames the same idea from the other side: observability has stopped being a debugging afterthought and started looking like a baseline architectural requirement. The chiplet conversation is the sharper version of that argument, because it shows exactly what breaks when you stop taking the unified telemetry plane for granted, and why the cost of that fracture compounds when AI is the consumer.

That is the tension the industry is sitting on. The instrumentation is getting richer; the schemas for sharing it are not catching up. The semiconductor industry's bet on modular silicon has bought throughput, yield, and configurability. The cost is that the same modularity has split the telemetry plane AI was supposed to read, and standardized cross-die observability is the layer that has to be wired in before the bet fully pays off.

The watch items going forward are specific. Will chiplet interoperability efforts such as UCIe, and the die-to-die standards being trialed in advanced packaging, pick up telemetry as a first-class signaling layer? Will AI fleet-management vendors push hard enough on standardized schemas that suppliers cannot keep shipping proprietary telemetry formats? And will the security model for cross-die operational data get worked out before integrators are forced to choose between visibility and isolation?

Those are the questions the industry is not yet answering, and they are also the questions that decide how far AI's reach into its own infrastructure can extend.
