Tenstorrent has spent years as a speck in Nvidia's rearview mirror. Its CEO, Jim Keller, the veteran chip architect who helped design Apple's early in-house silicon and later led development of AMD's Zen architecture, is now betting the company on a contrarian reframing of where the next AI infrastructure dollars will go.
The claim, laid out in an interview with EE Times, is that AI inference is a networking and memory problem, not a compute one. Inference is the act of running a trained model to answer a prompt; training is the much more expensive step of building that model in the first place. Tenstorrent's new BlackHole Galaxy server, with 56 Ethernet ports per box versus roughly 8 on a typical GPU server, is built to exploit that gap at what Keller says is a fraction of the hardware cost of either GPU-centric systems or purpose-built AI accelerators.
The numbers Keller is leaning on come from Tenstorrent's own TT-Deploy benchmarks, not independent testing. At the company's recent demonstration, 16 Galaxy boxes wired together, 512 BlackHole chips in total, ran DeepSeek-671B inference at up to 350 tokens per second per user at a batch size of 32, according to EE Times. Cerebras, the wafer-scale AI-chip company that is now publicly traded, has published a CS-3 figure of 981 tokens per second on Kimi K2.6 1T, a trillion-parameter model in roughly the same size class as DeepSeek-671B. The two numbers are not an apples-to-apples comparison: the model, the batch size, and the workload all differ. But the comparison is the only one Keller is offering, and it is the one the industry will make (EE Times on Cerebras's IPO).
"Fraction of the hardware cost" is unverified in the source material. Cerebras has not publicly responded to Keller's framing. Readers should treat both the throughput numbers and the cost claim as vendor-reported, not third-party validated.
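For readers who want to sanity-check the demo figures, the short Python sketch below works through the arithmetic they imply. It uses only the vendor-reported numbers quoted above and makes one labeled assumption: that a batch size of 32 means 32 concurrent users each receiving the per-user rate. Because the 350 figure is an "up to" number, every derived value is an upper bound, not a measurement.

```python
# Vendor-reported figures from the Tenstorrent demo, as quoted by EE Times.
BOXES = 16                      # Galaxy boxes wired together
CHIPS_TOTAL = 512               # BlackHole chips across all boxes
TOKENS_PER_SEC_PER_USER = 350   # "up to" figure, so an upper bound
BATCH_SIZE = 32                 # ASSUMPTION: equals the number of concurrent users

# Chips in each box.
chips_per_box = CHIPS_TOTAL // BOXES

# Aggregate throughput if every user in the batch gets the peak per-user rate.
aggregate_tps = TOKENS_PER_SEC_PER_USER * BATCH_SIZE

# Throughput normalized by hardware, useful when comparing against other systems.
tps_per_chip = aggregate_tps / CHIPS_TOTAL
tps_per_box = aggregate_tps / BOXES

print(f"chips per box:        {chips_per_box}")
print(f"aggregate tokens/sec: {aggregate_tps}")
print(f"tokens/sec per chip:  {tps_per_chip:.3f}")
print(f"tokens/sec per box:   {tps_per_box:.1f}")
```

Normalizing per chip or per box is one way to compare systems of different sizes, but it does not fix the mismatch in model and workload between the two vendors' published numbers.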
Underneath the benchmarks sits an architectural thesis. Keller leans on Rent's Rule, an IBM observation from the 1960s that the number of external connections on a chip must scale with the number of components inside it. Modern AI systems, he argues, are I/O-starved: the silicon can do the math, but the wires between chips cannot move the data fast enough. Galaxy's 56 Ethernet ports per box, about seven times the external connectivity of a typical GPU server, are an attempt to buy headroom on that bottleneck (Keller interview with EE Times).
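Rent's Rule is usually written as a power law, T = t * g^p, where T is the number of external terminals, g is the number of internal logic blocks, t is the average number of terminals per block, and p is the Rent exponent. The Python sketch below shows how the required connection count grows as a design gets larger. The values chosen for t and p are illustrative assumptions, not figures from Keller or Tenstorrent; only the 56 and 8 port counts come from the article.

```python
def rent_terminals(blocks: float, t: float, p: float) -> float:
    """Estimate external terminals with Rent's Rule: T = t * blocks**p."""
    if blocks <= 0 or t <= 0 or not (0 < p <= 1):
        raise ValueError("expected blocks > 0, t > 0 and 0 < p <= 1")
    return t * blocks ** p


# ASSUMPTIONS: illustrative parameters only, not measured on any real chip.
T_AVG = 4.0   # hypothetical average terminals per logic block
P_EXP = 0.6   # hypothetical Rent exponent

small = rent_terminals(1_000, T_AVG, P_EXP)
large = rent_terminals(2_000, T_AVG, P_EXP)

# Doubling the block count multiplies the terminal demand by 2**p.
growth = large / small

# Port counts quoted in the article: 56 per Galaxy box vs. roughly 8 per GPU server.
port_ratio = 56 / 8

print(f"terminals at 1,000 blocks: {small:.1f}")
print(f"terminals at 2,000 blocks: {large:.1f}")
print(f"growth factor when blocks double: {growth:.3f}")
print(f"Galaxy vs. typical GPU server port ratio: {port_ratio:.1f}x")
```

The point of the sketch is the shape of the curve: connection demand keeps rising as component counts rise, which is the basis of the argument that systems short on I/O will stall regardless of how much arithmetic the silicon can do.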
The timing is not accidental. Cerebras's IPO revived public-market interest in AI-chip challengers to Nvidia, putting a multibillion-dollar valuation on the bet that the GPU's lock on AI inference can be broken. Tenstorrent is part of the same competitive set, but is pitching a structurally different answer. Rather than wafer-scale integration or exotic memory pools, it is betting that commodity Ethernet and high port counts can carry inference at production scale (EE Times on Cerebras).
Keller's framing, that AI "still obeys the old laws of compute," pushes back against the narrative that newer architectures have rewritten inference economics. The argument is that the 2024–2025 infrastructure buildout over-indexed on raw FLOPS and under-indexed on bandwidth, and that the bill will come due when customers start measuring dollars per million tokens rather than peak FLOPS. That is a testable claim. It will be falsified, or vindicated, by production deployments that publish independent cost-per-token numbers rather than controlled demos.
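The dollars-per-million-tokens metric is simple to compute once a deployment publishes its running cost and sustained throughput. The Python sketch below shows one way to do it. The function name and both hourly cost figures are hypothetical placeholders, since neither vendor's pricing appears in the source; the throughput inputs are the two vendor-reported figures discussed earlier, used here only to show the mechanics.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_sec: float) -> float:
    """Dollars per one million generated tokens at a sustained throughput."""
    if hourly_cost_usd < 0 or tokens_per_sec <= 0:
        raise ValueError("expected hourly_cost_usd >= 0 and tokens_per_sec > 0")
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# ASSUMPTION: both hourly costs are made-up placeholders, not vendor pricing.
HYPOTHETICAL_HOURLY_COST_A = 100.0   # USD per hour for system A
HYPOTHETICAL_HOURLY_COST_B = 40.0    # USD per hour for system B

# Throughput inputs: vendor-reported numbers on different models and workloads,
# so the resulting comparison is illustrative only.
tps_a = 350 * 32   # per-user rate times batch size, treated as an upper bound
tps_b = 981        # single published tokens-per-second figure

cost_a = cost_per_million_tokens(HYPOTHETICAL_HOURLY_COST_A, tps_a)
cost_b = cost_per_million_tokens(HYPOTHETICAL_HOURLY_COST_B, tps_b)

print(f"system A: ${cost_a:.2f} per million tokens")
print(f"system B: ${cost_b:.2f} per million tokens")
```

A system with lower peak throughput can still win on this metric if its hourly cost is low enough, which is why independent cost figures matter more to the argument than any single throughput number.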
What to watch next: whether a hyperscaler, neocloud, or large enterprise publishes a third-party benchmark on Galaxy hardware; whether Cerebras or any other inference-accelerator vendor publishes head-to-head numbers on the same model and batch setting; and whether the wave of ex-Tenstorrent engineers spinning out their own ventures, including a new cloud provider and AI lab in Japan, turns into a competitive liability or an ecosystem tailwind (EE Times on the Japan spin-out).