Tenstorrent Unveils Next-Gen Servers for Fast Tokens, No Disaggregation Needed
The anti-disaggregation case for AI inference hardware just got its first real-world test.
Tenstorrent's Galaxy Blackhole servers — 32 chips per 6U rack-mount chassis, 23 petaflops of FP8 compute, starting at $110,000 per server — entered general availability Tuesday, with independent confirmation from The Register. The company's argument: the industry went in the wrong direction by splitting AI inference across multiple specialized chips. Tenstorrent uses the same chips for all stages of AI work, and has now shipped the hardware to prove it.
The dominant approach, called disaggregation, uses different chips for different stages of AI computation. Nvidia pairs its GPUs with Groq chips for fast token generation. Google's TPUs and Amazon's Trainium chips take the same approach. The theory is sound: different AI tasks benefit from different chip designs. But Tenstorrent's rejoinder is simpler: the theory may be right and the problem still fake, because hardware specialized for today's models stops paying off the moment the models change.
"The world is talking about specialized, specialized, specialized, and that should be terrifying them because as models change, that specialized hardware is not going to work," Jim Keller, Tenstorrent's CEO and a veteran chip designer best known for his work at AMD and Apple, told EE Times. "We build a big cluster of computers, and you can run LLM prefill and decode, video generation, agentic AI... we are anti-specialized."
On a four-node Galaxy Supercluster — four servers linked together, totaling 128 Blackhole chips — the company says it processed a 100,000-token prompt from the open-weight DeepSeek V3 model in under four seconds. For video generation, the same cluster delivered 720p footage faster than real-time, according to Tenstorrent. The company's highest-throughput inference mode, called Blitz Mode, supports context lengths up to 128,000 tokens and batch sizes between 8 and 64 simultaneous requests.
Sixteen Galaxy Blackhole servers are already installed at an Equinix data center in Ashburn, Virginia, the company said. Customers include Cirrascale, a cloud provider; AI&, a company founded by a former Tenstorrent executive in Japan; and Turiam, an image-as-a-service provider deploying up to 32 servers in India. The company expects to share more details at its TT-Deploy event on May 1.
The Register independently confirmed the hardware specifications and noted a gap in Tenstorrent's benchmark disclosures: the company has not published the batch sizes used in its token throughput tests. Batch size (the number of inference requests a system handles simultaneously) dramatically affects measured throughput, since batching raises the aggregate tokens-per-second figure even as each individual request may run slower. Without that number, neither the 255 tokens per second EE Times measured nor the 350 tokens per second the company claims for Blitz Mode is easy to evaluate against results from other systems. The Register also noted the Blackhole chip was quietly downgraded before launch.
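To see why the omission matters, a toy calculation helps (all batch sizes below are hypothetical, not from Tenstorrent's disclosures; only the headline throughput figures come from the article): the same aggregate tokens-per-second number means very different things to an individual user depending on how many requests share the machine.

```python
# Toy illustration of aggregate vs. per-request throughput.
# Batch sizes here are hypothetical assumptions for illustration;
# the 255 and 350 tok/s figures are the headline numbers from the article.

def per_request_rate(aggregate_tps: float, batch_size: int) -> float:
    """Tokens per second seen by one request, assuming requests share evenly."""
    return aggregate_tps / batch_size

# Two hypothetical reporting scenarios for the same class of hardware:
system_a = {"aggregate_tps": 255.0, "batch_size": 1}   # assume a single request
system_b = {"aggregate_tps": 350.0, "batch_size": 32}  # assume a large batch

a = per_request_rate(**system_a)  # 255.0 tok/s for the one user
b = per_request_rate(**system_b)  # ~10.9 tok/s per user

# The "faster" headline number can be far slower per user, which is why
# throughput comparisons are hard to interpret without the batch size.
print(f"A: {a:.1f} tok/s per request, B: {b:.1f} tok/s per request")
```

The sketch is deliberately simplistic (real systems don't share capacity perfectly evenly across a batch), but it captures the ambiguity The Register flagged: a bigger headline number at an undisclosed batch size tells you little about single-user speed.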
Tenstorrent also claims roughly 80 to 90 percent of models downloaded from Hugging Face, a popular model repository, will run on its hardware without modification. The Register called this a significant claim worth testing directly. Running a model and running a model well are different things, and Tenstorrent's own benchmarks acknowledge the distinction.
The real question is whether the anti-disaggregation thesis survives contact with production workloads at scale. The industry went all-in on specialized chips because the theory was compelling and the vendors were loud. Tenstorrent is betting the physics of general-purpose compute are sufficient, and that the market's faith in disaggregation is vendor momentum dressed up as engineering necessity.
The TT-Deploy event on May 1 will tell us how much more there is to show.