Espresso claims 4.7x faster AI on Apple's Neural Engine, bypassing CoreML
Espresso runs small transformer models directly on Apple's AI chip via undocumented interfaces, beating Apple's official AI framework on a single self-run benchmark.
A Swift library posted to GitHub this week claims to train and run transformer models directly on Apple's Neural Engine, the dedicated AI silicon inside Macs, iPhones, and iPads. The author reports a roughly 4.76x speedup over Apple's own CoreML framework on a small six-layer demo model, citing 1.08 ms/token versus 5.09 ms/token; those two latencies on their own work out to about 4.7x. The numbers come from the project's own benchmark on its own code, not from independent reproduction.
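To put the two quoted latencies in more familiar terms, the short Python calculation below derives the speedup ratio and the equivalent tokens-per-second throughput. It uses only the two ms/token figures given above and says nothing about how the benchmark itself was run; the variable names are illustrative.

```python
# Per-token latencies quoted in the article (milliseconds).
espresso_ms_per_token = 1.08
coreml_ms_per_token = 5.09

# Speedup is the slower latency divided by the faster one.
speedup = coreml_ms_per_token / espresso_ms_per_token

# Throughput in tokens per second is 1000 ms divided by ms per token.
espresso_tokens_per_s = 1000.0 / espresso_ms_per_token
coreml_tokens_per_s = 1000.0 / coreml_ms_per_token

print(f"speedup:  {speedup:.2f}x")                          # 4.71x
print(f"Espresso: {espresso_tokens_per_s:.0f} tokens/s")    # 926 tokens/s
print(f"CoreML:   {coreml_tokens_per_s:.0f} tokens/s")      # 196 tokens/s
```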
The project, called Espresso, reaches that result by sidestepping Apple's public AI tooling entirely. Instead of going through CoreML, Espresso compiles transformer operations straight to the Neural Engine using two private Apple interfaces called _ANEClient and _ANEInMemoryModel. Those names are not in any public Apple SDK; the project recovered them by reverse-engineering how Apple's own frameworks talk to the Neural Engine. Because the APIs are private, they can break in any macOS or iOS release, and Apple does not support apps built on them, which effectively rules out App Store distribution.
The package ships with a full training loop, not just inference. EspressoTrain handles forward and backward passes, gradient accumulation, and the Adam optimizer, which is the load-bearing novelty compared with earlier on-device AI work that only ran pretrained models. The trick that makes the speedup possible is packing six transformer layers into two ANE dispatches via fused three-layer kernels, then streaming weights in and out through zero-copy IOSurface buffers with NEON-vectorized reads and vDSP for the final argmax. There is no per-token recompilation, which is what CoreML pays for every time a model changes shape.
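For readers unfamiliar with the training-loop pieces named above, here is a minimal Python sketch of gradient accumulation feeding an Adam update. It is not Espresso's code and makes no claim about EspressoTrain's actual API: the function names `adam_step` and `train`, the one-parameter model, and the hyperparameters are all assumptions chosen to keep the example self-contained.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (hypothetical helper)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def train(data, accum_steps=4, epochs=300):
    """Fit y = w * x, applying Adam once per `accum_steps` samples."""
    w, m, v, t = 0.0, 0.0, 0.0, 0
    grad_sum, seen = 0.0, 0
    for _ in range(epochs):
        for x, y in data:
            pred = w * x                        # forward pass
            grad_sum += 2.0 * (pred - y) * x    # backward pass: d/dw of (pred - y)^2
            seen += 1
            if seen == accum_steps:
                # Accumulation window is full: average the gradients and update.
                t += 1
                w, m, v = adam_step(w, grad_sum / accum_steps, m, v, t)
                grad_sum, seen = 0.0, 0
    return w

# Toy data generated from y = 3x, so the learned weight should approach 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = train(data)
print(f"learned w = {w:.3f}")
```

Accumulating gradients over several samples before each optimizer step lets a loop use a larger effective batch than it processes at once, which is the usual reason to pair accumulation with memory-constrained hardware.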
The codebase is small and modern: Swift 6.2 with move-only tensors, strict concurrency, typed throws, and zero external dependencies. The module structure is public, so anyone can verify the training loop and the kernel fusion claims by reading source rather than trusting the README. That matters because the headline number sits on a single developer's self-benchmark on a single small model.
Skepticism is already showing up on the Hacker News thread for the project, which had 13 points and a single comment at the time of writing. Commenter losteric called it "a vibe coded project with failing builds and no associated interesting development story or example applications," a useful reminder that a real training loop in source is not the same as a working pipeline that someone other than the author has shipped. An aggregator roundup framed the discovery in slightly larger terms, but it is discovery context, not independent verification.
For years, Apple's Neural Engine has been treated as an inference-only accelerator, closed to anyone who wanted to push gradients through it. Espresso demonstrates a path, on a six-layer toy model, that gets both forward and backward passes running on that hardware without going through CoreML at all. The next test is whether the build survives a major macOS release with its private-API hooks intact, and whether anyone other than the author reproduces the speedup on a larger model.