Arm and Google Built the Software. The Silicon Is the Problem.
The Compiler Now Does the Hardware Work
Arm and Google solved the last hard problem in on-device AI — and they did it with a compiler trick.
A new integration between Arm's Scalable Matrix Extension 2 (SME2) instruction set and Google's AI Edge software stack, announced May 14 on the Google Developers Blog, delivers up to 5x faster inference for matrix-heavy AI workloads running directly on a device's CPU. The twist is not the speed. The twist is what did not have to change: application code. KleidiAI, Arm's open-source library of optimized micro-kernels, is embedded directly into XNNPACK, which LiteRT (Google's on-device inference runtime) selects automatically at runtime. Developers get hardware-level AI acceleration the same way they get SIMD vectorization: by compiling for the right architecture.
In benchmark results published alongside the announcement, Stable Audio Open Small, a roughly 1-billion-parameter diffusion model for audio generation, ran 2.3x faster on an Apple MacBook with an M4 chip (10 seconds down to 4.3 seconds) and 2.1x faster on an Arm SME2-equipped Android device (14 seconds down to 6.6 seconds). The DiT transformer submodule saw the most dramatic gains: dynamic INT8 quantization delivered a 3x throughput improvement and a 4x reduction in memory footprint. All figures are from Google and Arm's engineering teams on controlled demo hardware.
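The 4x memory figure is what simple arithmetic predicts if the baseline weights are 32-bit floats, which is an assumption here since the baseline precision is not stated above: a float occupies four bytes and an INT8 value occupies one. The Python sketch below illustrates the idea with symmetric per-tensor quantization on a handful of made-up weights. It is a minimal illustration of the technique, not the algorithm the AI Edge Quantizer uses, and the function names are hypothetical.

```python
from array import array


def quantize_int8(values):
    """Symmetric per-tensor quantization: one scale maps floats onto [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes."""
    return [q * scale for q in quantized]


# Made-up example weights stored as 32-bit floats (assumed baseline precision).
weights = array("f", [0.5, -1.25, 0.03, 2.0, -0.75, 1.1])
q_weights, scale = quantize_int8(weights)

fp32_bytes = len(weights) * weights.itemsize
int8_bytes = len(q_weights) * q_weights.itemsize
ratio = fp32_bytes / int8_bytes

# Worst-case rounding error is half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, dequantize(q_weights, scale)))

print(f"memory reduction: {ratio:.1f}x, max error: {max_error:.5f}")
```

The storage saving is fixed by the data types, while the accuracy cost depends on the values being quantized, which is why tooling that identifies quantization-safe layers matters.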
The practical meaning: a mobile developer who wants to run audio generation, image segmentation, or a small language model on an Android device no longer needs to write NPU-specific code, integrate a vendor SDK, or understand the memory architecture of a Qualcomm Hexagon or MediaTek NPU. The CPU is now a competitive inference target.
The stack that makes this possible (LiteRT, XNNPACK, KleidiAI, and SME2) has been in development for years. SME2, introduced in the Armv9 architecture, adds a dedicated matrix-compute unit to the CPU cluster. KleidiAI exposes those instructions as optimized kernels for common AI operations such as GeMM (general matrix multiplication) and iGeMM (its indirect variant). XNNPACK selects the right kernel at runtime based on the hardware it finds. LiteRT orchestrates the whole pipeline and handles model conversion from PyTorch through its Torch interface.
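The runtime selection step is the part that keeps application code unchanged, and the general pattern is easy to picture. The Python sketch below is an illustration only, not XNNPACK's actual logic: it assumes CPU features are reported as space-separated tokens on a "Features" line, in the style of /proc/cpuinfo on Arm Linux, and the kernel names and preference order are hypothetical.

```python
# Preference order from most to least specialized. All kernel names are hypothetical.
KERNEL_PREFERENCE = [
    ("sme2", "gemm_sme2"),
    ("i8mm", "gemm_i8mm"),
    ("asimd", "gemm_neon"),
]
FALLBACK_KERNEL = "gemm_reference"


def parse_cpu_features(cpuinfo_text):
    """Collect feature tokens from every 'Features' line in cpuinfo-style text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            features.update(value.split())
    return features


def select_kernel(features):
    """Return the first kernel whose required CPU feature is present."""
    for feature, kernel in KERNEL_PREFERENCE:
        if feature in features:
            return kernel
    return FALLBACK_KERNEL


# Made-up sample input; a real device would report a longer list.
SAMPLE_CPUINFO = """processor\t: 0
Features\t: fp asimd aes crc32 i8mm sme sme2
"""

chosen = select_kernel(parse_cpu_features(SAMPLE_CPUINFO))
print(chosen)
```

The design point is that the decision lives inside the library. The same application binary picks up the SME2 path on hardware that has it and falls back cleanly on hardware that does not.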
A PyTorch blog post published in January, co-authored by Arm engineers, quantified the gains on a different model, SqueezeSAM, the interactive image segmentation model behind Instagram's cutouts feature, running on a vivo X300 Android flagship. With SME2 enabled, FP16 inference improved 3.9x (from 1,163 ms to 298 ms on a single core) and INT8 improved 1.83x (from 556 ms to 304 ms). The post describes freed CPU headroom as the direct unlock: applications can now run segmentation and enhancement in parallel without sacrificing UI responsiveness, or extend cutout from still images to live video with subject tracking across frames.
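The headroom claim follows directly from the latencies quoted above. The short Python check below uses only those four figures. The helper name is hypothetical, and the "time freed" value is simply the share of the original latency no longer spent on inference, not a measurement of CPU utilization.

```python
def summarize(before_ms, after_ms):
    """Return (speedup factor, fraction of the original latency that was freed)."""
    speedup = before_ms / after_ms
    freed = 1.0 - after_ms / before_ms
    return speedup, freed


# Single-core SqueezeSAM latencies quoted above, in milliseconds.
fp16_speedup, fp16_freed = summarize(1163, 298)
int8_speedup, int8_freed = summarize(556, 304)

print(f"FP16: {fp16_speedup:.2f}x faster, {fp16_freed:.0%} of the time freed")
print(f"INT8: {int8_speedup:.2f}x faster, {int8_freed:.0%} of the time freed")
```

One detail the ratios hide: with SME2 enabled, the FP16 path (298 ms) ends up slightly faster than the INT8 path (304 ms), so on this model the choice of precision matters less for latency once the matrix unit is in use.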
Oli Gaymond, Google's head of AI/ML product for Android, put it plainly in a joint Arm-Google case study: for the first time, the company can run generative AI capabilities across a wide range of devices in the ecosystem.
The NPU people are paying attention.
The implication for Qualcomm, MediaTek, and Apple is uncomfortable. Neural processing units were the answer to a specific problem: generic CPUs were too slow for real-time AI inference. SME2 does not make NPUs irrelevant — high-throughput dedicated silicon still leads on the most demanding workloads — but it narrows the gap for the large class of tasks that fall in the mid-range: audio generation, language model inference under 3B parameters, real-time image processing. If the CPU can handle those at acceptable latency with zero developer friction, the NPU upgrade story weakens for the next device cycle.
There is a fragmentation problem underneath the announcement. Qualcomm's Snapdragon 8 Elite 2, the chip expected to power the next generation of Android flagships, supports only SME1, the predecessor to SME2. That means the performance gains described in Google's announcement apply to a subset of current and upcoming Android devices, not the full ecosystem. Apple has not shipped SME2 in iPhones; the M4 chip in current MacBooks lacks it, despite outperforming the SME2 Android device in absolute audio generation time in Google's own benchmarks.
All of the published performance numbers come from Google and Arm engineering teams running controlled benchmarks on identified demo hardware. No independent third-party benchmarks on production devices were available at publication.
The democratization argument is real, even if the evangelism is premature. KleidiAI is open source. LiteRT-Torch handles model conversion from standard PyTorch. Model Explorer provides a visual graph of the model to identify quantization-safe layers. The AI Edge Quantizer handles compression. The complete pipeline — convert, optimize, deploy — is documented and available today. What once required a systems engineering team and vendor-specific SDK relationships now fits in a README.
Whether the stack delivers on that promise in production apps, on real devices, across the Android fragmentation map, is the next question. The compiler solved the problem in the demo. Now it has to work in the field.