Google Meet is now running a model 25 times larger than its predecessor on the specialized AI chip inside Android phones, without draining the battery over a 20-minute video call. Epic Games is streaming real-time facial animation at 30 frames per second through the same hardware. Speech recognition company Argmax measures its models running more than twice as fast on that chip as on the GPU. The unlock is not new silicon. It is a compilation step, ahead-of-time (AOT) compilation: models are optimized before they ever reach a user's device, instead of the app freezing for three seconds on first launch while the phone works out how to run them.
The change traces to LiteRT, Google's framework for on-device AI and the successor to TensorFlow Lite. On April 23, Google published its first hard production numbers for the updated runtime, and they are significant: Meet's Ultra-HD segmentation model, Epic's MetaHuman pipeline, and Argmax's commercial SDK all running through the neural processing unit, or NPU, the dedicated AI accelerator built into most modern phones. The three cases matter because they are not benchmarks. Meet is used by hundreds of millions of people. MetaHuman is a commercial product. Argmax sells its SDK to healthcare company Heidi Health for extended live transcription sessions, where battery life is a first-order constraint.
Before LiteRT, the NPU was present in phones but largely inaccessible to developers. The TensorFlow Lite runtime could not pre-compile models for NPU targets: every cold launch forced expensive on-device compilation, and the first-use experience was so bad that most developers sent the work to the cloud instead. LiteRT added AOT compilation support, and that is what allowed Argmax to ship a real commercial SDK rather than a research project. Argmax described the old limitation directly: the TensorFlow Lite NPU runtime did not support AOT compilation, so apps had to run a costly just-in-time compilation step before users could do anything. This was a showstopper.
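To make that cost concrete, here is a minimal sketch of how the first-launch penalty can be measured, written against the CompiledModel and Accelerator classes from Google's published LiteRT Next Kotlin samples. The package name, exact signatures, and the "segmenter.tflite" asset are assumptions (the API is still in alpha), so treat it as an illustration of the JIT-versus-AOT difference rather than a verified reference: on a just-in-time path the first load absorbs the vendor-specific compile, while an ahead-of-time-compiled model loads at roughly the same speed every time.

```kotlin
// Illustrative only: names follow Google's LiteRT Next Kotlin samples, but the
// API is alpha and signatures may differ; "segmenter.tflite" is a hypothetical
// model asset bundled with the app.
import android.content.Context
import android.util.Log
import com.google.ai.edge.litert.Accelerator
import com.google.ai.edge.litert.CompiledModel
import kotlin.system.measureTimeMillis

fun profileModelLoad(context: Context) {
    val options = CompiledModel.Options(Accelerator.NPU)

    // First load: on a JIT-only runtime, this call absorbs the on-device,
    // vendor-specific compilation, the multi-second stall users felt.
    val coldMs = measureTimeMillis {
        CompiledModel.create(context.assets, "segmenter.tflite", options)
    }

    // Second load: the compiled artifact is reused (or, with AOT, was shipped
    // pre-compiled), so this reflects steady-state startup cost.
    val warmMs = measureTimeMillis {
        CompiledModel.create(context.assets, "segmenter.tflite", options)
    }

    Log.d("LiteRT", "cold load: $coldMs ms, warm load: $warmMs ms")
}
```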
What changed is that LiteRT consolidated NPU support across the three chip vendors that dominate the Android market: Qualcomm Snapdragon, Google Tensor, and MediaTek. Together they account for the overwhelming majority of NPU-capable Android devices. Before, writing for one NPU meant rewriting for each vendor's proprietary SDK, and the first question any on-device AI developer had to answer (which NPU does this phone have?) routinely consumed weeks of engineering time. LiteRT reduced that to a single API call.
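In the Kotlin samples Google has published for LiteRT Next, that single call looks roughly like the sketch below. The vendor-specific NPU stack is resolved behind one accelerator option rather than through Qualcomm's, Google's, or MediaTek's own SDKs. The class and method names (CompiledModel, Accelerator, createInputBuffers, run) follow those samples, but the API is still alpha and the model asset name is hypothetical, so treat the exact signatures as assumptions.

```kotlin
// Illustrative sketch of accelerator selection with LiteRT Next's Kotlin API.
// Names follow Google's published samples; the alpha API may differ, and
// "segmenter.tflite" is a hypothetical model asset.
import android.content.Context
import com.google.ai.edge.litert.Accelerator
import com.google.ai.edge.litert.CompiledModel

fun segmentFrame(context: Context, pixels: FloatArray): FloatArray {
    // One option selects the accelerator; LiteRT dispatches to whichever
    // vendor NPU stack (Qualcomm, Google Tensor, MediaTek) the device has.
    val model = CompiledModel.create(
        context.assets,
        "segmenter.tflite",
        CompiledModel.Options(Accelerator.NPU),
    )

    // Buffers are created from the model's own input and output signatures.
    val inputs = model.createInputBuffers()
    val outputs = model.createOutputBuffers()

    inputs[0].writeFloat(pixels)
    model.run(inputs, outputs)
    return outputs[0].readFloat()
}
```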
The caveat is real and material. Google's developer blog calls the LiteRT stack production-ready, but the GitHub repository still labels the runtime as version two alpha. The company has not shipped a formal release. The device figure Google frequently cites — 2.7 billion — traces to TensorFlow Lite before the January rebrand to LiteRT, and AI CERTs noted the company has not published a post-rebrand device count. Teams considering LiteRT for anything beyond experimentation are operating in a gap between marketing language and source code maturity.
Google's longer bet is that this solves NPU fragmentation the way CUDA solved GPU fragmentation in the late 2000s: by abstracting hardware specifics behind a consistent interface, it lets a generation of developers treat the accelerator as infrastructure rather than a special project. CUDA did not make GPUs faster. It made them usable without a hardware engineering degree. LiteRT is attempting the same for neural processing.
The implication, if it holds, is that the category of AI work that can run locally on a phone expands significantly. Not entirely (large models still live in data centers), but inference moves partially off the cloud, which puts pressure on every AI company whose unit economics depend on per-call compute costs. The question is no longer whether the NPU is powerful enough. It is whether the software stack is ready to use it. With LiteRT, Google is betting the answer is finally yes.