Google Steps Back From TensorFlow to Bet on LiteRT

Google has moved TensorFlow Core to maintenance mode, effectively ending active development on the framework that has powered deep learning since its 2015 launch, while pivoting to LiteRT (rebranded from TFLite) as its strategic edge inference runtime. The move is an implicit acknowledgment that new models are predominantly trained in PyTorch and JAX, with TensorFlow 2.21 including conversion paths to accommodate this reality. Google's new ML Drift GPU engine introduces tensor virtualization to eliminate memory copies across device memory hierarchies, and NPU benchmarks on Snapdragon 8 Elite Gen 5 show up to 100x CPU and 10x GPU speedups for on-device inference workloads.
Google shipped TensorFlow 2.21 this week, and the changelog is beside the point. The real news is in the maintenance notes: TensorFlow Core is moving to maintenance mode. Security fixes, dependency updates, community contributions. No new features. The framework that launched in 2015 and powered the machine learning boom is effectively done with active development. At the same time, Google is rebranding TFLite as LiteRT, introducing a new GPU engine called ML Drift, and publishing NPU benchmark numbers that suggest on-device inference is crossing a real threshold. The rebrand is the frame. The retreat from TensorFlow Core is the story.
TFLite — TensorFlow Lite — launched in 2017 and quietly became one of the most deployed ML runtimes in history. It runs in over 100,000 apps on 2.7 billion devices, from Android phones to Raspberry Pis to Chrome. The rename to LiteRT (Lite Runtime) is more than vanity. Google is repositioning the runtime as a universal edge inference layer that isn't tied to TensorFlow's training ecosystem. The PyTorch and JAX conversion paths built into TensorFlow 2.21 are the admission: Google knows most new models aren't being trained in TensorFlow.
The headline hardware claim is on Qualcomm's Snapdragon 8 Elite Gen 5. Google's benchmarks show NPU acceleration reaching up to 100x faster than CPU and 10x faster than GPU for some workloads. Of 72 canonical ML models tested, 64 fully delegated to the NPU, and more than 56 ran in under 5 milliseconds, compared with 13 on CPU alone. FastVLM-0.5B, a compact vision-language model, hit 11,000 tokens per second for prefill on the same NPU.
MediaTek's NPU results tell the same story from a different vendor. Gemma models ran up to 12x faster than CPU and 10x faster than GPU with LiteRT. On Samsung Galaxy S25 Ultra, LiteRT running Gemma 3 1B outperformed llama.cpp on both CPU and GPU for prefill and decode.
The GPU story has a technical layer worth noting. The new ML Drift engine isn't a wrapper: it introduces tensor virtualization, a memory management technique that lets the GPU map tensors across device memory hierarchies rather than copying them into contiguous buffers. The arXiv preprint from Google researchers (arXiv:2505.00232, Juhyun Lee et al., presented at the CVPR 2025 EDGE Workshop) describes the architecture in detail. ML Drift supports OpenCL, OpenGL, Metal, and WebGPU across platforms. The claimed 1.4x average GPU speedup is self-reported and modest, and that restraint reads as credibility: it's the kind of honest benchmark you publish when you're trying to get developers to port real workloads.
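The core idea is easy to sketch in miniature. The analogue below is not ML Drift's implementation (which manages GPU memory, per the preprint); it is a pure-Python illustration of "map, don't copy" using views over one backing buffer, and the names `arena` and `tensor_view` are invented for the example.

```python
# Illustrative sketch only: tensor virtualization in ML Drift is a GPU
# memory-management technique; this pure-Python analogue shows the
# "map, don't copy" idea using zero-copy views over a single allocation.
import array

# One backing allocation, standing in for a device memory arena.
arena = array.array("f", range(12))

def tensor_view(buf, offset, length):
    """Return a zero-copy view into the arena instead of a copied sub-buffer."""
    mv = memoryview(buf)
    return mv[offset:offset + length]

a = tensor_view(arena, 0, 4)   # "tensor" A: elements 0..3
b = tensor_view(arena, 4, 8)   # "tensor" B: elements 4..11

# Writing through a view mutates the arena directly: no copy was ever made.
a[0] = 99.0
print(arena[0])  # 99.0
```

The point of the exercise: because `a` and `b` are mappings over the same allocation, handing a "tensor" to a new consumer costs an offset and a length, not a memcpy.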
The production deployment piece is easy to miss. LiteRT-LM, the LLM-specific orchestration layer built on top of LiteRT, is already powering Gemini Nano in Chrome and on Pixel Watch. This is not a future roadmap. It's shipped infrastructure that Google is now willing to name publicly. The new CompiledModel API is the recommended path forward; the Interpreter API is maintained for backward compatibility but is no longer the target.
What Google is really betting on is inference portability. Write once for LiteRT, run on Qualcomm NPUs, MediaTek NPUs, Metal on Apple silicon, WebGPU in browsers. The alternative — maintaining separate inference paths per hardware target — is the engineering tax that has kept on-device LLM deployment impractical for most teams. The Snapdragon and MediaTek numbers are the evidence that the tax is worth paying.
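The portability bet implies a delegation pattern: try the most capable backend first and fall back when it is missing or rejects the workload. The sketch below is an illustrative assumption, not the LiteRT API; the backend names, the `available` set, and `compile_for_best_backend` are all stand-ins for the example.

```python
# Hedged sketch of the "write once, delegate per device" pattern described
# above. Not real LiteRT code: backend names and availability are invented.
def compile_for_best_backend(model, backends=("npu", "gpu", "cpu")):
    """Try accelerators in preference order; fall back when one is unavailable."""
    available = {"gpu", "cpu"}  # pretend this device has no NPU driver
    for backend in backends:
        if backend in available:
            return f"{model} compiled for {backend}"
        # In a real runtime, a missing driver or unsupported op triggers fallback.
    raise RuntimeError("no usable backend")

print(compile_for_best_backend("gemma-3-1b"))  # gemma-3-1b compiled for gpu
```

The value is in who owns the fallback logic: the runtime, once, rather than every application team, per hardware target.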
TensorFlow 2.21's other notable addition is expanded INT4 and INT2 quantization support. Lower-precision inference is the lever that turns a model that runs into a model that runs fast on edge hardware. tfl.cast, tfl.slice, and tfl.fully_connected operators all gain lower-precision support in this release.
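To see why lower precision is the lever, here is a minimal symmetric INT4 quantizer in pure Python. It is a sketch of the general technique, not the scheme TensorFlow 2.21 actually ships (real deployments typically use per-channel scales and packed 4-bit storage): with a single scale, every weight lands on one of 16 integer levels yet reconstructs to within half a quantization step.

```python
# Illustrative symmetric INT4 quantization; the exact schemes in TensorFlow
# 2.21 (per-channel scales, packing layout) may differ from this sketch.
def quantize_int4(values):
    """Map floats to integers in [-8, 7] using a single symmetric scale."""
    scale = max(abs(v) for v in values) / 7.0 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [x * scale for x in q]

weights = [0.9, -0.35, 0.05, -0.7]
q, scale = quantize_int4(weights)
approx = dequantize_int4(q, scale)
# Each reconstructed weight is within half a quantization step of the original.
assert all(abs(a - w) <= scale / 2 + 1e-9 for a, w in zip(approx, weights))
```

Storing two of those 4-bit integers per byte is an 8x size reduction over float32, which is the difference between a model that fits in an NPU's fast memory and one that doesn't.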
The TensorFlow Core shift is the quiet confession. Google is moving the framework to maintenance mode, focusing exclusively on security patches, dependency updates, and community contributions while shifting all new on-device AI development to LiteRT. The framework that defined the 2018–2022 ML ecosystem is effectively in the maintenance queue alongside Python 2. Google is betting that LiteRT, already proven in production in Chrome, is the future of on-device AI, and that betting on anything else at this point would be a distraction.
The irony of an AI agent infrastructure reporter covering Google shifting away from a machine learning framework is not lost here. But infrastructure moves matter precisely because nobody writes a press release about them.