For years, running AI models on a personal laptop meant settling for a curious toy: useful for demos, not for real engineering work. That changed for Vicki Boykis one afternoon this month, on a four-year-old machine most people would call ordinary.
Boykis, an ML practitioner who writes regularly about applied machine learning, fed a tangled Python notebook to Google's gemma-4-26b-a4b running locally through LM Studio on her 2022 M2 Mac. The 26-billion-parameter model, with only about 4 billion parameters active at any time, refactored the notebook into a clean five- or six-module repository, linted the generics, and wrote unit tests. She did not have to reach for a frontier API to fix any of it. In her own framing, the test came down to one question: did she still need to double-check the output against an API model? For this task, the answer was no.
That moment is the news. Boykis pegs her local setup at roughly 75% of frontier-model accuracy and speed for her real workflow, writing on her blog that the "vibe metric" she uses to judge local output shifted twice in the past year: first with the release of OpenAI's GPT-OSS, and decisively with Google's Gemma 4 family. Gemma 4 is Google's latest open-weights model line, which means anyone can download the weights and run them on their own hardware rather than calling a remote API. All 26 billion parameters of the variant she uses still have to fit in memory, which is where the laptop's 64GB of unified memory comes in; keeping only about 4 billion of them active for each token is what lets a four-year-old machine generate at a usable speed.
The hardware floor is concrete: a 2022 M2 Mac, 64GB of RAM, 1TB of storage. The implication is not that any laptop can do this. A machine with 8GB or 16GB of memory cannot. But that floor for useful local coding AI has dropped into the range of machines many developers already own, and that shift is the threshold the story is about.
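The memory arithmetic behind that floor can be sketched in a few lines. This is a back-of-the-envelope estimate, not a measurement of Boykis's setup: the post does not say which quantization she runs, so the bit widths below are assumptions chosen for illustration, and the figure covers weights only, leaving out the runtime, the operating system, and the model's working memory.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


# Total parameter count from the article. Every parameter must be resident
# in memory, even though only about 4 billion are active per token.
TOTAL_PARAMS = 26e9

# Assumed precisions for illustration; the article does not state which
# quantization is actually used.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB of weights")
```

Under these assumptions, even an aggressively quantized copy of the weights leaves little or no headroom on an 8GB or 16GB machine, while 64GB leaves room for the rest of the system.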
The honest caveats sit in the same post. Boykis says she has "no concrete scientific evidence" beyond her own practitioner observation, and the basis is her workflow, not a benchmark. She still uses frontier API models for harder work. And there is a real ceiling: the KV cache, the stored attention keys and values a transformer model keeps for every token in its context, grows as a session gets longer and can fill her 64GB of RAM on long agentic runs. Agentic coding, in her setup, means the model loops on its own to read, edit, and test code, often inside a sandboxed Docker container with limited execution access. Push the loop long enough, and the local machine runs out of room before the model runs out of ideas.
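Why the cache fills up is easier to see with the standard sizing formula: two tensors (keys and values) per layer, each holding one vector per attention head per token. The sketch below applies that formula to a hypothetical configuration. The layer count, head count, and head dimension are illustrative placeholders, not Gemma 4's published architecture, and real runtimes may shrink the cache with techniques such as sliding-window attention or cache quantization.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB (10^9 bytes) for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit cache entries.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# HYPOTHETICAL configuration for illustration only; these are not the
# real architecture numbers for any model named in the article.
HYPOTHETICAL = {"n_layers": 48, "n_kv_heads": 8, "head_dim": 128}

# The cache grows linearly with context length, which is why long
# agentic loops are the ones that run out of room.
for tokens in (8_000, 32_000, 128_000):
    size = kv_cache_gb(context_tokens=tokens, **HYPOTHETICAL)
    print(f"{tokens:>7} tokens: ~{size:.1f} GB of KV cache")
```

The point of the sketch is the shape of the curve rather than the exact numbers: every token the agent reads or writes adds a fixed amount to the cache, on top of the memory the weights already occupy.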
The threshold matters because of what it does to a developer's day. Boykis describes using local models as a "personalized Google," a fast, on-device lookup for code patterns, refactors, and small tasks, with a frontier API held in reserve for the harder problems. She has also bootstrapped a two-tower recommendation system, an architecture with two neural encoders (one for users, one for items), from a blank repository on the same local setup, a task she uses as a real-world test of whether the model can carry a non-trivial project on its own.
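For readers unfamiliar with the architecture, a two-tower model can be sketched in plain Python. This is a toy illustration of the general idea, not Boykis's code: each "tower" is reduced to an untrained embedding table with random weights, and every function name here is hypothetical. A real system would use trained neural encoders and a proper training loop.

```python
import math
import random


def make_tower(n_entities: int, dim: int, seed: int) -> list[list[float]]:
    """Stand-in encoder: one random, untrained embedding per user or item."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_entities)]


def encode(tower: list[list[float]], idx: int) -> list[float]:
    """Look up an entity's embedding and scale it to unit length."""
    vec = tower[idx]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def score(user_vec: list[float], item_vec: list[float]) -> float:
    """Dot product of the two tower outputs (cosine similarity here)."""
    return sum(u * i for u, i in zip(user_vec, item_vec))


def recommend(user_tower, item_tower, user_id: int, k: int = 3) -> list[int]:
    """Rank every item for one user and return the top-k item ids."""
    u = encode(user_tower, user_id)
    scored = [(score(u, encode(item_tower, j)), j) for j in range(len(item_tower))]
    return [j for _, j in sorted(scored, reverse=True)[:k]]


users = make_tower(n_entities=5, dim=8, seed=1)   # user tower
items = make_tower(n_entities=20, dim=8, seed=2)  # item tower
print(recommend(users, items, user_id=0))
```

The design choice that defines the pattern is that users and items are encoded separately and only meet at the final dot product, so item vectors can be computed once and reused across users.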
Other open-weights models sit in her local rotation: Mistral 7B, Gemma 3, OpenAI's GPT-OSS-20B, Qwen 3 MoE, and Qwen 2.5 Coder. Runtimes and front ends include raw llama.cpp, Open WebUI, llama-cpp-python, Ollama, llamafiles, and LM Studio. The point of the list is not the inventory. It is that the field of usable local models has widened to the point where a working developer can keep several on disk and pick by task.
The story is not "local AI has caught up." It is that a specific, practical threshold has been crossed on a specific, common piece of hardware, for a specific class of work. Boykis's 75% figure, her own estimate rather than a benchmark, is the honest center of gravity: good enough that she stopped double-checking the output on this task, not so good that she gives up the frontier API. A four-year-old laptop, given the right open-weights model and enough memory, can now carry real coding work. The rest of the curve, the part that still needs a frontier API and a data center, is also part of the story, and the reason she keeps both in the loop.