Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally
A developer named Dan Woods wanted to see if he could run a 397-billion-parameter AI model on a MacBook. He used Claude Code to find out.
The result, documented in a paper and shared on GitHub, is a working implementation of Qwen3.5-397B-A17B running at more than 5.5 tokens per second on a stock 48GB MacBook Pro M3 Max, a machine with far too little RAM to hold the model whole. The model's weights occupy 209GB on disk, or 120GB after quantization.
The key technique is not new. Apple's 2023 paper "LLM in a Flash" described a method for running LLMs that exceed available DRAM by storing model parameters in flash memory and loading them into RAM on demand. Woods applied this to a Mixture-of-Experts model, Qwen3.5-397B-A17B, in which each token activates only a subset of the model's expert weights. That sparsity makes it possible to stream just the relevant experts into memory rather than holding the entire model.
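A minimal sketch of the idea in Python with NumPy, not the project's actual MLX/Metal code: expert weights sit in one file on flash, the file is memory-mapped, and only the experts the router selects for the current token are ever copied into RAM. The file name, shapes, and layout here are illustrative assumptions.

```python
# Toy flash-streaming MoE step: only the routed experts are read from disk.
import numpy as np

NUM_EXPERTS, D_MODEL, D_FF = 8, 64, 256

# Build a stand-in checkpoint on "flash" (an ordinary file on disk).
rng = np.random.default_rng(0)
rng.standard_normal((NUM_EXPERTS, D_MODEL, D_FF)).astype(np.float32).tofile("experts.bin")

# Memory-map it: no expert is loaded into RAM until its pages are touched.
flash = np.memmap("experts.bin", dtype=np.float32, mode="r",
                  shape=(NUM_EXPERTS, D_MODEL, D_FF))

def moe_forward(x, router_logits, k=2):
    """Route one token to its top-k experts, copying only those from flash."""
    top_k = np.argsort(router_logits)[-k:]      # indices of the selected experts
    gates = np.exp(router_logits[top_k])
    gates /= gates.sum()                        # softmax over the selected k
    out = np.zeros(D_FF, dtype=np.float32)
    for gate, idx in zip(gates, top_k):
        w = np.array(flash[idx])                # pages this one expert into RAM
        out += gate * (x @ w)                   # stand-in expert: one matmul
    return out

token = rng.standard_normal(D_MODEL).astype(np.float32)
print(moe_forward(token, rng.standard_normal(NUM_EXPERTS)).shape)  # (256,)
```

The `k` parameter is the experts-per-token count the article discusses below; in a real system the I/O, caching, and dequantization around that inner loop are where all the engineering effort goes.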
What is new is how the optimization happened. Woods used a variant of Andrej Karpathy's "autoresearch" pattern: he fed the Apple paper to Claude Code and had it run 90 experiments automatically, producing optimized MLX Objective-C and Metal code at each step. The final system quantizes expert weights to 2 bits, while the non-expert components (embedding tables and routing matrices) are kept at their original precision; they total 5.5GB and stay resident in memory. The system runs four experts per token instead of the usual ten; Woods notes that quality degradation first appears at three.
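The 2-bit quantization can be pictured with a toy pack/unpack routine. This is an illustrative NumPy sketch, not the project's Metal kernels: each weight maps to one of four levels and four weights share one byte, which is where a 16x shrink from float32 (or 8x from a 16-bit checkpoint) comes from, minus the small per-row scales.

```python
# Toy 2-bit weight quantization: 4 levels per weight, 4 weights per byte.
import numpy as np

def quantize_2bit(w):
    """Per-row symmetric quantization to codes 0..3 plus a float scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 1.5
    q = np.clip(np.round(w / scale + 1.5), 0, 3).astype(np.uint8)
    # Pack four 2-bit codes into each byte (assumes row length divisible by 4).
    packed = q[:, 0::4] | (q[:, 1::4] << 2) | (q[:, 2::4] << 4) | (q[:, 3::4] << 6)
    return packed, scale

def dequantize_2bit(packed, scale):
    q = np.empty((packed.shape[0], packed.shape[1] * 4), dtype=np.uint8)
    for i in range(4):                           # unpack the four 2-bit fields
        q[:, i::4] = (packed >> (2 * i)) & 0b11
    return (q.astype(np.float32) - 1.5) * scale  # levels {-1.5,-.5,.5,1.5}*scale

w = np.random.default_rng(1).standard_normal((8, 16)).astype(np.float32)
packed, scale = quantize_2bit(w)
print(packed.nbytes, "bytes packed vs", w.nbytes, "bytes float32")  # 32 vs 512
```

With only four representable values per weight, the round-trip error is large in relative terms, which is exactly why the thinness of the quality evaluations matters.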
The paper documenting the work was written mostly by Claude Opus 4.6. The code is on GitHub as danveloper/flash-moe.
Simon Willison, who flagged the work on his blog, noted an important caveat: the claim that output quality at 2-bit is indistinguishable from 4-bit rests on evaluations that are "quite thin." The model's actual quality on real tasks is not well established. This is a common problem with quantized model deployments: the benchmark numbers look acceptable, but the qualitative experience is harder to measure.
The experiment demonstrates that Apple's flash-memory streaming technique works at a scale that was not previously practical, and that optimizing the inference pipeline can itself be automated. Whether the quality holds up in practice is a separate question, one the thin evaluations cannot answer.