Andrej Karpathy pushed a 630-line Python script to GitHub on March 7, went to sleep, and woke up to find his AI agent had run 50 experiments without a single human instruction. The agent had found a better learning rate and committed the results to git. Karpathy called the system autoresearch. Seventeen hours later, it had independently rediscovered RMSNorm and tied embeddings — two techniques that took Google Brain and OpenAI nearly eight years to formalize, per Forbes. That is the story.
The autoresearch tool is structurally simple. It runs on a stripped-down nanochat training framework condensed to a single editable file. The agent reads a plain-language instruction file, modifies training code, runs each experiment for exactly five minutes, checks whether validation loss improved, and repeats. At 12 experiments per hour on a single NVIDIA H100 GPU, one overnight run yields roughly 100 autonomous iterations, at roughly $0.40 per experiment. In two days of unattended operation, the agent completed approximately 700 changes, identified around 20 additive improvements that transferred cleanly to larger models, and drove the time-to-GPT-2 benchmark from 2.02 hours to 1.80 hours — an 11 percent efficiency gain on code Karpathy described as already well-tuned, according to Forbes.
Karpathy described the goal plainly: "The goal is not to emulate a single PhD student, it's to emulate a research community of them," he wrote on X. The three primitives are an editable file, a scalar metric, and a fixed time limit per experiment — what Janakiram MSV at The New Stack called the Karpathy Loop. The agent caught oversights in attention scaling and regularization that Karpathy said he had missed manually over two decades of ML work. "This is an actual LLM writing arbitrary code, learning from previous experiments, with access to the internet," Karpathy wrote. "It's not even close" to classical AutoML methods like neural architecture search.
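Those three primitives reduce to a few dozen lines. The sketch below is hypothetical stand-in code, not the autoresearch implementation: `run_experiment` fakes a time-capped training run with a toy loss surface, and `propose_change` fakes the LLM's code edit as a random hyperparameter perturbation. Only the shape of the loop — edit the file, run under a fixed budget, keep whatever lowers the scalar metric — mirrors the description above.

```python
import random

def run_experiment(config: dict, time_limit_s: int = 300) -> float:
    """Stand-in for a time-capped training run returning validation loss.
    The real loop launches training on the GPU, stops it after
    time_limit_s seconds, and reads the metric from the logs. Here a toy
    loss surface is minimized near a hypothetical lr of 3e-4."""
    return abs(config["lr"] - 3e-4) / 3e-4 + 0.5

def propose_change(config: dict) -> dict:
    """Stand-in for the LLM agent editing the single training file;
    here it just perturbs one hyperparameter multiplicatively."""
    new = dict(config)
    new["lr"] *= random.choice([0.5, 0.8, 1.25, 2.0])
    return new

def research_loop(config: dict, iterations: int) -> tuple[dict, float]:
    best_loss = run_experiment(config)
    for _ in range(iterations):
        candidate = propose_change(config)       # edit the editable file
        loss = run_experiment(candidate)         # fixed time budget per run
        if loss < best_loss:                     # the scalar metric decides
            config, best_loss = candidate, loss  # "commit" the change
    return config, best_loss

random.seed(0)
best_config, best_loss = research_loop({"lr": 1e-3}, iterations=100)
```

At five minutes per call to `run_experiment`, 100 iterations is roughly the overnight run the article describes; nothing in the loop itself requires a human.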
The implication, as Karpathy noted, is that every frontier lab will run this: "All LLM frontier labs will do this. It's the final boss battle." The bottleneck shifts from the human researcher's ability to write code to their ability to define the search constraints. What remains human is setting the metric and interpreting the results.
Shopify CEO Tobias Lütke tested the same pattern on an internal model, running 37 experiments overnight and reporting a 19 percent performance gain, per Forbes. Eric Siu, founder of ad agency Single Grain, estimated that most marketing teams run roughly 30 experiments per year. "The next generation will run 36,500-plus," he wrote on X. "Easily. They'll run experiments while you sleep."
Karpathy's loop is optimization. A separate body of work shows something more unsettling: that AI agents running with institutional memory don't just optimize faster — they learn.
Underneath both patterns sits a more foundational problem: the memory architecture itself. A paper posted to arXiv in October 2025 by researchers at Beijing Jiaotong University, Hithink Research, and Huawei's Noah's Ark Lab, led by Yuxiang Zhang, proposed Memory-as-Action (MemAct), which treats working-memory management not as storage but as a set of learnable policy actions. Rather than passively accumulating context, the agent learns through end-to-end reinforcement learning when to retain, compress, or discard segments of its history, or to synthesize new content. The key training challenge: when the agent deletes context, that content has already influenced subsequent token representations, creating a train-inference mismatch. The researchers solved it with Dynamic Context Policy Optimization (DCPO), which segments trajectories so that standard RL infrastructure can handle non-monotonic context updates. Their result: MemAct-RL-14B matches the accuracy of Qwen3-235B, a model 16 times larger, while cutting average context length roughly in half, per the paper. The learned strategies adapt to the backbone model; different models develop different memory-editing patterns based on their specific strengths and limitations.
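The memory-as-action idea can be illustrated with a toy policy. In MemAct proper the policy is learned end to end with RL via DCPO; the fixed relevance thresholds, the `Segment` type, and the first-sentence compression below are all invented stand-ins for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class MemOp(Enum):
    RETAIN = "retain"
    COMPRESS = "compress"
    DISCARD = "discard"

@dataclass
class Segment:
    text: str
    relevance: float  # stand-in for whatever signal the learned policy uses

def policy(seg: Segment) -> MemOp:
    """Fixed-threshold stand-in for MemAct's learned action policy."""
    if seg.relevance > 0.7:
        return MemOp.RETAIN
    if seg.relevance > 0.3:
        return MemOp.COMPRESS
    return MemOp.DISCARD

def apply_memory_actions(context: list[Segment]) -> list[Segment]:
    """Treat memory management as actions over context segments."""
    kept = []
    for seg in context:
        op = policy(seg)
        if op is MemOp.RETAIN:
            kept.append(seg)
        elif op is MemOp.COMPRESS:
            # Crude compression stand-in: keep only the first sentence.
            first = seg.text.split(". ")[0].rstrip(".") + "."
            kept.append(Segment(first, seg.relevance))
        # DISCARD drops the segment entirely.
    return kept

context = [
    Segment("Found the flag format in step 3. Long detail follows.", 0.9),
    Segment("Exploration log. Mostly dead ends. A few hints.", 0.5),
    Segment("Small talk with the user.", 0.1),
]
compacted = apply_memory_actions(context)
```

The train-inference mismatch the paper describes arises exactly at the discard step: tokens generated earlier were conditioned on a segment that no longer exists, which is the problem DCPO's trajectory segmentation addresses.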
Andrew Dhillon, working on a multi-agent pipeline for cybersecurity challenge design, built a memory layer called SAGE — consensus-validated institutional knowledge with domain tagging, confidence scoring, and governance signatures. He ran a controlled experiment: 50 agents with consensus-validated memory against 50 without. The result: an agent with an 18-line onboarding prompt and access to institutional memory outperformed an agent with a 120-line expert-crafted prompt at the highest difficulty level, Dhillon reported on Medium.
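Dhillon describes SAGE entries as carrying domain tags, confidence scores, and governance signatures. The sketch below invents a minimal record with those three fields; the field names, the hash-based signature, and the 0.6 threshold are assumptions for illustration, not SAGE's actual schema.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    lesson: str
    domain: str        # domain tagging, e.g. "crypto" or "web"
    confidence: float  # consensus confidence score in [0, 1]
    signature: str     # governance signature over the content

def sign(lesson: str, approver: str) -> str:
    """Toy governance signature: a short hash binding content to approver."""
    return hashlib.sha256(f"{approver}:{lesson}".encode()).hexdigest()[:12]

def retrieve(store: list[MemoryEntry], domain: str,
             min_conf: float = 0.6) -> list[str]:
    """Onboard an agent with only consensus-validated, on-domain lessons."""
    return [e.lesson for e in store
            if e.domain == domain and e.confidence >= min_conf]

lesson = "Prefer AES-GCM over AES-CBC for new crypto challenges"
store = [
    MemoryEntry(lesson, "crypto", 0.92, sign(lesson, "governance")),
    MemoryEntry("Unvalidated hunch about XSS filters", "web", 0.35,
                sign("Unvalidated hunch about XSS filters", "governance")),
]
onboarding = retrieve(store, "crypto")
```

The point of the experiment is visible in the retrieval call: a short onboarding prompt plus `retrieve` can substitute for a hand-crafted expert prompt, because the expertise lives in the validated store rather than in the prompt.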
That was the validation. What came next was the finding.
Dhillon ran a simulated cybersecurity company — CipherForge Labs — with 11 specialized agents across five departments. Control arm: expert-crafted prompts up to 200 lines per agent, no institutional memory, each run independent. Treatment arm: three-line prompts with memory enabled, sequential runs where knowledge accumulates. The memory-disabled arm produced the same average quality on each run because agents couldn't learn from outcomes. The memory-enabled arm had a different problem: it grew an echo chamber, reinforcing techniques that worked without knowing why.
The fix was a red team. Dhillon ran a separate Claude Opus 4.6 agent blind against each challenge the SAGE team built, documented the steps the agent took to solve it, and fed those failure reports back into institutional memory. After nine seeding runs, he ran 10 consecutive difficulty-3.0 challenges, the hardest tier, with red-team feedback enabled.
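The cycle itself is mechanically simple: attack blind, log how the challenge fell, write that back into memory. A toy sketch, with an invented `toy_attack` standing in for the blind solving agent:

```python
def red_team_cycle(challenges, attack, memory):
    """One feedback cycle: a blind agent attacks each challenge, and its
    solve report (success plus the steps taken) is appended to the shared
    institutional memory for the design agents to read on the next run."""
    for ch in challenges:
        report = attack(ch)  # the attacker sees only the challenge itself
        memory.append({
            "challenge": ch["name"],
            "broken": report["solved"],
            "steps": report["steps"],  # how it fell, not just that it fell
        })
    return memory

def toy_attack(ch):
    """Invented attacker: anything below difficulty 2.5 falls in one step."""
    solved = ch["difficulty"] < 2.5
    return {"solved": solved,
            "steps": ["reused a known padding-oracle exploit"] if solved else []}

memory = red_team_cycle(
    [{"name": "cbc-oracle", "difficulty": 2.0},
     {"name": "gcm-hardened", "difficulty": 3.0}],
    toy_attack,
    memory=[],
)
```

The design choice that matters is recording the steps, not just the pass/fail bit: the steps are what let the designer agents learn why a construction broke.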
On the first run after seeding, the designer agent switched from AES-CBC to AES-GCM for cryptographic challenges. It had learned not just what to use but why. By the final runs, the treatment arm achieved near-perfect calibration on what the red team found easy to break versus what held. The Spearman correlation between predicted and actual difficulty was 0.716 (p=0.020) for the memory-enabled treatment arm, against 0.040 (p=0.901) for the control: a statistically significant shift in an agent's ability to anticipate failure modes based on accumulated experience, Dhillon reported.
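The calibration statistic here is a Spearman rank correlation between predicted and red-team-measured difficulty. For the mechanics, here is the classic no-ties formula on invented numbers; the 0.716 figure comes from Dhillon's data, not from this sketch.

```python
def ranks(xs):
    """Rank positions 1..n (assumes no ties, as in the classic formula)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

predicted = [2.1, 2.4, 2.7, 2.9, 3.0]  # invented designer difficulty estimates
actual    = [2.0, 2.5, 2.6, 3.0, 2.8]  # invented red-team outcomes
rho = spearman(predicted, actual)      # -> 0.9: well calibrated, not perfect
```

A rho near 1 means the agent's difficulty ordering matches the red team's; a rho near 0, as in the control arm, means its predictions carry no information about what will actually break.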
The distinction from Karpathy's loop is important. Autoresearch optimizes a fixed objective — validation loss, training speed. SAGE agents with red team feedback were learning a distribution of failure, updating their priors across runs, and improving without a human in the loop. "An entire CTF challenge company running on three-line prompts," as Dhillon titled one section.
The broader displacement question has been analyzed by Sahaj Garg, cofounder and CTO of AI toolmaker Wispr. In a March 2026 essay, Garg described his own practice: producing output he estimates at four weeks of engineering work now takes roughly 45 minutes with a well-directed prompt in an AI coding tool. His primary intellectual partners, for reasoning and strategic thinking, are AI systems. He described his own intelligence as "a commodity on tap," with the residual human skills being "taste, direction, synthesis — not raw cognitive horsepower." The autoresearch pattern, Garg argues, is the engineering-research version of the same shift.
Karpathy released the autoresearch code on GitHub; it has since accumulated nearly 37,000 stars. The obvious question, whether the approach scales to frontier-scale training codebases where the search space is orders of magnitude larger, remains open. Karpathy himself had no doubt: "Doing it is just engineering, and it's going to work." Whether it works in 17 hours or 17 weeks at actual production scale is the experiment nobody has run yet.
What the two demonstrations taken together suggest is that the loop is closed on both ends. Autoresearch shows you can find better techniques without human hypothesizing. SAGE-with-red-team shows that with the right memory architecture and feedback signal, agents don't just find better techniques — they build on what they found. The researcher who goes to sleep is no longer the bottleneck.