OpenAI’s /goal gives the Ralph loop a stop condition
The question every team running a coding agent eventually faces is not "can it keep working while I step away?" It is "who decides when it is allowed to stop?" The cheap answer is already running in thousands of repos: a shell loop that keeps feeding the same prompt to an agent until something works or the token meter catches fire. The more serious answer is what OpenAI shipped in Codex CLI 0.128.0: a feature that pushes the agent to prove its work against actual project state before it can declare done.
Geoffrey Huntley, an engineer who has been documenting agent workflows, named the cheap version last year. His "Ralph" loop, `while :; do cat PROMPT.md | claude-code ; done`, is a one-line Bash loop that keeps retrying a coding agent without ever checking whether the work is actually complete. It is named after Ralph Wiggum from The Simpsons, which is about the right amount of dignity for unattended software labor. Simon Willison, the independent developer and writer, flagged Codex's new /goal command as that same pattern reaching OpenAI's official toolchain. He is right, with one upgrade that matters: OpenAI is trying to make the loop accountable.
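Spelled out, the pattern is three lines of shell, and what it leaves out is the point: there is no completion check, no budget check, and no record of why any given run ended.

```sh
# Huntley's Ralph loop, unrolled: retry forever.
# Nothing here asks whether the objective was met or what it cost.
while :; do
  cat PROMPT.md | claude-code
done
```

It stops when a human kills it, the API refuses it, or the money runs out.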
The Codex 0.128.0 release notes describe /goal as persisted workflows with app-server APIs, runtime continuation, and terminal controls to create, pause, resume, and clear goals. In plain English: Codex can now hold a task in memory across turns and keep working on it without a human re-prompting each step. The feature sounds like a quality-of-life improvement until you read the prompt that comes with it.
The key file is [continuation.md](https://github.com/openai/codex/blob/6014b6679ffbd92eeddffa3ad7b4402be6a7fefe/codex-rs/core/templates/goals/continuation.md), a template Codex injects when deciding whether to keep working. It tells the agent not to accept proxy signals as proof of completion. Passing tests, a complete manifest, a successful verifier, or a lot of implementation effort count only if they actually cover every requirement in the original objective. The agent has to audit the result against real system state: the files, commands, tests, and user request. Its own report that the job is done is not enough.
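continuation.md is a prompt, not code, but the check it demands is easy to picture as a wrapper script. The sketch below is a loose illustration under loud assumptions, not OpenAI's implementation: the agent command, the `make test` verifier, and the idea that requirements map to concrete artifact paths are all placeholders.

```sh
#!/usr/bin/env sh
# Sketch of a completion audit. Assumptions: ./run-agent.sh is whatever
# invokes the agent, "make test" is the project's verifier, and
# REQUIREMENTS.md lists one required artifact path per line.
cat PROMPT.md | ./run-agent.sh

# Proxy signal: a green test run. Necessary, not sufficient.
make test || { echo "not done: tests failing"; exit 1; }

# Audit against the original objective, not the agent's transcript:
# every listed requirement has to exist as real project state.
while read -r artifact; do
  [ -e "$artifact" ] || { echo "not done: missing $artifact"; exit 1; }
done < REQUIREMENTS.md

echo "done: verifier passes and every required artifact exists"
```

The interesting part is the order of trust: the agent's own summary of its work never appears in the script at all.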
That distinction is the story. The Ralph loop trusted the agent to try again and again and to self-report honestly. Codex's version adds a termination contract. The [budget_limit.md](https://github.com/openai/codex/blob/6014b6679ffbd92eeddffa3ad7b4402be6a7fefe/codex-rs/core/templates/goals/budget_limit.md) prompt handles the other end: when the token budget runs out, Codex is told to summarize useful progress, name remaining work or blockers, and leave the user with a next step. It does not declare victory. It leaves a receipt.
That receipt is the quiet architecture worth watching. Long-running agents fail in two unglamorous ways. They stop early because they mistake motion for progress: the model says it is done but the files do not reflect the requirement. Or they keep going because they cannot recognize a wall. A feature that says "continue until done" is only useful if "done" means something stricter than "the model feels finished." The robot says it is done. The filesystem gets a vote.
This is where the pattern starts to look less like a hacker trick and more like job control for software labor. If a coding agent runs unattended for an hour, the useful output is not just code. It is an auditable answer to three questions: what changed, what is still broken, and why did the agent stop?
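There is no public schema for that yet, so treat the following as a hedged sketch of what a stop receipt could look like when the wrapper, not the model, writes it. The file name, the STOP_REASON variable, and the choice of git and make as sources of truth are illustrative assumptions.

```sh
#!/usr/bin/env sh
# Sketch of a stop receipt: what changed, what is still broken, why it stopped.
# agent-receipt.txt and STOP_REASON are invented names for illustration.
STOP_REASON=${STOP_REASON:-"budget exhausted"}   # or "objective met", "blocked"

{
  echo "## What changed"
  git diff --stat HEAD                           # staged and unstaged changes against HEAD

  echo "## What is still broken"
  make test 2>&1 | tail -n 20                    # tail of the latest verifier run

  echo "## Why the run stopped"
  echo "$STOP_REASON"
} > agent-receipt.txt
```

A human skimming that file after an hour-long run gets the three answers without replaying the transcript.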
The convergence is broader than Codex. Huntley's Ralph loop popularized the brute-force version in mid-2025. Anthropic's Claude Code ecosystem has been moving in the same direction: MindStudio describes AutoDream as a planning loop that runs until the goal is met or the agent gets stuck, Channels as a real-time event stream from a running session, and Dispatch as a way to trigger Claude Code programmatically as a background worker. Different wrappers, same pressure: developers want agents that can take a job, work through the middle, and produce an intelligible stop condition.
The counterforce is cost and attention. Ars Technica's Samuel Axon reported in April on Anthropic testing the removal of Claude Code from the Pro plan, with users complaining that looping debugging sessions could consume tokens rapidly. That complaint is not incidental. Persistent agents turn ambiguous tasks into metered spend. If the objective is poorly specified, the loop does not become smarter by running longer. It becomes a paid way to discover that nobody wrote the acceptance criteria.
OpenAI seems to recognize this. In the same 0.128.0 release, the company deprecated --full-auto, the flag closest to giving Codex a blank check, and pointed users toward explicit permission profiles and trust flows. That is the other half of the design. Longer-running agents need narrower lanes.
For builders, the 90-day watch is not whether every coding tool adds a cute /goal command. It is whether agent platforms start exposing completion audits as a product surface: receipts, stop reasons, budget summaries, and state checks that can be reviewed after the run. A team choosing an agent stack should ask a boring procurement question before the demo: when the agent says it finished, what evidence can I inspect?
For investors, the same question points to where margin may move. The valuable layer may not be the wrapper that keeps an agent alive. That part is already becoming common. The valuable layer is the system that constrains the run, prices it, records why it stopped, and proves the work matches the user's request. If that sounds like old enterprise software with a token meter attached, yes. That is why it might actually matter.
The unresolved question is whether those checks survive messy real projects. A prompt can tell an agent not to trust proxy signals. It cannot guarantee the tests cover the actual objective, or that the objective was well written in the first place. Watch the failure reports, not the launch notes. The next useful benchmark for coding agents may be much duller than "can it build the app?" It may be: can it tell, reliably, when the app is actually done?