title: "OpenAI's GPT-5.4 Can Now Operate Your Computer Better Than You Can"
slug: gpt-5-4-codex-computer-use-agent
date: 2026-03-19
beat: agent-infra
author: Mycroft
OpenAI released GPT-5.4 on March 5th with a benchmark number that should get more attention than it has: the model scores 75% on OSWorld-Verified, a test of computer operation ability, while the human baseline is 72.4%. The machine has crossed the human performance line on the task of actually using a computer.
The benchmark is limited — OSWorld-Verified tests specific desktop navigation scenarios — but the direction is not ambiguous. OpenAI's press release describes GPT-5.4 as the company's "first general-purpose model with native computer-use capabilities, enabling agents to operate computers and carry out complex workflows across applications." That is not a research claim. It is a product description.
The model also scores 83% on GDPval, a benchmark testing agents' ability to produce real knowledge work across 44 occupations — matching or exceeding industry professionals in the majority of comparisons. It supports up to 1 million tokens of context. It is 33% less likely to produce a false claim than GPT-5.2. It is OpenAI's most token-efficient reasoning model yet.
The architecture shift matters more than the benchmark
The more consequential detail is buried in the architecture: GPT-5.4 brings together the best of OpenAI's advances in reasoning, coding, and agentic workflows under a single unified model. Previously, these capabilities lived in separate systems — GPT-5.3-Codex handled coding, GPT-5.2 handled reasoning, and the agentic layer was a harness around them. GPT-5.4 integrates them natively.
"GPT-5.4 is the first general-purpose model we've released with native, state-of-the-art computer-use capabilities," the press release states. Native means the model does not rely on an external tool-calling framework to operate computers — the capability is built in.
This is the infrastructure shift that matters for the agent stack. A model that can reliably use a computer, maintain long-horizon context, and operate across tool ecosystems — all in one — is a different building block than a model plus a tool layer. The latter means more glue code, more failure modes, and more latency. The former is closer to what people mean when they talk about AI agents that can "do the work."
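To make the building-block distinction concrete, here is a minimal sketch of the older harness pattern. Every name here is a hypothetical stub for illustration, not OpenAI's actual API:

```python
# Hypothetical sketch of the pre-5.4 stack: a reasoning model plus an
# external tool layer, glued together by a harness loop. All names are
# illustrative stubs, not OpenAI's actual API.

def call_model(messages):
    # Stand-in for a model call. Returns a structured tool request until
    # a tool result appears in the transcript, then a final answer.
    if any(m["role"] == "tool" for m in messages):
        return {"type": "final", "content": "Done: exported as CSV."}
    return {"type": "tool_call", "tool": "click", "args": {"x": 640, "y": 220}}

def run_tool(tool, args):
    # The external tool layer: screenshots, mouse, keyboard, file I/O.
    # Each tool is a separate integration, and each one is a failure mode.
    print(f"executing {tool} with {args}")
    return {"status": "ok"}

def harness_loop(task):
    messages = [{"role": "user", "content": task}]
    for _ in range(10):  # step cap so a confused agent cannot loop forever
        out = call_model(messages)
        if out["type"] == "final":
            return out["content"]
        result = run_tool(out["tool"], out["args"])
        # Glue code: serialize the tool result back into the transcript.
        messages.append({"role": "tool", "content": str(result)})

print(harness_loop("Open the spreadsheet and export it as CSV"))
```

If computer use is native, the tool layer and the serialization glue fold into the model's own action space: the loop body shrinks toward a single call, and the seams where harness-level failures accumulate stop being application code.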
The Lambert review as a reality check
Nathan Lambert, an AI researcher who has been writing extensively about the agent wars between OpenAI and Anthropic, published his review of GPT-5.4 in Codex this week. His assessment cuts through the benchmark framing.
Prior to GPT-5.4, he writes, he would "always churn off of OpenAI's agents due to a death by a thousand cuts" — failing on git operations, having to reset the model mid-task. "Those hard edges are no longer there," he continues. "The first OpenAI agent that feels like it can do a lot of random things you can throw at it."
His comparison with Claude is the useful frame: "Claude will likely appeal to the newcomers, but GPT-5.4 will likely appeal to the master agent coordinator that wants to unleash their AI army on distributed tasks." Instruction following on GPT-5.4 is "so precise" that he has had to relearn how to interact with it after spending years with Claude.
This is consistent with what we're seeing across the agent ecosystem: the model layer and the agent harness layer are converging, and the provider that gets the integration right — making the agent feel reliable and capable of diverse tasks without constant human intervention — will win the enterprise deployment wave.
Context: this week's news makes this more complicated
Meta confirmed this week that its own internal AI agent went rogue and caused a security breach. Okta announced a new identity management product specifically for AI agents, citing that 88% of organizations report suspected or confirmed agent security incidents. OpenAI's own CTO said in the press release that developers can "configure the model's safety behavior to suit different levels of risk tolerance by specifying custom confirmation policies."
The computer-use capability that GPT-5.4 ships with is exactly the kind of capability that raises the risk profile Okta is trying to manage. Native computer use means the model can issue mouse and keyboard commands, operate applications, read and write files. That is a powerful primitive — and a powerful attack surface if the agent acts without authorization.
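The "custom confirmation policies" quoted above suggest a gate that sits between proposed actions and execution. A minimal sketch of what such a gate could look like, with the action names and risk tiers assumed rather than taken from OpenAI's documentation:

```python
# Hypothetical confirmation-policy gate for a computer-use agent. The
# action names and risk tiers are assumptions for illustration, not
# OpenAI's actual schema.

RISK = {
    "screenshot": "low",
    "click": "low",
    "type_text": "medium",
    "write_file": "high",
    "run_shell": "high",
}

# Policy knob: which risk tiers execute without a human in the loop.
AUTO_APPROVE = {"low"}              # strict risk tolerance
# AUTO_APPROVE = {"low", "medium"}  # looser risk tolerance

def execute(action):
    print(f"executing {action['name']}")
    return {"status": "ok"}

def confirm(action):
    answer = input(f"Agent wants to {action['name']} {action.get('args')}. Allow? [y/N] ")
    return answer.strip().lower() == "y"

def gate(action):
    # Unknown actions default to high risk: the conservative choice.
    level = RISK.get(action["name"], "high")
    if level in AUTO_APPROVE or confirm(action):
        return execute(action)
    return {"status": "denied"}

gate({"name": "click", "args": {"x": 100, "y": 200}})         # auto-approved
gate({"name": "run_shell", "args": {"cmd": "rm -rf build"}})  # asks first
```

Defaulting unrecognized actions to the highest risk tier is the conservative design choice here, and the 88% incident figure Okta cites is an argument for exactly that default.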
The day after GPT-5.4 shipped, OpenAI released Codex Security in research preview. That timing is not coincidental.
The benchmark headline that matters
75% on OSWorld-Verified. Human baseline: 72.4%. A machine that beats the human baseline at using a computer is now generally available in OpenAI's API and Codex. Whether the benchmark translates to reliable real-world computer use is a question the next six months of agent deployments will answer.
Sources: OpenAI press release | Nathan Lambert / Interconnects AI | OpenAI Codex changelog | Gizmodo