PRISM - Tech Bureau

GPT-5.4: OpenAI Ships the AI That Can Use Your Computer Like a Human

The first general-purpose AI model with native computer-use just landed. It outperforms humans on desktop navigation, slashes token costs by 47%, and ships with a 1-million-token context window. The agent era stopped being theoretical today.

MARCH 10, 2026 // PRISM // BLACKWIRE TECH // 3,800 WORDS

The architecture of modern reasoning. OpenAI's GPT-5.4 brings computer-use capabilities to a general-purpose model for the first time. [Photo: Unsplash]

OpenAI dropped GPT-5.4 on Tuesday with no theatrical countdown and no press conference. Just a blog post and a model that can operate a desktop computer, navigate browsers, fill out spreadsheets, and click through software interfaces - without a human hand on the mouse.

That last part is the story. Every AI lab has been racing toward agentic AI for two years. The pitch has always been an AI that doesn't just answer questions but gets things done - booking flights, running code, handling workflows, operating applications. It has stayed theoretical. Either the models were too dumb to navigate real software reliably, or the integrations were too fragile, or the context windows were too short for real tasks.

With GPT-5.4, OpenAI is claiming it crossed a threshold. The model achieves 75.0% on OSWorld-Verified - a benchmark that tests real desktop navigation through screenshots and keyboard/mouse commands. Human performance on that same benchmark sits at 72.4%. The machine now outperforms the human on navigating a computer interface.

That number needs context - benchmarks are controlled environments, real software is messier. But the direction of travel is unmistakable. And several other numbers in the release point to the same conclusion: the theoretical era of AI agents just ended.

What Was Actually Released

OpenAI released three things simultaneously: GPT-5.4 (available in ChatGPT as "GPT-5.4 Thinking"), GPT-5.4 Pro, and a ChatGPT for Excel add-in. The base model is available to all ChatGPT users, including Plus subscribers, and through the API. The Pro tier targets "maximum performance on complex tasks" and is priced accordingly.

The model sits in the API, in Codex (OpenAI's autonomous coding agent), and in the main ChatGPT interface - with different features unlocked at each access point. ChatGPT users get the Thinking interface that shows reasoning plans upfront and accepts mid-response adjustments. API and Codex developers get the full computer-use primitives, tool search, and 1-million-token context.

GPT-5.4 merges two previously separate tracks. The GPT-5.3-Codex line focused on coding - it was genuinely excellent at software engineering tasks (56.8% on SWE-Bench Pro). GPT-5.2 handled general reasoning. GPT-5.4 unifies them: it matches GPT-5.3-Codex on coding benchmarks while adding computer-use, stronger visual understanding, and dramatically better factual accuracy across knowledge work.

The Excel add-in is the enterprise Trojan horse. It brings GPT-5.4 directly into the spreadsheet where financial analysts, accountants, and operations teams already live. On internal benchmarks testing spreadsheet modeling tasks "that a junior investment banking analyst might do," GPT-5.4 scores 87.3% versus 68.4% for GPT-5.2. When OpenAI launches something in Excel, deployment at Fortune 500 companies just became a lot easier to justify.

75.0% OSWorld-Verified (desktop navigation)
83.0% GDPval (professional work quality)
92.8% Online-Mind2Web (browser use)
47% Token reduction via tool search

The Computer-Use Leap: Why 75% Matters


Computer-use AI operates through screenshots and keyboard/mouse commands - the same interface a human uses. [Photo: Unsplash]

Computer-use is a deceptively hard problem. A human navigating a desktop takes thousands of visual micro-decisions for granted: recognizing a button, estimating where to click, interpreting error messages, knowing when a page has loaded, understanding what an application is doing based on its state.

For an AI, every one of those steps requires visual perception, contextual reasoning about what the software wants, and precise pixel-level action. The model doesn't get an API - it gets a screenshot and has to figure out where to click, just like a human would through a remote desktop session.
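The perceive-decide-act loop described above can be sketched in a few lines. Everything here is illustrative: the `Action` schema, the `stub_model` call, and the screenshot placeholder are stand-ins, not OpenAI's actual computer-use API.

```python
from dataclasses import dataclass

# Hypothetical action schema: one action per step, chosen from a screenshot.
@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def stub_model(screenshot: bytes, goal: str, history: list) -> Action:
    """Stand-in for the real model call: a scripted action sequence."""
    script = [Action("click", x=120, y=340), Action("type", text="Q1 totals"), Action("done")]
    return script[len(history)]

def run_agent(goal: str, max_steps: int = 10) -> list:
    """Screenshot in, keyboard/mouse action out - the same interface a human uses."""
    history = []
    for _ in range(max_steps):
        screenshot = b"<pixels>"          # would come from the desktop environment
        action = stub_model(screenshot, goal, history)
        history.append(action)
        if action.kind == "done":
            break
        # a real harness would dispatch the click/keystroke to the OS here
    return history

steps = run_agent("enter Q1 totals into the spreadsheet")
```

The point of the sketch is the shape of the loop: no API into the target application, only pixels in and input events out, with the model deciding when the task is finished.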

Previous attempts at this broke in predictable ways. Models would misidentify UI elements, click in the wrong location, fail to recognize error states, or lose track of what task they were doing mid-workflow. GPT-5.4's 75.0% on OSWorld-Verified means it successfully completes three out of four desktop tasks entirely through screenshot observation and keyboard/mouse control.

The benchmark's human score of 72.4% does not measure an expert user on their own familiar machine. It measures average human performance on unfamiliar desktop tasks. But the direction is clear: for structured, well-defined computer tasks, the AI now operates the machine more reliably than a typical human.

On browser-specific tasks, the numbers are even stronger. WebArena-Verified gives GPT-5.4 a 67.3% success rate using combined DOM and screenshot interaction. Online-Mind2Web - screenshot-only browser navigation - hits 92.8%, a major improvement over ChatGPT Atlas's 70.9% agent mode. Web navigation, which powers a massive share of real-world knowledge work (research, data gathering, form filling, web-based software), is now reliably within reach.

OpenAI has also made the model's behavior steerable by developers, which matters for deployment. Developers can configure custom confirmation policies - controlling what the agent does autonomously versus what it pauses and asks a human before doing. You can set a conservative mode where the AI asks before submitting any forms, or a permissive mode where it runs autonomously through known safe workflows. That kind of configurable trust boundary makes enterprise deployment tractable.
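A confirmation policy of this kind reduces to a simple gate between the agent and irreversible actions. The policy names and action categories below are assumptions for illustration, not OpenAI's actual configuration surface.

```python
# Actions a deployment might treat as irreversible or externally visible.
SENSITIVE = {"submit_form", "send_email", "delete_file", "make_payment"}

def requires_confirmation(action: str, policy: str) -> bool:
    """Decide whether the agent pauses for a human before performing an action."""
    if policy == "conservative":
        return action in SENSITIVE          # ask before anything irreversible
    if policy == "permissive":
        return action == "make_payment"     # only the highest-risk action pauses
    raise ValueError(f"unknown policy: {policy}")
```

The design choice worth noting: the trust boundary lives in deployment configuration, outside the model, so the same agent can run conservatively in finance workflows and autonomously in known-safe ones.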

"GPT-5.4 is the first general-purpose model we've released with native, state-of-the-art computer-use capabilities, enabling agents to operate computers and carry out complex workflows across applications." - OpenAI, official release announcement, March 2026

The Benchmark Numbers - A Detailed Breakdown

OpenAI published a full benchmark suite comparing GPT-5.4 against its predecessors. Here's the raw data that matters:

| Benchmark | GPT-5.4 | GPT-5.3-Codex | GPT-5.2 | What It Measures |
| --- | --- | --- | --- | --- |
| GDPval (wins or ties) | 83.0% | 70.9% | 70.9% | Professional knowledge work across 44 occupations |
| SWE-Bench Pro (Public) | 57.7% | 56.8% | 55.6% | Real-world software engineering tasks |
| OSWorld-Verified | 75.0% | 74.0% | 47.3% | Desktop navigation via screenshots/keyboard/mouse |
| Toolathlon | 54.6% | 51.9% | 46.3% | Multi-step tool use with real-world APIs |
| BrowseComp | 82.7% | 77.3% | 65.8% | Persistent web browsing for hard-to-find info |
| MMMU-Pro (visual reasoning) | 81.2% | - | 79.5% | Visual understanding and reasoning |

The GDPval number is striking. The benchmark tests "well-specified knowledge work" across 44 occupations spanning the top 9 industries by contribution to US GDP. Tasks include sales presentations, accounting spreadsheets, urgent care scheduling, manufacturing diagrams, and short video production. GPT-5.4 matches or exceeds industry professionals in 83% of these comparisons.

The previous versions of GPT-5 were stuck at 70.9% on this measure. GPT-5.4 jumps 12.1 percentage points - the largest single-version improvement in the series on a benchmark that directly measures professional-grade work quality. That is not a marginal gain.

SWE-Bench Pro measures the ability to fix real bugs and implement features in real open-source software repositories. The jump from 55.6% (GPT-5.2) to 57.7% is smaller, but this benchmark is genuinely hard - the tasks are unseen, real-world issues with complex codebases, not sanitized toy problems. Progress here is meaningful.

BrowseComp deserves special attention because it tests something most AI systems struggle with badly: persistent research. The benchmark tasks require the model to search multiple web sources, synthesize obscure or hard-to-find information, and produce accurate answers on "needle in a haystack" queries. GPT-5.4 Pro specifically reaches 89.3% - meaning OpenAI's highest-tier model now handles the kind of deep-web research that takes a human researcher hours, with high reliability.

On factual accuracy, OpenAI claims GPT-5.4 is 33% less likely to produce false individual claims and 18% less likely to produce responses containing any errors, compared to GPT-5.2. These numbers come from de-identified prompts where real users previously flagged errors - which makes them more credible than internal curated benchmarks. Hallucination reduction has been the single biggest enterprise adoption blocker; these numbers move that needle.

Tool Search: The Engineering Innovation Nobody Is Talking About

Underneath the headline performance numbers, GPT-5.4 introduces a technical architecture change that may have more practical impact than the benchmarks: tool search.

Here's the problem it solves. Modern AI agents work with tools - external capabilities like web search, code execution, database queries, API calls. Each tool has a definition that has to be loaded into the model's context to tell it what the tool does and how to use it. In complex agentic systems, there might be dozens or hundreds of tools. Loading all their definitions into every request burns enormous numbers of tokens, slows everything down, and - critically - can crowd the context window with definitions the model will never use on any given task.

Tool search flips the architecture. Instead of pre-loading all definitions upfront, the model receives a lightweight index of available tools. When it decides it needs a specific tool, it queries the index to load that tool's definition at the moment of use, appending it to the conversation in real time.
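The mechanism can be sketched with toy numbers. The tool names, summaries, and the chars-per-token heuristic below are all illustrative assumptions, not the actual implementation.

```python
# Sketch of tool search: keep a lightweight index in context and load a
# tool's full definition only at the moment of use.
TOOLS = {
    "query_crm":  {"summary": "Look up customer records", "definition": "..." * 200},
    "run_sql":    {"summary": "Execute a read-only SQL query", "definition": "..." * 200},
    "send_slack": {"summary": "Post a message to Slack", "definition": "..." * 200},
}

def rough_tokens(text: str) -> int:
    return len(text) // 4                 # crude chars-per-token heuristic

def preload_cost() -> int:
    """Old approach: every definition rides along with every request."""
    return sum(rough_tokens(t["definition"]) for t in TOOLS.values())

def tool_search_cost(needed: list) -> int:
    """New approach: index summaries up front, full definitions on demand."""
    index = sum(rough_tokens(t["summary"]) for t in TOOLS.values())
    loaded = sum(rough_tokens(TOOLS[name]["definition"]) for name in needed)
    return index + loaded
```

With three tools the savings are modest; the win compounds as the catalog grows, because the index scales with summaries while the old approach scales with full definitions.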

OpenAI tested this against the MCP Atlas benchmark with all 36 MCP (Model Context Protocol) servers enabled. Compared to traditional full-definition loading: tool search reduced total token usage by 47% while achieving the same accuracy. In production systems running thousands of agentic requests per day, a 47% token reduction translates directly to cost and latency cuts at scale.

This is the kind of infrastructure improvement that doesn't get headlines but determines which companies can actually afford to run AI agents in production. Enterprise customers with complex tool ecosystems - CRMs, ERPs, internal databases, communication APIs - have been bottlenecked by the cost of context. Tool search removes a significant part of that ceiling.

SECOND-ORDER EFFECT WATCH

Tool search also changes which tool ecosystems become viable. If you only have 50 tools, the pre-loading approach is manageable. With tool search, an agent can theoretically work with thousands of tools without context cost scaling linearly. That shifts the competitive landscape from "which AI has the biggest context window" to "which AI can navigate the largest tool ecosystem." MCP server ecosystems are about to get much larger.

OpenAI also launched a "fast mode" in Codex that delivers up to 1.5x faster token velocity with GPT-5.4. The description - "same model and the same intelligence, just faster" - suggests priority inference routing rather than a model change. Developers can access the same fast speeds via priority processing in the API. For coding workflows where iteration speed is the rate-limiting factor, that 1.5x speedup collapses the feedback loop substantially.

Codex Security: OpenAI's AppSec Agent Goes Public


Application security is a bottleneck that AI agents are now targeting directly. [Photo: Unsplash]

Alongside GPT-5.4, OpenAI moved a separate product out of private beta: Codex Security, an AI application security agent. The timing is not coincidental - Codex Security runs on GPT-5.4, and the computer-use and deep reasoning capabilities that define GPT-5.4 are precisely what make autonomous security scanning tractable.

Previously known internally as "Aardvark," Codex Security has been running in private beta since late 2025. The pitch is distinct from generic "AI finds bugs" tools. The agent builds a project-specific threat model first - analyzing the repository to understand what the system does, what it trusts, and where it is exposed. It then searches for vulnerabilities using that model as context, rather than running generic pattern-matching rules against the codebase.

The claimed results from the beta period are notable. OpenAI reports that false positive rates on detections fell by more than 50% across all repositories during the beta period. Noise - findings with over-reported severity - was cut by more than 90%. In one case, noise was reduced by 84% compared to initial rollout scans on the same repository. The signal-to-noise problem has been the primary reason AI security tools haven't gained traction; most generate too many false alerts for security teams to trust.

The scale of the beta is significant too. Over the last 30 days, Codex Security scanned more than 1.2 million commits across external repositories. It identified 792 critical findings and 10,561 high-severity findings. Critical issues appeared in fewer than 0.1% of scanned commits - which is the right number. A tool that finds critical vulnerabilities in every tenth commit isn't finding critical vulnerabilities; it's miscalibrating severity.
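The calibration claim checks out arithmetically, taking the reported figures at face value:

```python
# Reported beta figures from the release: commits scanned and critical findings.
commits = 1_200_000
critical = 792

rate = critical / commits      # fraction of scanned commits with a critical finding
# 792 / 1,200,000 = 0.00066, i.e. 0.066% - comfortably under the 0.1% threshold
```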

"The challenge isn't a lack of vulnerability reports, but too many low-quality ones. Maintainers told us they need fewer false positives and a more sustainable way to surface real security issues without creating additional triage burden." - OpenAI, describing feedback from open-source maintainers in the Codex Security release

Codex Security is now available in research preview for ChatGPT Pro, Enterprise, Business, and Edu customers, with free usage for the first month. That pricing structure - free preview to enterprise customers - is a deliberate adoption tactic. Security tooling decisions in enterprises are sticky; once a team integrates a security scanner into CI/CD pipelines, switching costs are high. OpenAI is buying adoption with a free month.

The open-source angle deserves attention. OpenAI is actively scanning open-source repositories it depends on and sharing high-impact findings with maintainers. This builds goodwill in developer communities, creates a visible track record of real vulnerability discovery, and - not incidentally - generates more training data about what real critical vulnerabilities look like versus noise. The open-source security posture doubles as a product improvement mechanism.

The Strategic Context: Pentagon, Amazon, and Microsoft

GPT-5.4 does not exist in isolation. OpenAI's recent strategic moves form the backdrop for understanding why this release looks the way it does and who it is being built for.

In late February, OpenAI signed an agreement with the US Department of War - the renamed Department of Defense. The scope of that agreement has not been fully disclosed, but BLACKWIRE reported on the controversy surrounding it in March: OpenAI's own researchers and robotics chief Jeffrey Kalinowski resigned over the defense partnership, citing conflicts with the company's original safety mission. GPT-5.4 is explicitly classified as "High cyber capability" under OpenAI's Preparedness Framework - the internal risk rating system that governs how capabilities are assessed before deployment.

The Amazon partnership (announced February 27) matters for GPT-5.4 specifically because it integrates OpenAI's models with Amazon Bedrock, AWS's managed AI service. GPT-5.4's computer-use capabilities running through Bedrock's infrastructure means enterprise customers who have already standardized on AWS get access to these capabilities through their existing cloud contracts. The distribution moat compounds fast when the model runs natively in the cloud platform where most enterprise applications already live.

Microsoft confirmed a continuing partnership the same week. The Microsoft Copilot integration with Anthropic's Claude Cowork (announced March 9) is a notable data point here - it signals that even Microsoft, which has a deep and exclusive-seeming OpenAI relationship, is hedging with Claude for specific long-running, multi-step task scenarios. The competitive pressure this creates likely accelerated GPT-5.4's computer-use prioritization. Anthropic and Google DeepMind have both been aggressive on the agentic AI angle; GPT-5.4 is a direct answer.

COMPETITIVE LANDSCAPE CHECK

Google's Gemini 2.0 Ultra has strong computer-use numbers in controlled benchmarks. Anthropic's Claude 3.7 Sonnet has "extended thinking" for deep reasoning. Neither has shipped a general-purpose model that simultaneously leads on computer-use, knowledge work quality, browser navigation, and code. GPT-5.4 is OpenAI trying to own the full agentic stack in a single release rather than competing at the capability-by-capability level.

A Timeline of the GPT-5 Series

Mid-2025
GPT-5.2 launches
General reasoning improvements. Sets the baseline that GPT-5.4 is measured against. 47.3% on OSWorld-Verified - solid but far below human performance.
Late 2025
GPT-5.3-Codex launches
OpenAI's coding-focused split. 56.8% on SWE-Bench Pro. Excellent for developers but not a general-purpose model. "Aardvark" (Codex Security) begins private beta.
February 27-28, 2026
OpenAI signs DoW agreement and Amazon partnership
Strategic deals that shift OpenAI's deployment distribution. DoW agreement triggers internal resignations including robotics chief Kalinowski.
March 6, 2026
Codex Security research preview launches
Application security agent based on frontier models. Formerly Aardvark. Free month for enterprise customers.
March 10, 2026
GPT-5.4 and GPT-5.4 Pro launch
First general-purpose model with native computer-use. 75% on OSWorld, 83% on GDPval, 1M token context, tool search. ChatGPT for Excel add-in. Fast mode in Codex.

What This Means for Knowledge Workers

The second-order effects of GPT-5.4 are harder to see than the benchmark numbers, but they are more consequential for people who work with computers for a living.

The GDPval result is where to start. Matching or exceeding industry professionals in 83% of comparisons across 44 occupations is not a narrow coding or writing benchmark. It spans accounting, healthcare scheduling, sales, manufacturing, and more. The tasks tested are specific, deliverable-focused professional work: not "write a summary" but "build this accounting spreadsheet" or "create this patient schedule" or "produce this sales presentation." In more than four out of five of those comparisons, the model now does the work at professional or better quality.

The implication is not that AI will immediately replace knowledge workers - it is more subtle and more immediate. The tasks that take junior workers most of their time - data formatting, routine analysis, first-draft documents, presentation creation, web research - are the tasks where GPT-5.4 is strongest. This compresses the value of early-career years in knowledge-work fields. The competitive moat of being a new hire who learns to do these tasks efficiently is shrinking.

For senior workers and managers, the change cuts differently. A manager who can deploy a GPT-5.4 agent to handle first-draft analysis, competitive research, and document preparation can effectively expand their productive output without hiring more analysts. This is not speculative - the Excel add-in is designed specifically for this workflow, and OpenAI's partnership with enterprise customers means the procurement conversation is already happening at the CFO level in large corporations.

The computer-use capability specifically unlocks a category of tasks that previous AI models simply could not touch: legacy software. Most enterprise environments run on systems that do not have APIs - old ERP platforms, proprietary internal tools, industry-specific software with no integration pathway. If the AI can operate software through the screen the way a human contractor would, these systems become automatable without any infrastructure modification. The RPA (robotic process automation) industry built a $14 billion market solving exactly this problem with brittle script-based tools. GPT-5.4 is a more flexible, more capable version of what that entire industry has been trying to build for fifteen years.

The 1-million-token context window matters for task length, not just document size. An agent working on a multi-day project - reviewing a 300-page legal brief, monitoring a software deployment, conducting a multi-source research project - needs a context that doesn't reset. 1M tokens is approximately 750,000 words, or 1,500 pages of text. For most professional tasks, that is effectively unlimited within a single session. The agent can remember everything it has done and seen in a work session without summarization loss.
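The page math above rests on two common rules of thumb - roughly 0.75 words per token and roughly 500 words per printed page - neither of which is exact:

```python
# Back-of-envelope context math; 0.75 words/token and 500 words/page are
# conventional approximations, not measured constants.
context_tokens = 1_000_000
words = int(context_tokens * 0.75)       # about 750,000 words
pages = words // 500                     # about 1,500 pages
```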

The Preparedness Question: High Cyber Capability

OpenAI's release documentation contains a detail that deserves more attention than it has received: GPT-5.4 is classified as "High cyber capability" under the company's Preparedness Framework. The same classification applied to GPT-5.3-Codex.

OpenAI's Preparedness Framework is the internal risk evaluation system that assesses models for dangerous capabilities across categories including cybersecurity, chemical/biological weapons assistance, and manipulation. A "High" cyber classification means the model's autonomous capabilities in offensive security contexts are at a level that requires careful deployment controls.

The combination of High cyber classification with native computer-use is particularly significant. A model that can operate software autonomously, navigate web interfaces, and find security vulnerabilities - while rating as "High" on cyber capability assessments - is, by definition, a powerful autonomous attack tool if pointed at the wrong targets. OpenAI addresses this through the configurable confirmation policies and safety behavior tuning available to developers. But the controls are developer-configured, not enforced by the model itself.

Codex Security - the application security agent - ships alongside this. The defensive and offensive implications of a highly capable computer-use AI exist on the same spectrum. The same capabilities that let Codex Security find authentication bypasses and SSRF vulnerabilities in friendly repositories could, with different instructions, attempt the same operations against hostile targets. This is not unique to OpenAI - it is the fundamental dual-use problem of AI security research. But GPT-5.4 raises the ceiling on both sides simultaneously.

Security researchers and red-teamers will be testing GPT-5.4's autonomous capabilities aggressively in the weeks ahead. The OSWorld-Verified benchmark tests benign desktop tasks; what the model does when the task involves probing a network or bypassing authentication is a different and open question.

The Race Has a New Leader - For Now

GPT-5.4 is the most capable general-purpose AI agent available as of today. The benchmarks are real, the computer-use capability is production-grade rather than demo-grade, and the infrastructure improvements - tool search, fast mode, expanded vision - address the practical bottlenecks that have blocked enterprise deployment.

The caveats are real too. Benchmark conditions are controlled; the real world has broken login flows, unexpected error states, software that behaves differently between sessions, and multi-step tasks that fail for reasons no benchmark anticipates. 75% success on OSWorld means 25% failure. In a workflow where the AI is acting autonomously, that failure rate needs robust error detection and human-in-the-loop recovery to be safe in production.
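What that recovery scaffolding looks like in practice is a retry loop with a human fallback. The sketch below is generic; `run_step` and `escalate_to_human` are hypothetical stand-ins for whatever the deployment wires in.

```python
def run_with_recovery(run_step, escalate_to_human, max_retries: int = 2):
    """Retry a flaky agent step, then hand off to a human on repeated failure.

    run_step() returns (success, result); escalate_to_human() handles the rest.
    """
    for attempt in range(max_retries + 1):
        ok, result = run_step()
        if ok:
            return result
    return escalate_to_human()
```

Under an independence assumption, a 75%-reliable step retried three times leaves only 0.25 ** 3, about 1.6%, of tasks reaching the human fallback - which is why wrapper logic like this, not raw model accuracy, tends to determine production viability.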

But the trajectory is clear. GPT-5.2 sat at 47.3% on OSWorld six months ago. GPT-5.4 is at 75.0% today. If that rate of improvement continues - and there is no structural reason it should stop - the residual 25% failure gap closes in the next two to four model generations. That timeline puts reliable, fully autonomous computer operation in the 2027 window.

Every major AI lab is racing toward the same finish line. Google's Gemini 2.0, Anthropic's Claude 3.7, and now GPT-5.4 are all competing to own the agentic layer - the part that actually does things in the world rather than just answering questions. OpenAI shipped first with a unified, general-purpose solution. The others will answer. The pace of that back-and-forth is now measured in weeks, not quarters.

The agent era is not coming. It arrived on a Tuesday morning with no countdown.
