TITLE: OpenAI's GPT-5.4 Surpasses Humans at Autonomous Desktop Tasks
DATE: 2026-04-01
COMPANY: OpenAI
TOPIC: Model Releases

SUMMARY: OpenAI launched GPT-5.4 on 5 March 2026, the company's first general-purpose model with native computer-use capabilities. The model scored 75% on the OSWorld-V benchmark, outperforming the human baseline of 72.4%, and 83% on the GDPVal benchmark for economically valuable knowledge work. It marks the clearest shift yet from AI as a conversational tool to AI as an autonomous digital coworker capable of executing multi-step tasks across software environments.

WHAT CHANGED: OpenAI launched GPT-5.4 on 5 March 2026, making it available simultaneously through ChatGPT, the OpenAI API, and the Codex development environment. The release was framed as a unification of the company's separate model lines, combining general-purpose reasoning, coding capabilities from the GPT-5.3-Codex series, and new agentic computer-use features into a single model.

The most significant new capability is native computer use. GPT-5.4 is the first OpenAI general-purpose model that can directly interact with software environments, taking actions such as clicking buttons, navigating menus, filling forms, switching between applications, and executing sequential workflows. On the OSWorld-V benchmark, which simulates real desktop productivity tasks including navigating applications, filling spreadsheets, and interacting with software interfaces, the model scored 75%, against a human baseline of 72.4%. On the GDPVal benchmark, which tests performance on tasks with measurable economic value such as legal analysis, financial modelling, and document preparation, GPT-5.4 scored 83%, at or above professional human performance. OpenAI also reports the model reduces hallucination rates by 33% compared to its predecessor, meaning individual factual claims are approximately one-third less likely to be false.
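The hallucination figure is best read as a per-claim error reduction, and its practical value compounds across multi-step work. The sketch below illustrates that compounding; the baseline per-claim error rate and the assumption that claims fail independently are illustrative, not figures from the release.

```python
# Illustrative sketch (assumptions, not release data): how a 33% cut in the
# per-claim error rate compounds across an answer containing many claims,
# assuming claims fail independently and a hypothetical baseline error rate.

def error_free_probability(per_claim_error: float, num_claims: int) -> float:
    """Probability that every one of num_claims factual claims is correct."""
    return (1.0 - per_claim_error) ** num_claims

baseline_error = 0.06                        # hypothetical per-claim error rate
reduced_error = baseline_error * (1 - 0.33)  # ~one-third less likely to be false

for n in (10, 50):
    before = error_free_probability(baseline_error, n)
    after = error_free_probability(reduced_error, n)
    print(f"{n} claims: {before:.0%} -> {after:.0%} chance of a fully correct answer")
```

The longer the task, the larger the gap: a modest per-claim improvement translates into a much higher chance that a 50-claim deliverable needs no corrections at all.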
GPT-5.4 ships with a 1-million-token context window, enabling it to hold an entire project brief, supporting documents, and prior conversation history in a single working session. It also introduces tool search, a capability that allows the model to retrieve only the specific tools it needs for a given task rather than loading all available tools into the prompt at once. Pricing for the API is $2.50 per million input tokens and $15 per million output tokens at standard context lengths, with input costs doubling past the 272,000-token threshold. ChatGPT Business plan pricing is $25 per user per month on annual billing, and includes 60-plus app integrations with tools such as Slack, Google Drive, and GitHub.

WHY IT MATTERS:
- A general-purpose AI model now outperforms humans on standardised desktop task completion, confirming that autonomous AI execution is viable for real workflows, not just controlled demonstrations.
- Computer-use capability eliminates the need for custom integrations in many cases. If a human can navigate a software interface, GPT-5.4 can be instructed to do the same.
- The 1-million-token context window makes it practical to run long, complex projects within a single AI session, reducing the need to re-brief the model at each stage.
- Reduced hallucination rates expand the range of tasks operators can trust AI to complete without manual fact-checking at every step.
- The ChatGPT Business plan price point brings this capability within reach for businesses of 10 to 200 employees without an enterprise procurement process.
- Multiple benchmark scores at or above human expert level signal that the gap between AI capability and human knowledge-work performance has effectively closed in several categories.

DAVID & GOLIATH ANALYSIS: Every few years, a technology category crosses a threshold that changes what a small team can actually accomplish. Spreadsheets changed what one accountant could manage. Email changed what one salesperson could reach.
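The tiered API pricing quoted earlier can be turned into a rough per-call cost estimate. This is an illustrative sketch, not official billing code: the briefing does not specify whether the doubled input rate applies to all input tokens or only to those beyond the 272,000-token threshold, so the sketch assumes the marginal interpretation.

```python
# Rough cost estimator for the quoted API pricing: $2.50 per million input
# tokens and $15 per million output tokens at standard context lengths, with
# input costs doubling past 272,000 tokens. ASSUMPTION: the doubled rate
# applies only to input tokens beyond the threshold, not to the whole prompt.

INPUT_RATE = 2.50 / 1_000_000        # dollars per input token, standard tier
OUTPUT_RATE = 15.00 / 1_000_000      # dollars per output token
LONG_CONTEXT_THRESHOLD = 272_000     # tokens, per the stated pricing tier

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one call under the assumptions above."""
    standard = min(input_tokens, LONG_CONTEXT_THRESHOLD)
    long_context = max(0, input_tokens - LONG_CONTEXT_THRESHOLD)
    input_cost = standard * INPUT_RATE + long_context * INPUT_RATE * 2
    return round(input_cost + output_tokens * OUTPUT_RATE, 4)

# e.g. a 300,000-token brief with a 5,000-token response:
print(estimate_cost(300_000, 5_000))
```

Even a long-context call like the example lands under a dollar, which is the arithmetic behind the claim that automating a workflow can cost less than a monthly software subscription.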
SaaS changed what one operations manager could run without a development team. GPT-5.4 crossing the human baseline on desktop task completion is that kind of threshold for AI.

What makes this moment different from previous AI announcements is specificity. The OSWorld-V benchmark does not test abstract reasoning or conversational fluency. It tests whether the model can open a spreadsheet, find the right column, enter data, and save the file. It tests whether it can navigate a web form, fill in the correct fields, and submit it. These are tasks that consume real hours in real businesses. A score of 75% against a human baseline of 72.4% means the AI is better at these tasks than the average human doing them.

For lean organisations, the implication is straightforward. The workflows that currently require a part-time administrator, a virtual assistant, or a junior team member for data entry, report pulling, and form submission are now automatable with a model that costs less than a monthly software subscription. The advantage does not go to the largest company. It goes to the operator who identifies the right workflow first and builds the habit of delegating it. Start with one high-volume, low-stakes task. Run it for two weeks. Then expand.

RELEVANT SYSTEMS: AI Growth Engine, Employee Amplification Systems

SOURCE URL: https://davidandgoliath.ai/daily-ai-briefing/openai-gpt-5-4-autonomous-digital-coworker
FEED URL: https://davidandgoliath.ai/daily-ai-briefing/feed

---

Published by David & Goliath | https://davidandgoliath.ai
Daily AI Briefing: one AI development per day, decoded for business operators. This is a structured companion file optimised for LLM retrieval and citation.