OpenAI this week announced GPT-5.4, the latest iteration in the GPT-5 series. According to the company, this is the most powerful and efficient model it has launched for professional use, with particular emphasis on coding, tool use, and so-called “computer use”: the ability to operate a computer autonomously.

Can take control of your entire PC

The most notable feature of GPT-5.4 is its built-in computer control. The model can take screenshots, use the mouse and keyboard, and navigate applications and websites — all without requiring a separate specialized model for the task, according to OpenAI's own descriptions.

This makes GPT-5.4 a strong candidate for developing autonomous agents that can perform complex work tasks over time, without human intervention for each individual step.
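To make this concrete, below is a minimal sketch in Python of what such an agent loop could look like. It assumes GPT-5.4 exposes computer control through OpenAI's Responses API in the same shape as the company's existing computer-use tooling; the model id "gpt-5.4", the tool parameters, and the capture_screenshot and execute_action helpers are illustrative assumptions, not confirmed details.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Assumed tool definition, mirroring the shape of OpenAI's existing
    # computer-use tool. The parameter values are illustrative.
    COMPUTER_TOOL = {
        "type": "computer_use_preview",
        "display_width": 1280,
        "display_height": 800,
        "environment": "browser",
    }

    def capture_screenshot() -> str:
        """Hypothetical helper: return the current screen as base64-encoded PNG."""
        raise NotImplementedError

    def execute_action(action) -> None:
        """Hypothetical helper: perform the click/type/scroll action locally."""
        raise NotImplementedError

    response = client.responses.create(
        model="gpt-5.4",  # assumed model id
        tools=[COMPUTER_TOOL],
        input=[{"role": "user", "content": "Open the quarterly report and export it as PDF."}],
        truncation="auto",  # required by OpenAI's current computer-use tool
    )

    # The loop that makes the agent autonomous: execute each action the model
    # proposes, send back a fresh screenshot, repeat until no action is requested.
    # A real implementation must also handle any pending safety checks the API
    # attaches to an action before executing it.
    while True:
        calls = [item for item in response.output if item.type == "computer_call"]
        if not calls:
            break  # no further actions requested; the task is done or needs a human
        call = calls[0]
        execute_action(call.action)
        response = client.responses.create(
            model="gpt-5.4",
            previous_response_id=response.id,
            tools=[COMPUTER_TOOL],
            input=[{
                "type": "computer_call_output",
                "call_id": call.call_id,
                "output": {
                    "type": "computer_screenshot",
                    "image_url": f"data:image/png;base64,{capture_screenshot()}",
                },
            }],
            truncation="auto",
        )

The key design point is the loop itself: the model never touches the machine directly. It proposes an action, local code executes it, and a fresh screenshot is returned as evidence of the result before the next step is decided.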

The model is available to subscribers of ChatGPT Plus, Team, and Pro, as well as through the Codex platform and OpenAI's developer API.


What do the benchmarks say?

It is worth noting that many of the available benchmark figures stem from GPT-5.2, and that independent comparisons of GPT-5.4 remain limited. OpenAI has not yet published a complete benchmark set for the new model.

What we know from the GPT-5.2 measurements still gives a sense of the level: on the AIME 2025 mathematics benchmark, GPT-5.2 reached 100 percent without external tools, and on the SWE-bench Verified coding benchmark, the Codex variant scored 80.0 percent, according to available research data.


Claude and Gemini are holding their ground

Competitors do not seem to be standing still. Anthropic's Claude Opus 4.6 scores 80.8 percent on SWE-bench Verified — marginally above GPT-5.2 — and has shown strong results on terminal-based coding tasks with 65.4 percent on Terminal-bench 2.0. According to available comparison data, many developers highlight that Claude is better at interpreting vague instructions and sticking to the plan on long agent tasks.

Google's Gemini 3.1 Pro impresses particularly on abstract reasoning, with 77.1 percent on ARC-AGI-2 — markedly higher than Claude Opus 4.6 (68.8 percent) and GPT-5.2 (52.9 percent). On PhD-level scientific reasoning (GPQA Diamond), Gemini 3.1 Pro scores 94.3 percent, against Claude's 87 percent.


Three distinct profiles for three different needs

Based on available data, a picture emerges of three models with different strengths:

GPT-5.4

Aimed at professional workflows with built-in computer control and strong integration with OpenAI's own tool ecosystem. Suitable for companies looking to build autonomous agents.

Claude Opus 4.6

Excels in complex coding, long-term tasks, and situations where the model must interpret unclear instructions. Preferred by many in developer communities for agent-based work.

Gemini 3.1 Pro

Strongest in multimodal tasks — text, image, audio, and video — as well as abstract and scientific reasoning. Also has the largest context window among the three, with two million tokens on the roadmap.

A critical look at the sources

It is important to emphasize that the figures in this article come from a combination of OpenAI's own communication and compiled research data, and that different benchmarks have been run on different model versions. GPT-5.4 is so new that direct comparative data against Claude Opus 4.6 and Gemini 3.1 Pro across identical tests is not yet available from independent parties. Benchmark figures published by the AI companies themselves should be read with caution.