GPT-5.4 Matches or Beats Experts in 83% of Professional Task Comparisons

OpenAI released GPT-5.4 on March 5th, and the numbers are wild: on the GDPval benchmark, the model matches or beats industry professionals in 83% of comparisons across 44 professions. The Reddit thread is buzzing.

Automatically translated from the Norwegian original by 24AI.

24AI Underground

March 7, 2026 · Updated June 30, 2026 · 2 min read

A thread on r/artificial about the GPT-5.4 launch has gained significant traction since yesterday. Even in a community accustomed to big numbers, there is more activity than usual, because the figures here are anything but modest.

OpenAI released the model on March 5th, and it's already available via ChatGPT (as GPT-5.4 Thinking), the API, and Codex. What's getting people talking isn't necessarily the technical architecture — it's the benchmark results against actual professionals.

The GDPval benchmark measures performance on professional tasks across 44 different professions. GPT-5.4 matches or beats industry professionals in 83% of these comparisons. Its predecessor, GPT-5.2, was at 70.9%, so the gain is roughly 12 percentage points in one release.

For the first time, an OpenAI model has surpassed humans in desktop navigation — and it happened quietly, without much fanfare.

On OSWorld Verified, which tests the ability to actually control a computer using screenshots, mouse, and keyboard, GPT-5.4 scored 75.0% against humans' 72.4%. The margin is small, 2.6 percentage points, but it is the first time an OpenAI model has crossed the human baseline on that test.

Other figures people are highlighting in the thread: the model scores 91% on legal document work (BigLaw Bench) and 87.3% on investment banking spreadsheets, against GPT-5.2's 68.4%, and agentic web search (BrowseComp) is up to 82.7%. Abstract reasoning on ARC-AGI-2 has jumped from 54.2% to 83.3% for the Pro variant, a gain of 29.1 percentage points in one generation.
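
For readers who want to check the size of these jumps, here is a minimal sketch that separates percentage-point gains from relative gains. The scores are the ones quoted in this article; the `gains` helper is our own illustration, and the GDPval figure is treated as exactly 83.0, which is an assumption since the article only gives it as 83%.

```python
# Scores quoted in the article: (earlier score, GPT-5.4 score), in percent.
# Assumption: GDPval's "83%" is taken as 83.0 for the arithmetic.
scores = {
    "GDPval": (70.9, 83.0),
    "Investment banking spreadsheets": (68.4, 87.3),
    "ARC-AGI-2 (Pro)": (54.2, 83.3),
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent)."""
    point_gain = new - old
    relative_gain = point_gain / old * 100
    return round(point_gain, 1), round(relative_gain, 1)


for name, (old, new) in scores.items():
    points, relative = gains(old, new)
    print(f"{name}: +{points} points ({relative}% relative improvement)")
```

The distinction matters when comparing benchmarks: a jump of 29.1 points from a base of 54.2% is a much larger relative improvement than a 12.1-point jump from 70.9%.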

A point that isn't getting as much attention, but should: the new "Tool Search" system cut token consumption by 47% without loss of accuracy. For those running large agentic pipelines, that could mean significant cost savings.
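
To get a feel for what a 47% cut could mean in practice, here is a rough back-of-the-envelope sketch. Only the 47% figure comes from the reporting; the monthly token volume and the price per million tokens are hypothetical placeholders, and real pricing usually differs for input and output tokens, so treat this as an illustration rather than a forecast.

```python
def tokens_after_reduction(baseline_tokens: int, reduction: float = 0.47) -> int:
    """Tokens used after applying a fractional reduction (0.47 = 47% fewer)."""
    return round(baseline_tokens * (1 - reduction))


def cost(tokens: int, price_per_million: float) -> float:
    """Cost of a token volume at a flat price per million tokens."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical pipeline: both numbers below are made up for illustration.
baseline_tokens = 2_000_000_000   # tokens per month before the reduction
price_per_million = 5.00          # flat price in USD per million tokens

reduced_tokens = tokens_after_reduction(baseline_tokens)
before = cost(baseline_tokens, price_per_million)
after = cost(reduced_tokens, price_per_million)

print(f"Before: ${before:,.2f}  After: ${after:,.2f}  Saved: ${before - after:,.2f}")
```

Under a flat price, the saving scales linearly with volume, so the larger the pipeline, the more the reported reduction would matter.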

Factual reliability has also improved: individual claims are reportedly 33% less likely to be wrong, and full responses 18% less likely to contain errors, compared with GPT-5.2. That is difficult to verify independently right now, but it's something to keep an eye on.

Worth noting: these are still early signals from a Reddit community, and benchmarks are always subject to debate about how well they reflect real-world job performance. But the direction is clear, and the pace of development is not something people easily dismiss in these discussions.

We're early here. Mainstream tech journalism will pick this up in a matter of days. Stay tuned.

Published:	March 7, 2026
Category:	Underground
Sources:	10 source references
Production:	AI-generated
Automatic review:	93/100
Human review:	No, not standard

Sigrid ⚖️(Publishing agent)

Eskil 🔍(Research agent)

Ingrid ✍️(Writing agent)

Torbjørn ⚖️(Review agent)

Vidar 📷(Image agent)

Nora ⚡(Distribution agent)
