A thread on r/artificial about the GPT-5.4 launch has been gaining significant traction since yesterday, and even in a community accustomed to big numbers, activity is higher than usual. The figures here are anything but modest.
OpenAI released the model on March 5th, and it's already available via ChatGPT (as GPT-5.4 Thinking), the API, and Codex. What's getting people talking isn't the technical architecture so much as the benchmark results against actual professionals.
The GDPval benchmark measures performance on real-world professional tasks across 44 occupations. GPT-5.4 matches or beats industry professionals in 83% of these comparisons; its predecessor, GPT-5.2, managed 70.9%. That's no small leap.
On OSWorld Verified, which tests the ability to actually control a computer using screenshots, mouse, and keyboard, GPT-5.4 scored 75.0% against humans' 72.4%. It's a small margin, but it's above the human baseline, and it's the first time an OpenAI model has crossed that threshold on this test.
Other figures people are highlighting in the thread: the model scores 91% on legal document work (BigLaw Bench), 87.3% on investment banking spreadsheets (up from GPT-5.2's 68.4%), and 82.7% on agentic web search (BrowseComp). Abstract reasoning on ARC-AGI-2 has jumped from 54.2% to 83.3% for the Pro variant, a gain of almost 30 percentage points in one generation.
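To put those jumps in context, the generation-over-generation deltas are straightforward to check. The snippet below contains no new data, just the scores quoted in the thread restated in one place:

```python
# Generation-over-generation deltas for the scores quoted above
# (GPT-5.2 -> GPT-5.4, all values in percent). Pure arithmetic, no new data.
scores = {
    "GDPval (wins/ties vs professionals)": (70.9, 83.0),
    "Investment banking spreadsheets": (68.4, 87.3),
    "ARC-AGI-2 (Pro variant)": (54.2, 83.3),
}

for name, (old, new) in scores.items():
    # "pp" = percentage points, the absolute difference between two percentages
    print(f"{name}: {old} -> {new} (+{new - old:.1f} pp)")
```

Note the unit: these are percentage-point differences, not relative percentage gains — the ARC-AGI-2 score rose 29.1 points, which is a 54% relative improvement.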
A point that isn't getting as much attention, but should: the new "Tool Search" system cut token consumption by 47% with no loss of accuracy. For those running large agentic pipelines, that could translate into significant cost savings.
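A rough sketch of what a 47% token reduction could mean for such a pipeline's monthly bill. The volume and per-token price below are hypothetical placeholders, not published GPT-5.4 pricing; only the 47% figure comes from the post:

```python
# Hedged sketch: cost impact of a 47% cut in token consumption.
# baseline_tokens and price are made-up assumptions for illustration.
def monthly_cost(tokens: int, usd_per_million: float) -> float:
    """Cost of a month's token usage at a flat per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million

baseline_tokens = 2_000_000_000             # assumption: 2B tokens/month
price = 10.0                                # assumption: $10 per 1M tokens

saved_tokens = baseline_tokens * 47 // 100  # the 47% reduction from the post
before = monthly_cost(baseline_tokens, price)
after = monthly_cost(baseline_tokens - saved_tokens, price)

print(f"before: ${before:,.0f}/mo  after: ${after:,.0f}/mo  saved: ${before - after:,.0f}/mo")
```

Because the reduction is multiplicative, the dollar savings scale linearly with whatever volume and price you plug in.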
Factual reliability has also improved: individual claims are reportedly erroneous 33% less often, and entire responses contain errors 18% less often, than with GPT-5.2. This is difficult to verify independently right now, but it's something to keep an eye on.
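Those figures are relative reductions, which are easy to misread as absolute error rates. A minimal sketch of the distinction, assuming a made-up 10% baseline claim-error rate (only the 33% relative reduction comes from the reports):

```python
# What a 33% *relative* reduction in claim-level errors means in practice.
# The 10% baseline error rate is a hypothetical example, not a reported figure.
baseline_error_rate = 0.10      # assumption: 1 in 10 claims wrong before
relative_reduction = 0.33       # the reported improvement over GPT-5.2

new_error_rate = baseline_error_rate * (1 - relative_reduction)
print(f"{new_error_rate:.3f}")  # prints 0.067, i.e. roughly 1 claim in 15
```

In other words, the claims get it wrong a third less often than before, whatever the baseline happens to be; the reports don't say what that baseline is.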
Worth noting: these are still early signals from a Reddit community, and benchmarks are always open to debate about how well they reflect real-world job performance. But the direction is clear, and in these discussions the pace of development is not easily dismissed.
We're early here. Mainstream tech journalism will pick this up in a matter of days. Stay tuned.
