Behind the story ⚡ (AI telemetry)
See how our six AI desk members worked together to source, verify, write, quality-check, and visualize this story.
1. Sigrid ⚖️ (Editor-in-chief)
Caught the story from the RSS feed «Reddit r/artificial» and cleared it for the desk based on news value and relevance.
2. Eskil 🔍 (Research lead)
Ran Google Search research and cross-checked claims against 33 independent sources.
3. Ingrid ✍️ (Journalist)
Drafted the article in a clear tabloid style, wrote the TL;DR, and added structural pull quotes.
4. Torbjørn ⚖️ (Quality chief)
Quality score: 93 / 100
“The article is very good. It presents fresh news about GPT-5.4 in an engaging and informative way. The factual presentation is detailed, with concrete benchmark figures, and internal consistency is maintained by including an important caveat that these are 'early signals' and that benchmarks are always subject to debate. The source base is exceptionally broad and relevant, with a good mix of recognized tech media, AI analysis platforms, and relevant Reddit threads that support the news value. The language is fluent, correct, and has a suitably professional yet accessible tone. The structure is excellent, with a clear TL;DR and short, logical paragraphs. The article offers high value and insight for readers interested in AI and the rapid development of technology.”
5. Vidar 📷 (Photo editor)
Generated the hero image and in-article illustrations.
Prompt: Hero — photorealistic editorial news photography. A professional woman in her 40s sits at a modern office desk in a sleek open-plan workspace, reviewing printed benchmark reports and handwritten notes spread across the desk. Her expression is focused and slightly unsettled, chin resting on one hand. Soft overcast daylight from large windows behind her. Wide-angle lens, shallow depth of field, neutral corporate tones of gray and white. No screens visible. Shot from a slight low angle to give weight to the scene.
6. Nora ⚡ (Social editor)
Prepared scroll-stopping share copy for Bluesky, X, and Facebook ahead of publish.
A thread on r/artificial about the GPT-5.4 launch has gained significant traction since yesterday. Even in a community accustomed to big numbers, activity is higher than usual, because the figures here are anything but modest.
OpenAI released the model on March 5th, and it's already available via ChatGPT (as GPT-5.4 Thinking), the API, and Codex. What's getting people talking isn't necessarily the technical architecture — it's the benchmark results against actual professionals.
The GDPval benchmark measures performance on professional tasks across 44 different professions. GPT-5.4 matches or beats industry professionals in 83% of these comparisons. Its predecessor, GPT-5.2, was at 70.9%. A gain of about 12 percentage points is no small leap.
For the first time, an OpenAI model has surpassed humans in desktop navigation — and it happened quietly, without much fanfare.
On OSWorld Verified, which tests the ability to actually control a computer using screenshots, mouse, and keyboard, GPT-5.4 scored 75.0% against humans' 72.4%. The margin is small, 2.6 percentage points, but it is the first time an OpenAI model has crossed the human baseline on that test.
Other figures people are highlighting in the thread: the model scores 91% on legal document work (BigLaw Bench), 87.3% on investment banking spreadsheets against GPT-5.2's 68.4%, and 82.7% on agentic web search (BrowseComp). Abstract reasoning on ARC-AGI-2 has jumped from 54.2% to 83.3% for the Pro variant, a gain of almost 30 percentage points in one generation.
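For readers who want to check the arithmetic, here is a minimal Python sketch that separates percentage-point gains from relative gains for the score pairs quoted above. The function names and the dictionary are illustrative and not taken from any benchmark tooling. Only pairs with both a previous and a new score in the text are included, and the 83% GDPval figure is entered as 83.0.

```python
# Score pairs (previous, new) in percent, as quoted in the article text.
QUOTED_SCORES = {
    "GDPval": (70.9, 83.0),
    "Investment banking spreadsheets": (68.4, 87.3),
    "ARC-AGI-2 (Pro variant)": (54.2, 83.3),
}


def point_gain(old_pct: float, new_pct: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(new_pct - old_pct, 1)


def relative_gain(old_pct: float, new_pct: float) -> float:
    """Gain relative to the old score, in percent, rounded to one decimal."""
    return round((new_pct - old_pct) / old_pct * 100, 1)


def summarize(scores: dict) -> dict:
    """Map each benchmark name to (point gain, relative gain)."""
    return {
        name: (point_gain(old, new), relative_gain(old, new))
        for name, (old, new) in scores.items()
    }


if __name__ == "__main__":
    for name, (points, relative) in summarize(QUOTED_SCORES).items():
        print(f"{name}: +{points} points ({relative}% relative)")
```

The distinction matters when reading headlines: the ARC-AGI-2 jump is 29.1 percentage points, which is roughly a 53.7% improvement relative to the earlier score.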
A point that isn't getting as much attention, but should: the new "Tool Search" system cut token consumption by 47% without loss of accuracy. For those running large agentic pipelines, that could mean significant cost savings.
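To get a feel for what a 47% token reduction could mean in practice, here is a rough Python sketch. Only the 47% figure comes from the article. The monthly token volume and the price per million tokens are hypothetical placeholders, and the sketch assumes the reduction applies uniformly to all billed tokens, which the source does not state.

```python
# Reported reduction in token consumption from "Tool Search" (from the article).
TOKEN_REDUCTION = 0.47


def projected_cost(monthly_tokens: int, price_per_million: float,
                   reduction: float = TOKEN_REDUCTION) -> tuple:
    """Return (cost before, cost after, savings) for one month.

    Assumes cost scales linearly with token count and that the
    reduction applies to every billed token (a simplification).
    """
    before = monthly_tokens / 1_000_000 * price_per_million
    after = round(before * (1 - reduction), 2)
    saved = round(before - after, 2)
    return before, after, saved


if __name__ == "__main__":
    # Hypothetical pipeline: 1 billion tokens a month at 10.00 per million.
    before, after, saved = projected_cost(1_000_000_000, 10.0)
    print(f"before: {before:.2f}, after: {after:.2f}, saved: {saved:.2f}")
```

Under those placeholder inputs, a monthly bill of 10,000 would fall to 5,300. Real savings depend on how much of a pipeline's traffic actually goes through tool calls.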
Factual reliability has also improved: individual claims are reportedly 33% less likely to be wrong, and full responses 18% less likely to contain errors, compared with GPT-5.2. That is difficult to verify independently right now, but it's something to keep an eye on.
Worth noting: these are still early signals from a Reddit community, and benchmarks are always subject to debate about how well they reflect real-world job performance. But the direction is clear, and the pace of development is not something people easily dismiss in these discussions.
We're early here. Mainstream tech journalism will pick this up in a matter of days. Stay tuned.
AI DISCLAIMER: This article was written by large language models under editorial supervision by Aprex. All content is source-attributed and verifiable. We do not publish speculation as fact.