Hacker News is buzzing right now. The HN thread about the Claude Opus 4.8 launch has passed 870 comments and 1,000 points within hours, the kind of engagement you see when something truly strikes a chord in the community.

So what's happening? Anthropic launched Opus 4.8 today, and the company is not particularly modest in its claims. According to Anthropic's own data, the model beats GPT-5.5 on most of the benchmarks that matter in practice: knowledge work, issue-level coding, agentic tool use, and long-context tasks. GPT-5.5 still holds its ground in terminal and CLI workflows, but otherwise this round looks tough for OpenAI.

What's really getting people talking isn't just the raw numbers. SWE-bench Verified at 88.6% is solid, but it's SWE-bench Pro that impresses: up from 64.3% to 69.2%, a gain of 4.9 percentage points. That's the tougher version of the test, and a jump there is meaningful. Databricks reports that Opus 4.8 provides "a quantum leap in agentic reasoning" within their Genie data agent, suggesting this isn't just benchmark gaming.

Anthropic states that the model is four times less likely to let code errors pass unnoticed, the kind of reliability improvement that truly matters in production.

Pricing is moving too. The base price is unchanged from Opus 4.7 ($5 per million input tokens, $25 per million output tokens), but the new Fast mode, at $10 per million input tokens and $50 per million output tokens, offers 2.5x speed and is three times cheaper than the equivalent fast mode in the previous generation. The context window is one million tokens with a maximum output of 128K tokens, which is generous.
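To make those per-million-token prices concrete, here is a minimal Python sketch that estimates the cost of a single request in each mode. The prices are the ones quoted above; the function name `estimate_cost`, the mode labels, and the example token counts are illustrative assumptions, not part of any Anthropic API.

```python
# Per-million-token prices in USD, as quoted in the article.
# The mode labels "standard" and "fast" are illustrative names only.
PRICES_PER_MILLION = {
    "standard": {"input": 5.00, "output": 25.00},
    "fast": {"input": 10.00, "output": 50.00},
}


def estimate_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Return the estimated cost in USD for one request (hypothetical helper)."""
    prices = PRICES_PER_MILLION[mode]
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Assumed example workload: 200K input tokens, 20K output tokens.
    for mode in PRICES_PER_MILLION:
        cost = estimate_cost(200_000, 20_000, mode)
        print(f"{mode}: ${cost:.2f}")
```

Under these assumptions, Fast mode costs exactly twice as much as the base mode for the same request, so the trade-off is paying 2x for the claimed 2.5x speed.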

[Image 1: Anthropic releases Opus 4.8, beats GPT-5.5 on 12 benchmarks]

The HN discussion is, as expected, divided. Some are enthusiastic about the reliability improvements and point out that Anthropic compares Opus 4.8 with its best-aligned model (Claude Mythos Preview) on rates of misaligned behavior. Others are more skeptical of Anthropic's own benchmarks and are waiting for independent testing.

Worth noting: these are early signals based on community discussions and Anthropic's own release notes. Independent, systematic evaluations take time, and history shows that official benchmark figures don't always hold up in practice.

Nevertheless, with the buzz this thread is generating and the concrete technical details already circulating, this is definitely something to follow closely in the coming days.