A discussion currently spreading on Product Hunt concerns something most Norwegian tech professionals have barely noticed: xAI has quietly shipped a complete voice API stack, and it appears to beat both OpenAI and Google on the two things that matter most, latency and price.

Let's start from the beginning. The Grok Voice Agent API was released in December 2025, but it is only now, in April 2026, that it is gaining traction in community channels. The likely reason: the TTS and STT APIs launched as recently as March 16, and people are only now starting to build with the full stack combined.

What makes this interesting is the architecture. Instead of the classic STT → LLM → TTS pipeline, the Grok Voice Agent API processes audio directly. That sounds like marketing, but it scores 92.3% on the Big Bench Audio benchmark, ahead of both Gemini 2.5 Flash Native Audio and GPT Realtime in the reasoning category. That does not happen often.

0.78 seconds to first audio. If this holds up in production, it represents a fundamental shift for voice agents.
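To see why a speech-to-speech architecture can win on time-to-first-audio, it helps to add up the stages of a cascaded pipeline. The stage latencies below are hypothetical placeholders for illustration, not measurements of any real system:

```python
# Illustrative latency budget: why a direct speech-in/speech-out model
# can beat a cascaded STT -> LLM -> TTS pipeline on time-to-first-audio.
# All numbers are assumed placeholders, not benchmarks.

CASCADED_STAGES_MS = {
    "stt_final_transcript": 300,  # endpointing + transcription of the user's turn
    "llm_first_token": 400,       # LLM time-to-first-token on the transcript
    "tts_first_chunk": 250,       # synthesis of the first audio chunk
}

def cascaded_ttfa_ms(stages: dict) -> int:
    """The stages run sequentially, so their latencies add up."""
    return sum(stages.values())

def native_ttfa_ms(model_first_audio_ms: int) -> int:
    """A single audio-native model has only one hop."""
    return model_first_audio_ms

print(cascaded_ttfa_ms(CASCADED_STAGES_MS))  # 950
print(native_ttfa_ms(780))                   # 780, i.e. the reported 0.78 s
```

The point is not the specific numbers but the shape of the math: a cascade pays every stage's latency in series, while a native model pays once.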

The pricing model is also worth noting. $0.05 per minute flat for the Voice Agent API. OpenAI Realtime bills per token, which can quickly add up during long conversations. For those building phone bots or customer support agents — which, incidentally, is exactly what xAI itself uses this for via Starlink and Tesla — the math is quite simple.
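The cost difference is easy to sanity-check with a small calculator. The $0.05/min figure is from the article; the per-token rates below are invented placeholders, since actual token pricing varies by model and is not quoted here:

```python
def flat_rate_cost(minutes: float, usd_per_min: float = 0.05) -> float:
    """Flat per-minute billing, like the $0.05/min quoted for the Voice Agent API."""
    return minutes * usd_per_min

def per_token_cost(minutes: float,
                   audio_tokens_per_min: float,
                   usd_per_million_tokens: float) -> float:
    """Per-token billing. Both rates here are hypothetical placeholders,
    NOT OpenAI's actual prices."""
    tokens = minutes * audio_tokens_per_min
    return tokens / 1_000_000 * usd_per_million_tokens

# A 30-minute support call under each model:
print(flat_rate_cost(30))                 # 1.5 (USD)
print(per_token_cost(30, 600, 100.0))     # 1.8 (USD) under the assumed rates
```

The flat rate also makes costs predictable up front, which matters as much as the absolute number when you are quoting a price per support call.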

The TTS API supports inline speech tags, meaning you can program pauses, whispers, sighs, and laughter directly into the text. This is something ElevenLabs has had for a while, but now it's integrated into the same API as the agent layer itself. The STT features speaker diarization and word-level timestamps, and streams via WebSocket.
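A rough sketch of what working with the two ends of the stack could look like. Everything concrete here, the tag names and the JSON shape of the STT message, is an assumption for illustration; the real tag vocabulary and wire format come from xAI's documentation:

```python
import json

def with_tags(text: str) -> str:
    """Compose TTS input with inline speech tags (hypothetical tag names)."""
    return f"Well... <sigh> {text} <whisper>but keep it quiet</whisper>"

print(with_tags("the launch date moved"))

# A hypothetical STT streaming message with speaker diarization and
# word-level timestamps, as it might arrive over the WebSocket:
stt_message = json.loads("""
{
  "speaker": "spk_1",
  "words": [
    {"word": "hello", "start": 0.12, "end": 0.38},
    {"word": "there", "start": 0.41, "end": 0.69}
  ]
}
""")

# Word-level timestamps let you compute per-speaker talk time directly.
duration = stt_message["words"][-1]["end"] - stt_message["words"][0]["start"]
print(f'{stt_message["speaker"]} spoke for {duration:.2f}s')
```

Diarization plus word timestamps is exactly what you need for call analytics, barge-in handling, and compliance transcripts, so having it in the same API as the agent layer removes a whole integration step.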

Why is this worth following now? Because voice agents are where LLM integration actually reaches end users: not in chatbots, but in phones, cars, and customer service. If Grok Voice holds its latency figures in production and the price stays where it is, many developers building on OpenAI Realtime will start looking elsewhere.

Important caveat: these are early signals based on community discussions and xAI's own benchmarks. Independent large-scale tests are still lacking, and self-reported benchmarks should always be taken with a grain of salt. But the buzz is real, and the numbers will not stand unchallenged for long: the community will put them through their paces in the coming weeks.