NVIDIA claims lowest cost per token – here's the competition

NVIDIA highlights its software stack as the key to the lowest cost per token in AI inference. But figures from AWS, Google, and AMD paint a more complex picture.

Automatically translated from the Norwegian original by 24AI.

24AI Automated Desk

July 3, 2026·4 min read

TL;DR

NVIDIA markets its integrated software and hardware stack as the solution for the lowest cost per token in production
AI inference is expected to account for 70–80 percent of total AI compute demand by 2035
Competitors such as Google TPU, AWS Inferentia, and AMD MI300X show documented cost advantages in a range of scenarios
The AI inference market is projected to grow from $106 billion in 2025 to $255 billion in 2030

❖ QUALITY STATUS

Published:	July 3, 2026
Category:	Tools
Sources:	10 source references
Production:	AI-generated
Automatic review:	90/100
Human review:	No, not standard

The race to deliver the most AI responses per dollar invested is fast becoming the defining competition in the technology industry. NVIDIA has recently put its integrated software and hardware stack forward as the most cost-effective solution for large-scale AI inference — but challengers are closer than the company would care to admit.

NVIDIA bets on holistic software design

According to the NVIDIA blog, the company has built its inference software in tight integration with its own GPUs, CPUs, networking components, and servers. The idea is that this co-design — combined with a broad open-source ecosystem — gives organizations the lowest cost per token when scaling from AI pilots to full-scale production.

NVIDIA argues that infrastructure decisions in 2026 are no longer about peak performance on paper, but about concrete metrics: how many useful tokens can be delivered per dollar, per watt, and within acceptable response times.
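As a rough illustration of the metrics described here, the sketch below turns an hourly hardware price, a sustained throughput, and a power draw into cost per million tokens and tokens per watt-hour. The `Deployment` type and every input number are hypothetical placeholders, not vendor figures.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical inference deployment; every field is an assumed input."""
    price_per_hour_usd: float   # hourly cost of one accelerator or instance
    tokens_per_second: float    # sustained throughput at the required latency
    power_watts: float          # average power draw under load


def cost_per_million_tokens(d: Deployment) -> float:
    """Dollars spent to generate one million tokens."""
    tokens_per_hour = d.tokens_per_second * 3600
    return d.price_per_hour_usd / tokens_per_hour * 1_000_000


def tokens_per_watt_hour(d: Deployment) -> float:
    """Tokens generated per watt-hour of energy consumed."""
    tokens_per_hour = d.tokens_per_second * 3600
    return tokens_per_hour / d.power_watts


# Placeholder numbers chosen only to make the arithmetic easy to follow.
example = Deployment(price_per_hour_usd=4.0, tokens_per_second=2000.0, power_watts=700.0)
print(f"${cost_per_million_tokens(example):.3f} per million tokens")
print(f"{tokens_per_watt_hour(example):,.0f} tokens per watt-hour")
```

The throughput value only means something if it is measured at the latency the application actually requires, which is why the three metrics are usually quoted together.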

Infrastructure decisions have shifted from peak specifications to cost per token — per dollar, per watt, and within required latency constraints

This message resonates with a rapidly growing market. According to industry estimates, AI inference will account for between 70 and 80 percent of total AI compute demand by 2035, and could represent 80–90 percent of a production system's total lifetime costs.

Competitors have concrete numbers to show

Although NVIDIA still dominates the market, the leading alternatives present documented cost advantages in specific use cases.

65%: cost reduction Midjourney achieved by switching from NVIDIA to Google TPU v6e
70%: estimated cost-per-token reduction when upgrading from TPU v6 to TPU v7

Google TPU: the largest documented savings

Image platform Midjourney reportedly cut its monthly inference costs from $2 million to $700,000 after migrating to Google's TPU v6e, a decline of 65 percent. Throughput for generative tasks is said to have tripled at the same time. Google states that TPU v6e delivers around 30 percent lower cost per token than the H100 for large batches under stable operating conditions.
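The reported Midjourney figures can be checked with simple arithmetic. The sketch below recomputes the 65 percent decline and then, under the explicit assumption that the lower bill and the tripled throughput describe the same workload, estimates what the two figures together would imply for cost per unit of output. The variable and function names are illustrative.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decline from `before` to `after`, in percent."""
    return (before - after) / before * 100


monthly_before_usd = 2_000_000   # reported monthly inference bill before migrating
monthly_after_usd = 700_000      # reported monthly bill on TPU v6e
throughput_gain = 3.0            # "tripled" throughput, as reported

bill_cut = percent_reduction(monthly_before_usd, monthly_after_usd)

# Assumption: the bill and the throughput figure describe the same workload,
# so cost per unit of output scales as (bill ratio) / (throughput ratio).
unit_cost_ratio = (monthly_after_usd / monthly_before_usd) / throughput_gain
unit_cost_cut = (1 - unit_cost_ratio) * 100

print(f"Monthly bill: {bill_cut:.0f}% lower")
print(f"Cost per unit of output: {unit_cost_cut:.0f}% lower (if the assumption holds)")
```

If the throughput figure was measured per chip rather than for the whole deployment, the second number does not apply.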

AWS Inferentia: specialized and affordable

AWS's Inferentia2 chip is designed specifically for inference workloads. According to available documentation, Llama 70B deployments can cost $9,348 per month on Inferentia2, compared with $23,595 on equivalent GPU instances, a saving of about 60 percent. Companies such as Actuate and Finch Computing report 91 and 80 percent lower inference costs, respectively, after optimization with the AWS Neuron SDK.
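The two monthly prices quoted for Llama 70B work out as follows. This is a minimal sketch using only the figures in the text. The annualized number assumes prices and usage stay flat for twelve months, which is an assumption made here for illustration.

```python
inferentia2_monthly_usd = 9_348   # quoted Llama 70B cost on Inferentia2
gpu_monthly_usd = 23_595          # quoted cost on equivalent GPU instances

monthly_saving = gpu_monthly_usd - inferentia2_monthly_usd
saving_pct = monthly_saving / gpu_monthly_usd * 100

# Assumption: flat pricing and constant usage over a full year.
annual_saving = monthly_saving * 12

print(f"Saving: ${monthly_saving:,} per month ({saving_pct:.1f}%)")
print(f"Annualized: ${annual_saving:,}")
```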

AMD MI300X: memory capacity as an advantage

AMD's MI300X stands out with 192 GB of HBM3 memory on a single card, more than double the 80 GB of NVIDIA's H100 SXM. For inference with large language models and long context windows, where memory is the limiting factor, this can give AMD a genuine competitive edge.
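To see why capacity matters, the sketch below estimates the memory needed for model weights alone and checks it against each card. It assumes 2 bytes per parameter for fp16 and 1 byte for fp8, uses the 80 GB capacity of the H100 SXM, and deliberately ignores the KV cache and activations, which need additional memory on top. The 70-billion-parameter model is a hypothetical example.

```python
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}   # storage size of one weight


def weight_memory_gb(params_billions: float, dtype: str = "fp16") -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    # (params_billions * 1e9 params) * bytes / 1e9 bytes-per-GB simplifies to this.
    return params_billions * BYTES_PER_PARAM[dtype]


def weights_fit(params_billions: float, card_gb: float, dtype: str = "fp16") -> bool:
    """True if the weights alone fit in one card's memory."""
    return weight_memory_gb(params_billions, dtype) <= card_gb


cards_gb = {"MI300X": 192, "H100 SXM": 80}   # per-card memory capacity
for name, capacity in cards_gb.items():
    verdict = "fits" if weights_fit(70, capacity) else "does not fit"
    print(f"70B model in fp16 ({weight_memory_gb(70):.0f} GB) {verdict} on one {name}")
```

A model that does not fit on one card has to be split across several, which adds hardware and interconnect cost to every token served.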

Intel Gaudi 3: half the price, but lower raw performance

Intel's Gaudi 3 costs roughly half as much as an H100 card. Each chip has 128 GB of HBM2e memory but is generally slower than the H100 and H200 in raw performance. Intel nonetheless argues that the price-to-performance ratio is competitive, particularly in scenarios with short inputs and long outputs.

What does this mean for those choosing infrastructure?

Many of the competitors' figures come from their own benchmarks, selected use cases, or customer stories with optimal configurations. Independent, like-for-like comparison of cost per token across platforms is difficult because results vary with model size, batch size, latency requirements, and workload type.

NVIDIA's strength still lies in breadth: a mature software ecosystem, broad model support, and an established developer base make the platform a low-risk choice for most organizations. But as inference accounts for an ever-larger share of AI budgets, specialized alternatives will be evaluated more seriously.

Cost per token is the new benchmark — and no single vendor wins on every front

The AI inference market is evolving rapidly, and nothing suggests that NVIDIA's dominance will go unchallenged. Organizations now scaling AI in production have good reason to assess the full cost picture, not just which chip delivers the most FLOPS on paper.

AI AND QUALITY STATUS

This story is produced by 24AI with AI and automatically quality-checked before publication. Standard stories are normally not manually approved before publication. 24AI is not an editor-led journalistic medium. Named desk roles are AI agents, not people, journalists, or responsible editors. Sources are shown below, and errors can be reported to post@aprex.no.

Sources (10)