AI models all start in the same place: with a training run. The quality and speed of the training infrastructure determines how quickly teams can iterate, what model scale they can handle, and whether jobs complete reliably. With the MLPerf Training 6.0 results now published, it is clear that NVIDIA's Blackwell generation is setting a new industry standard — according to NVIDIA itself and the available benchmark data.
Blackwell Crushes the Competition in MLPerf
MLPerf Training is one of the industry's most widely recognized independent benchmark series for AI training infrastructure. In the latest round, version 6.0, NVIDIA took first place in every category with its Blackwell-based systems, according to NVIDIA's own blog.
The most impressive numbers relate to large language model training. The GB200 NVL72 system — which links 72 Blackwell GPUs together in a rack format — delivered up to 3.2 times faster training on Llama 3.1 405B compared with optimized Hopper solutions (H100) using FP8 precision. According to NVIDIA, the improved performance is largely attributable to the introduction of NVFP4 precision and software optimizations.

GB300 NVL72: Blackwell Ultra Takes It One Step Further
If the GB200 NVL72 is already powerful, the new GB300 NVL72 — dubbed "Blackwell Ultra" — is faster still. According to available benchmark data, the GB300 system delivers up to 1.6 times higher training performance than the GB200 at the same scale. That is a remarkable generational leap within the same architecture family.
The B200 GPU at the heart of the Blackwell lineup is built on a dual-die CoWoS design manufactured on TSMC's 4NP process, featuring 208 billion transistors and 192 GB of HBM3E memory with 8 TB/s of bandwidth. The introduction of native FP4 tensor operations is one of the most significant technical innovations compared with the previous generation.
Real Customers Confirm the Performance
Benchmark figures from NVIDIA naturally warrant a critical eye — the company has obvious commercial interests. It is therefore worth noting that independent users are reporting similar results. Cohere, known for enterprise-focused AI, reportedly achieved three times faster training for its North platform on the GB200 NVL72, according to available sources. Image-generation service Midjourney is said, according to the same source material, to be scaling up a large fleet of Blackwell Ultra GPUs for training upcoming image and video models.
These claims are of course difficult to verify independently, but they suggest that the performance gains are not merely figures on paper.
AMD MI300X: Still Relevant, but Under Pressure
It is important to maintain a nuanced view of the competitive landscape. The AMD Instinct MI300X remains a serious contender, particularly for memory-intensive workloads. With 192 GB of HBM3 memory and 5.3 TB/s of bandwidth, the MI300X is exceptionally well suited to inference of very large models on a single GPU, reducing the need for model sharding and network overhead.
In MLPerf Inference v4.1, the MI300X demonstrated strong performance on Llama 2 70B inference, and AMD has claimed advantages of 20–60 percent over the H100 in certain inference scenarios. For raw large-scale AI training, however, the picture is different: the Blackwell B200 delivers roughly double the raw compute of the H200 across various precision formats.
A key factor is the software stack. AMD's ROCm platform has made significant strides, but is generally considered less mature than NVIDIA's CUDA ecosystem. According to independent analyses, this can result in the MI300X realizing only 37–66 percent of its theoretical capacity in real-world LLM workloads — a substantial limitation that AMD is actively working to reduce.
AMD MI300X is strong on memory-intensive inference, but for pure large-scale AI training, Blackwell is setting a new industry standard that is difficult to compete with today.
What Does This Mean for the AI Training Landscape?
When the fundamental training infrastructure improves by a factor of three from one generation to the next, it changes what is possible to build. Models that previously took weeks to train can now be completed in days. That lowers the threshold for iteration and experimentation — and in practice accelerates the entire AI development cycle.
The MLPerf benchmark is not perfect, and there is always a gap between controlled test conditions and production environments. But as a comparative measure of training infrastructure it is widely recognized in the industry, and Blackwell's dominance here is difficult to ignore.
The sources for this article include NVIDIA's official blog on MLPerf Training 6.0 as well as independent analytical work comparing AMD Instinct MI300X and NVIDIA Blackwell in real-world training scenarios.
