NVIDIA Vera Rubin Ships This Fall: 8 Cloud Partners, 10x Lower Token Cost, HBM4 Triples Bandwidth
Source: TechTimes

Production shipments of NVIDIA's Vera Rubin AI platform are scheduled to begin this fall at all eight confirmed cloud partners: AWS, Google Cloud, Microsoft Azure, Oracle Cloud, CoreWeave, Lambda, Nebius, and Nscale. The rollout marks the moment when the compute architecture selected for upcoming supercomputers at U.S. national laboratories and European supercomputing centers becomes available to the cloud developers and AI labs that have been waiting for it since January. The platform's 10x reduction in inference token cost is not driven by raw compute gains alone; it comes from an architectural decision to nearly triple per-GPU memory bandwidth to 22 terabytes per second using HBM4 and to double the rack-scale interconnect to 260 terabytes per second using NVLink 6. Together, these changes are meant to eliminate the memory bottleneck that has constrained trillion-parameter model efficiency on current Blackwell systems.

NVIDIA confirmed that the platform entered full production on June 1, 2026, at its GTC Taipei keynote. On June 22, the company announced at ISC High Performance 2026 in Hamburg that Vera Rubin will also power the next-generation supercomputers at Leibniz Supercomputing Centre, the U.S. Department of Energy's National Energy Research Scientific Computing Center, and Los Alamos National Laboratory, expanding its deployment footprint from commercial AI factories to national security and open-science workloads within the same month.

Why the Memory Wall Was the Real Problem

The framing of AI hardware progress around compute — exaflops, petaflops, raw throughput numbers — has consistently understated where the actual constraint lives in large model training and inference. When a 72-GPU rack tries to run a trillion-parameter mixture-of-experts model, the GPUs are not idle waiting for arithmetic to complete. They are idle waiting for data to arrive from memory.

The previous Blackwell platform used HBM3e memory at 8 terabytes per second per GPU. Vera Rubin replaces that with HBM4, which doubles the interface width from 1,024 bits to 2,048 bits and delivers 22 terabytes per second per GPU — a near-tripling of memory throughput. Each Rubin GPU carries 288 gigabytes of this faster memory. Simultaneously, NVLink 6 doubles the GPU-to-GPU interconnect from the 1.8 terabytes per second offered by NVLink 5 in Blackwell to 3.6 terabytes per second per GPU bidirectional, bringing total all-to-all fabric bandwidth across all 72 GPUs in a single NVL72 rack to 260 terabytes per second.
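The ratios in the paragraph above follow from simple arithmetic on the per-GPU figures. The following is a minimal Python sketch that uses only numbers quoted in this article; the variable and function names are illustrative, and the rack total assumes every GPU drives its full per-GPU NVLink rate at once.

```python
# Figures quoted in the article; names are illustrative, not NVIDIA terminology.
GPUS_PER_RACK = 72
HBM3E_TBPS = 8.0     # Blackwell memory bandwidth per GPU
HBM4_TBPS = 22.0     # Rubin memory bandwidth per GPU
NVLINK5_TBPS = 1.8   # Blackwell NVLink bandwidth per GPU
NVLINK6_TBPS = 3.6   # Rubin NVLink bandwidth per GPU


def rack_fabric_tbps(per_gpu_tbps: float, gpus: int = GPUS_PER_RACK) -> float:
    """Aggregate NVLink bandwidth, assuming every GPU runs at its full rate."""
    return per_gpu_tbps * gpus


memory_gain = HBM4_TBPS / HBM3E_TBPS
nvlink_gain = NVLINK6_TBPS / NVLINK5_TBPS

print(f"HBM4 vs HBM3e per GPU: {memory_gain:.2f}x")
print(f"NVLink 6 vs NVLink 5 per GPU: {nvlink_gain:.1f}x")
print(f"Rubin NVL72 fabric: {rack_fabric_tbps(NVLINK6_TBPS):.1f} TB/s")
print(f"Blackwell NVL72 fabric: {rack_fabric_tbps(NVLINK5_TBPS):.1f} TB/s")
```

The memory gain works out to 2.75x, which is why "near-tripling" is the more precise description, and 72 GPUs at 3.6 terabytes per second give 259.2 terabytes per second, which is rounded to 260.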

The practical consequence of these two changes together is that mixture-of-experts models can be trained on the Vera Rubin NVL72 using one-quarter the number of GPUs required on an equivalent Blackwell system. For frontier model labs whose training budgets are constrained by GPU allocation rather than by time, that is the figure that changes the economics most significantly. For inference, NVIDIA reports 10x higher throughput per watt and a 10x reduction in cost per million tokens compared to the Blackwell generation.

What the Vera Rubin NVL72 Actually Is

The NVL72 — the rack-scale unit at the center of most partner deployments — integrates 72 Rubin GPUs and 36 custom Vera CPUs in a single fully liquid-cooled enclosure. NVIDIA designed both chips in-house, a first for the company's data center platforms.

The Rubin GPU carries 336 billion transistors manufactured on TSMC's 3-nanometer process and delivers 50 petaflops of NVFP4 inference performance per GPU. The Vera CPU uses 88 custom Arm-based Olympus cores and connects to the Rubin GPU over a chip-to-chip NVLink-C2C link running at 1.8 terabytes per second. That chip-to-chip connection places both chips under a single coherent memory fabric for the first time in a commercial rack-scale system: data no longer needs to cross a PCIe boundary between CPU preprocessing and GPU compute. This removes a latency bottleneck that has constrained agentic AI workloads, in which a single user prompt triggers hundreds of reasoning and tool-use steps before a response is returned.

The NVL72 rack requires 100 percent liquid cooling. Air-cooled configurations do not exist for this generation. Data centers deploying Vera Rubin must support direct-to-chip liquid cooling infrastructure, an 800-volt DC power architecture that represents a significant departure from the 48-volt standard in place for over a decade, and electrical plant upgrades that carry retrofit costs of roughly $60,000 to $195,000 per rack before the rack itself is purchased.

The platform also now comprises seven chips — the original six announced at CES 2026 in January, plus the Groq 3 Language Processing Unit added at GTC in March. The Groq 3 LPX rack handles low-latency deterministic inference, and when paired with NVL72 racks it delivers a 35x improvement in inference throughput per megawatt for trillion-parameter models, according to NVIDIA.

Eight Cloud Partners, One Deployment Window

NVIDIA confirmed at CES 2026 that the first deployment cohort for Vera Rubin includes all four major public cloud providers — AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure — alongside specialist AI cloud providers CoreWeave, Lambda, Nebius, and Nscale.

The coordinated deployment across all eight partners in a single half-year window stands apart from the rollout patterns of prior GPU generations, when staggered availability left some providers waiting quarters behind others.

Microsoft's commitment carries the largest disclosed footprint. The company has confirmed it will deploy Vera Rubin NVL72 rack-scale systems at its next-generation Fairwater AI superfactory sites, which will scale to hundreds of thousands of Vera Rubin Superchips. Google Cloud is offering Vera Rubin through its A5X bare-metal instances and connecting those instances through Google's proprietary Virgo Network, a purpose-built AI networking fabric that can link up to 80,000 Rubin GPUs in a single data center and up to 960,000 GPUs across multiple sites. CoreWeave became the first cloud provider to complete full rack-scale validation of the Vera Rubin NVL72 on June 1, 2026, after Dell Technologies delivered the system and CoreWeave ran a 147-hour testing suite before declaring production validation complete.

AWS has committed to deploying more than one million NVIDIA GPUs — including Blackwell and Rubin architectures — within the next 12 months.

Who Pays $7.8 Million a Rack — and Why

A Morgan Stanley research report released May 22 estimated the bill-of-materials cost for a single VR200 NVL72 rack at approximately $7.8 million, nearly double the roughly $4 million estimate for the prior Blackwell NVL72 generation. The increase does not track primarily to GPU pricing: Rubin GPUs are estimated at approximately $55,000 per unit in hyperscaler volume, while memory components — HBM4 and LPDDR5X — now account for approximately $2 million of the total rack cost, or roughly 26 percent of the bill of materials.

That figure represents a 435 percent increase in memory cost compared to the Blackwell generation, and it reflects a structural shift in AI infrastructure economics: for the first time, the memory supply chain — not the GPU — is the dominant cost driver in AI server pricing. It is also the physical manifestation of the architectural thesis behind Vera Rubin: if memory bandwidth is the binding constraint on model performance, the system needs far more of it, and that memory carries a premium.
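The cost figures above can be cross-checked with a few lines of arithmetic. This is a rough Python sketch built only on the estimates quoted in this article. It rests on two assumptions that the article does not confirm: that "a 435 percent increase" means the new memory cost is 5.35 times the old one, and that the per-GPU estimate and the memory line item can be compared as separate shares.

```python
# Estimates quoted in the article (USD); names are illustrative.
RUBIN_RACK_BOM = 7_800_000
BLACKWELL_RACK_BOM = 4_000_000
GPU_UNIT_COST = 55_000
GPUS_PER_RACK = 72
MEMORY_COST = 2_000_000
MEMORY_INCREASE_PCT = 435  # assumed to mean new = old * (1 + 4.35)

gpu_total = GPU_UNIT_COST * GPUS_PER_RACK
gpu_share = gpu_total / RUBIN_RACK_BOM
memory_share = MEMORY_COST / RUBIN_RACK_BOM
rack_cost_ratio = RUBIN_RACK_BOM / BLACKWELL_RACK_BOM

# Back out the Blackwell-era memory cost implied by the quoted increase.
implied_blackwell_memory = MEMORY_COST / (1 + MEMORY_INCREASE_PCT / 100)

print(f"GPUs per rack: ${gpu_total:,} ({gpu_share:.1%} of BOM)")
print(f"Memory per rack: ${MEMORY_COST:,} ({memory_share:.1%} of BOM)")
print(f"Rack BOM vs Blackwell: {rack_cost_ratio:.2f}x")
print(f"Implied Blackwell memory cost: ${implied_blackwell_memory:,.0f}")
```

On these inputs the 72 GPUs come to about $3.96 million, or roughly 51 percent of the bill of materials, and memory to about 26 percent. The article does not say whether the per-GPU estimate includes HBM4, so the two shares may overlap.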

For organizations running sustained AI inference workloads around the clock, the 10x reduction in token cost from Vera Rubin's efficiency improvements is designed to justify that upfront cost. NVIDIA founder and CEO Jensen Huang framed it at GTC Taipei as "token factory economics": every data center is now an AI factory whose output is measured in tokens per watt, and the infrastructure inside that power envelope determines its revenue potential.

How Vera Rubin Stands Against Competition

Vera Rubin arrives as competition in AI accelerator hardware has grown more substantial than at any prior point in NVIDIA's data center history. AMD's Helios rack-scale systems, built on the MI450X architecture, promise inference performance comparable to the Vera Rubin NVL72 while offering 432 gigabytes of HBM4 memory per GPU socket — 50 percent more capacity per GPU than Rubin's 288 gigabytes, which could allow AMD-based systems to serve larger models on a single rack. Google continues expanding its TPU family, which now splits into two purpose-built chips — the training-focused TPU 8t and inference-focused TPU 8i — giving it an in-house alternative for workloads where it chooses not to rely on NVIDIA silicon.

But the breadth of the Vera Rubin partner ecosystem, which covers all four major public clouds and four specialist AI providers simultaneously, reflects a commercial reality that NVIDIA's competitors have not yet matched: the combination of CUDA's software ecosystem, NVIDIA's supply chain coordination across more than 350 factories in 30 countries, and relationships with every major hyperscaler creates adoption momentum that raw GPU performance numbers do not fully capture.

One constraint that does not appear in the competitive narrative: NVIDIA acknowledged in its FY2026 annual report that the company was effectively foreclosed from competing in China's data center market after successive rounds of U.S. export controls. The Vera Rubin NVL72, with performance roughly 22 times higher than the chips NVIDIA is currently permitted to sell in China, is not available to Chinese buyers under the current regulatory framework.

When Enterprise Developers Can Actually Get Access

The H2 2026 deployment cohort is committed by all eight named providers, but the practical access window for enterprise development teams without hyperscaler-scale contracts is later. Manish Rawat, semiconductor analyst at TechInsights, said supply constraints will tighten cloud availability and elevate the importance of reserved capacity, with enterprises likely facing delays in the availability of next-generation instances beyond what the hyperscaler timeline suggests.

Supply constraints operate on two fronts. TSMC's 3-nanometer process, which also supports Apple's latest processors and AMD's MI450X, carries finite wafer capacity, with NVIDIA's estimated 2026 output of Rubin GPUs running between 200,000 and 300,000 units, of which NVIDIA historically allocates 60 to 70 percent to hyperscalers in the first year. The HBM4 supply chain presents a second constraint: each Rubin GPU requires 288 gigabytes of HBM4, roughly six times the memory of a consumer GPU, and HBM4 yields remain below the mature HBM3e levels that supported Blackwell at scale.
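To make the supply picture concrete, the sketch below turns the ranges quoted above into the number of GPUs left for buyers outside the hyperscalers. It is a back-of-the-envelope Python estimate, not a forecast: the function names are hypothetical, and it assumes, purely for illustration, that every GPU ships in a 72-GPU NVL72 rack.

```python
# Ranges quoted in the article; the rack assumption is illustrative only.
GPUS_PER_RACK = 72
OUTPUT_RANGE = (200_000, 300_000)   # estimated 2026 Rubin GPU output
HYPERSCALER_SHARE_PCT = (60, 70)    # historical first-year allocation


def non_hyperscaler_gpus(total_gpus: int, hyperscaler_pct: int) -> int:
    """GPUs left over after the hyperscaler allocation (integer math)."""
    return total_gpus * (100 - hyperscaler_pct) // 100


def full_racks(gpus: int) -> int:
    """Number of complete NVL72 racks that a GPU count can populate."""
    return gpus // GPUS_PER_RACK


# Tightest case: low output, high hyperscaler share. Loosest case: the reverse.
tightest = non_hyperscaler_gpus(OUTPUT_RANGE[0], HYPERSCALER_SHARE_PCT[1])
loosest = non_hyperscaler_gpus(OUTPUT_RANGE[1], HYPERSCALER_SHARE_PCT[0])

print(f"GPUs outside hyperscalers: {tightest:,} to {loosest:,}")
print(f"Equivalent NVL72 racks: {full_racks(tightest):,} to {full_racks(loosest):,}")
```

Under these assumptions, between 60,000 and 120,000 GPUs, or roughly 830 to 1,670 racks, would be available to all other buyers in the first year.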

For most enterprise AI teams, the practical window for Vera Rubin access aligns with 2027, consistent with the six-to-twelve-month ramp that has characterized each recent NVIDIA GPU generation. Teams running models below 70 billion parameters on existing Blackwell capacity have no urgent reason to upgrade in 2026. Vera Rubin's efficiency gains are most economically significant for models above 200 billion parameters, disaggregated inference at scale, and workloads where per-token compute cost is the primary constraint.


Frequently Asked Questions

What is the NVIDIA Vera Rubin platform?

Vera Rubin is NVIDIA's next-generation AI computing platform, succeeding the Blackwell architecture. It integrates seven co-designed chips — led by the Rubin GPU and Vera CPU — into a rack-scale supercomputer. The NVL72 configuration combines 72 Rubin GPUs and 36 Vera CPUs in a single liquid-cooled rack that delivers 3.6 exaflops of NVFP4 inference performance. NVIDIA named the platform after Vera Florence Cooper Rubin, the American astronomer whose work on galaxy rotation curves provided the first major observational evidence for dark matter.

How does Vera Rubin compare to Blackwell?

Three architecture changes account for most of the performance gap. First, HBM4 memory nearly triples per-GPU memory bandwidth from 8 terabytes per second on Blackwell to 22 terabytes per second. Second, NVLink 6 doubles the GPU-to-GPU interconnect from 1.8 to 3.6 terabytes per second per GPU, bringing total all-to-all rack bandwidth to 260 terabytes per second versus 130 on Blackwell. Third, the Vera CPU connects to the Rubin GPU over a chip-to-chip NVLink link at 1.8 terabytes per second, eliminating the PCIe boundary that previously separated CPU and GPU memory domains. Together, these changes allow the NVL72 to train large mixture-of-experts models with one-quarter the GPU count required on Blackwell and to deliver inference at one-tenth the cost per million tokens.

When will Vera Rubin be available in the cloud?

Production shipments are scheduled to begin this fall at the eight confirmed cloud partners: AWS, Google Cloud, Microsoft Azure, Oracle Cloud, CoreWeave, Lambda, Nebius, and Nscale. CoreWeave completed the first full rack-scale validation on June 1, 2026. Enterprise teams without hyperscaler-scale contracts should expect practical access in 2027, consistent with the supply ramp pattern that applied to the previous two NVIDIA GPU generations.

Why does memory bandwidth matter more than raw compute in AI?

Modern large language models, and especially mixture-of-experts architectures, spend more time waiting for data to move between memory and compute than they spend on arithmetic. When a trillion-parameter model processes a token, it must load the relevant model weights from memory into the GPU cores, and at today's model sizes memory can deliver those weights only so fast. This is the memory bandwidth wall, and it is why doubling the GPU count in a Blackwell cluster produces diminishing returns on mixture-of-experts workloads: more compute does not help if memory cannot feed it faster. HBM4's wider 2,048-bit interface addresses the wall directly. Vera Rubin tackles both constraints at once, with more memory bandwidth per GPU and more interconnect bandwidth between GPUs.
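The memory wall can be illustrated with the standard roofline model, in which attainable throughput is the lesser of peak compute and memory bandwidth multiplied by arithmetic intensity (floating-point operations per byte moved). The Python sketch below uses the per-GPU figures quoted in this article. The workload numbers are assumptions for illustration only: batch-size-one decoding with 2 operations per weight and about 0.5 bytes per 4-bit weight, ignoring scale factors, KV-cache traffic, and interconnect.

```python
# Per-GPU figures quoted in the article.
PEAK_FLOPS = 50e15     # 50 petaflops NVFP4
MEM_BW_BYTES = 22e12   # 22 TB/s HBM4


def ridge_point(peak_flops: float = PEAK_FLOPS, mem_bw: float = MEM_BW_BYTES) -> float:
    """Arithmetic intensity (FLOPs/byte) above which a kernel is compute-bound."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     mem_bw: float = MEM_BW_BYTES) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bw * intensity)


# Hypothetical workload: one multiply-add per weight, 4-bit weights.
FLOPS_PER_WEIGHT = 2.0
BYTES_PER_WEIGHT = 0.5
decode_intensity = FLOPS_PER_WEIGHT / BYTES_PER_WEIGHT

bound = attainable_flops(decode_intensity)
print(f"Ridge point: {ridge_point():.0f} FLOPs/byte")
print(f"Decode intensity (assumed): {decode_intensity:.0f} FLOPs/byte")
print(f"Attainable: {bound / 1e12:.0f} TFLOPS ({bound / PEAK_FLOPS:.2%} of peak)")
```

With these assumed workload numbers the kernel sits far below the ridge point of roughly 2,273 operations per byte, so throughput scales with memory bandwidth rather than with peak compute. Batching raises arithmetic intensity and narrows the gap, which is one reason real deployments differ from this single-request bound.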