
OpenAI launched its first model on non-Nvidia hardware in February, slashing AI coding response times from seconds to milliseconds — and in less than five months, that experiment has produced a homegrown inference chip, a government-reviewed frontier model, and the clearest picture yet of what post-Nvidia AI infrastructure looks like. On Friday, Reuters reported that OpenAI confirmed it would delay the full public rollout of GPT-5.6 at the request of the Office of the National Cyber Director and the Office of Science and Technology Policy — a national security intervention that also marks the first time a frontier model release has been formally staged by the U.S. government. Developers and enterprise teams expecting access to GPT-5.6 Sol, OpenAI's most capable model to date, now face a case-by-case federal approval process during the preview period.
The story starts with a piece of silicon roughly the size of a dinner plate. On February 12, OpenAI released GPT-5.3-Codex-Spark, a streamlined, latency-optimized coding model, and deployed it on Cerebras Systems' third-generation Wafer Scale Engine. The WSE-3 is not a conventional chip. Where a standard AI accelerator die covers roughly 815 square millimeters before yield problems make it unmanufacturable, the WSE-3 spans 46,225 square millimeters, a square about 215 millimeters on a side and close to the largest that can be cut from a 300mm silicon wafer. It integrates 4 trillion transistors, 900,000 AI-optimized cores, and 44 gigabytes of on-chip static RAM, fabricated on TSMC's 5nm process. That size is the point.
The bottleneck in AI code generation, and in large language model inference more broadly, is not arithmetic. It is data movement. Every time a GPU cluster generates a token, each GPU must read its share of the model's weights from off-chip memory, run the computation, and exchange intermediate results with its peers across high-speed interconnects like NVLink. At the scale of modern language models, that memory-to-compute pipeline becomes the rate-limiting step, a constraint computing researchers call the memory wall. The WSE-3 attacks this problem architecturally: its 21 petabytes per second of on-chip memory bandwidth, roughly 7,000 times that of a single Nvidia H100, means weights stay close to the cores that need them, reducing the time each token generation cycle spends waiting for data.
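The memory-wall argument can be made concrete with a back-of-the-envelope ceiling: if every weight must be read once per generated token, single-stream speed cannot exceed memory bandwidth divided by the size of the weights. The Python sketch below is illustrative only. The bandwidth figures follow the numbers quoted above, the 20 GB model size is a hypothetical placeholder rather than a published figure, and the function name is invented for this example.

```python
# Rough ceiling on decode speed when token generation is memory-bound.
# Bandwidth figures follow the article; the model size is hypothetical.

WSE3_BANDWIDTH_BYTES_PER_S = 21e15  # 21 PB/s of on-chip bandwidth
# "Roughly 7,000 times" an H100 implies about 3 TB/s per GPU.
H100_BANDWIDTH_BYTES_PER_S = WSE3_BANDWIDTH_BYTES_PER_S / 7000


def max_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream tokens/s if all weights are read once per token.

    Ignores compute time, batching, KV-cache traffic, and interconnect hops,
    so real systems land below this number, often far below it.
    """
    if weight_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("weight_bytes and bandwidth_bytes_per_s must be positive")
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    hypothetical_weight_bytes = 20e9  # assumed 20 GB of weights, for illustration
    for label, bandwidth in [
        ("single H100 (implied)", H100_BANDWIDTH_BYTES_PER_S),
        ("WSE-3 on-chip", WSE3_BANDWIDTH_BYTES_PER_S),
    ]:
        ceiling = max_tokens_per_second(hypothetical_weight_bytes, bandwidth)
        print(f"{label}: at most {ceiling:,.0f} tokens/s")
```

The ceiling is a bound, not a prediction. On a wafer-scale part the bound sits so high that other factors, such as compute and scheduling, are likely to set the real limit instead.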
The result, on Codex-Spark: over 1,000 tokens per second, as OpenAI measured internally. That figure is OpenAI's own benchmark under optimized conditions; one independent developer reported a more modest improvement in practical coding sessions, a reminder that laboratory throughput numbers do not always transfer cleanly to real-world workflows. Still, even in the conservative case, the latency shift is legible to a developer: a 30-line function that previously took several seconds to appear in full now arrives in under a second.
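The "under a second" claim is easy to sanity-check with simple arithmetic. The sketch below assumes about 12 tokens per line of code and a 100 tokens-per-second baseline for the earlier experience. Both are illustrative guesses, not figures from OpenAI or Cerebras; only the 1,000 tokens-per-second rate comes from the reporting above.

```python
# Time for a 30-line function to finish streaming at different decode rates.
# ASSUMED_TOKENS_PER_LINE and ASSUMED_BASELINE_TOKENS_PER_S are illustrative
# guesses; only the 1,000 tokens/s figure comes from the article.

ASSUMED_TOKENS_PER_LINE = 12
ASSUMED_BASELINE_TOKENS_PER_S = 100.0
CODEX_SPARK_TOKENS_PER_S = 1000.0


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Streaming time for num_tokens at a steady rate, ignoring time to first token."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    total_tokens = 30 * ASSUMED_TOKENS_PER_LINE
    for label, rate in [
        ("assumed baseline", ASSUMED_BASELINE_TOKENS_PER_S),
        ("Codex-Spark (reported)", CODEX_SPARK_TOKENS_PER_S),
    ]:
        elapsed = seconds_to_generate(total_tokens, rate)
        print(f"{label}: {elapsed:.2f} s for {total_tokens} tokens")
```

Under these assumptions the function is a few hundred tokens long, so a tenfold rate increase moves completion from a few seconds to well under one. That matches the shift described above, though the exact numbers depend on the guesses.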
Codex-Spark rolled out as a research preview to ChatGPT Pro subscribers through the Codex app, the command-line interface, and the Visual Studio Code extension. The model runs with a 128,000-token context window and is text-only — deliberate constraints that keep it lean enough to benefit from the WSE-3's on-chip memory without requiring external storage.
There is a limit to what the WSE-3 can do in its fastest mode. The 44 GB of on-chip static RAM that makes Codex-Spark so responsive is enough to hold small and mid-sized models entirely on-chip, the operating condition that delivers the headline token-per-second numbers. But the largest frontier models, including GPT-5.6, exceed that on-chip capacity by a wide margin. For those models, Cerebras uses weight streaming: model weights live in MemoryX, an off-chip memory system that can scale to over a petabyte of capacity, and are loaded layer by layer onto the wafer. This approach still reduces cross-chip interconnect latency compared to GPU clusters, but it adds an off-chip data dependency that the on-chip mode avoids. The 750 tokens per second cited in leaks for a potential GPT-5.6 deployment on Cerebras is consistent with this weight-streaming mode rather than the pure on-chip mode that powers Codex-Spark.
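The on-chip versus weight-streaming split comes down to a capacity check: parameter count times bytes per parameter, compared against 44 GB. The sketch below illustrates that check. The model names and parameter counts are hypothetical, since neither OpenAI nor Cerebras has published them in the reporting above. The check also counts weights only, while a real deployment needs room for activations and the attention cache, so the practical threshold would be lower.

```python
from dataclasses import dataclass

# 44 GB of on-chip SRAM per the article (decimal gigabytes assumed).
ON_CHIP_SRAM_BYTES = 44 * 10**9


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical model description; none of these values are published."""
    name: str
    num_parameters: int
    bytes_per_parameter: int  # e.g. 2 for 16-bit weights, 1 for 8-bit

    @property
    def weight_bytes(self) -> int:
        return self.num_parameters * self.bytes_per_parameter


def serving_mode(model: ModelSpec, sram_bytes: int = ON_CHIP_SRAM_BYTES) -> str:
    """Return 'on-chip' if the weights alone fit in SRAM, else 'weight-streaming'.

    Counts weights only; activations and cache would tighten the limit.
    """
    return "on-chip" if model.weight_bytes <= sram_bytes else "weight-streaming"


if __name__ == "__main__":
    examples = [
        ModelSpec("hypothetical-small", 20_000_000_000, 2),      # 40 GB of weights
        ModelSpec("hypothetical-frontier", 400_000_000_000, 2),  # 800 GB of weights
    ]
    for model in examples:
        size_gb = model.weight_bytes / 10**9
        print(f"{model.name}: {size_gb:.0f} GB -> {serving_mode(model)}")
```

Lower-precision weights shift the boundary: the same hypothetical 20-billion-parameter model at one byte per parameter would need only 20 GB, leaving more of the SRAM for working data.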
That engineering detail matters for understanding what OpenAI actually demonstrated on Cerebras, and what remains to be proven. Codex-Spark on the WSE-3 showed that wafer-scale inference works in production for models that fit on-chip. A frontier model deployment would test a harder proposition: whether weight-streaming wafer-scale inference is faster, cheaper, or more reliable than equivalently scaled GPU clusters for the models that actually power ChatGPT.
Two days before the GPT-5.6 announcement, OpenAI revealed it had been working on something beyond partnerships. On June 24, OpenAI and Broadcom jointly unveiled Jalapeño — OpenAI's first custom-designed AI accelerator, developed over 18 months and manufactured by TSMC. The chip was designed from the start to run inference workloads for large language models rather than training — the same problem domain as the Cerebras deployment, but addressed by owning the silicon rather than licensing access to someone else's. Engineering samples arrived at OpenAI's San Francisco headquarters on June 24; OpenAI confirmed it has already begun running Codex-Spark on Jalapeño chips in a production test environment.
Broadcom CEO Hock Tan told Bloomberg that early testing shows the chip delivering roughly 50% lower inference cost per token than current-generation Nvidia GPUs. OpenAI has confirmed that figure only in general terms: the company's own statements describe performance per watt as "substantially better than current state-of-the-art" without specifying a percentage. OpenAI President Greg Brockman told CNBC that the chip was designed from end to end in nine months, with the development cycle compressed by using earlier OpenAI models to accelerate chip design, a feedback loop the company intends to extend as it designs future hardware.
The Cerebras partnership remains in place; the two arrangements serve different technical problems. Cerebras provides inference speed through wafer-scale memory bandwidth for latency-first workloads. Jalapeño targets inference cost efficiency at high volume: the same token-generation compute that powers every ChatGPT conversation, at lower cost per query. OpenAI has also struck separate compute agreements with AMD and Amazon Web Services, adding to a portfolio that now spans Nvidia GPUs, Cerebras wafer-scale processors, AWS Trainium, AMD Instinct GPUs, and its own silicon.
The multi-vendor strategy requires a precise framing of what Nvidia's role actually is. Neither Cerebras nor Jalapeño competes with Nvidia's core function at OpenAI: training frontier models. That task requires sustained, massively parallel compute across tens of thousands of GPUs connected at high bandwidth — a workload the WSE-3 and Jalapeño are not designed to perform. Nvidia holds a $100 billion commitment from OpenAI for its Vera Rubin platform, with first deployments expected in the second half of 2026. When OpenAI CEO Sam Altman said earlier this year that OpenAI "hopes to be a gigantic customer for a very long time," it was a statement of structural dependency on the training side, not a diplomatic gesture.
What Cerebras and Jalapeño address is inference, and specifically the subset of inference workloads where latency and cost matter more than the raw throughput that training demands. For interactive coding tools, real-time reasoning applications, and high-volume API traffic, the architecture choice has direct consequences for developer experience and unit economics. QumulusAI Senior Product Manager Mark Jackson characterized the boundary plainly: the Cerebras wafer-scale architecture is best suited for narrowly defined, high-demand inference environments requiring low latency and strong throughput, while GPUs remain the practical default for most organizations because of their mature software ecosystem and training support.
The broader rollout of GPT-5.6 — specifically the Sol variant, which OpenAI describes as its most capable model to date, alongside mid-tier Terra and lower-cost Luna — arrived at a different kind of gate on June 26. The Office of the National Cyber Director and the Office of Science and Technology Policy asked OpenAI to stage the release, limiting initial access to a small group of vetted partners whose details were shared with federal authorities. OpenAI CEO Sam Altman told staff that the government would "approve access customer by customer during this preview period," according to The Information, cited by Reuters. OpenAI stated in a blog post that it was taking the step as the "strongest path to broader availability in the coming weeks, while we work with the Administration to develop the cyber Executive Order framework and a repeatable process for future model releases."
The arrangement is more permissive than a parallel June 2026 directive applied to Anthropic, which faced restrictions on foreign access to its Mythos 5 and Fable 5 models. OpenAI separately cautioned that government oversight at this scale should not become a permanent standard, noting risks to developers, cybersecurity professionals, and international partners who need timely access to frontier AI tools.
For developers planning deployments on GPT-5.6 Sol, the immediate consequence is an uncertain access timeline. For the AI hardware landscape, the government gate introduces a new variable into infrastructure planning: if frontier model access is now subject to federal staging, decisions about where and how those models are served become entangled with compliance review processes that did not exist six months ago.
The company behind OpenAI's wafer-scale inference tier is navigating its own turbulence. Cerebras completed its initial public offering on May 14, priced at $185 per share and raising $5.55 billion — the largest U.S. technology IPO since Uber's 2019 debut. Shares surged to a first-day close of $311.07. But on June 23, Cerebras reported a first-quarter 2026 loss of $0.22 per share, causing the stock to fall more than $40 per share. Multiple law firms, including Block & Leviton and Pomerantz, have opened investigations into whether Cerebras and certain officers may have violated federal securities laws.
The financial volatility does not change the technical facts about the WSE-3, but it matters for enterprise customers evaluating long-term infrastructure commitments. At IPO, two UAE-linked entities — including the Mohamed bin Zayed University of Artificial Intelligence — accounted for 86% of Cerebras revenue. The OpenAI deal, structured as a $20 billion cloud services agreement that includes warrants for Cerebras stock, represents Cerebras' most significant Western commercial relationship. Morningstar senior equity analyst Brian Colello noted that the greatest risk for Cerebras investors is "intense competition in AI inference, especially versus market leader Nvidia and its Groq business unit."
The technical question underneath the financial story is whether the WSE-3's approach (monolithic, memory-dense, highly specialized) proves durable as frontier models grow larger and inference demands more diversity in workload type. The 44 GB on-chip memory limit means the most capable models already require weight streaming. Future models will be larger still. The case for wafer-scale inference rests on a targeted advantage in a specific workload band, and it must hold up against GPU clusters that continue to improve, Jalapeño chips purpose-built for the same inference use case, and alternatives from Groq, Tenstorrent, and Google's Tensor Processing Units.
What Codex-Spark demonstrated is that the latency advantage is real enough to change how developers interact with AI coding tools. What Jalapeño and the government gating of GPT-5.6 have clarified is that OpenAI's post-Nvidia future is not a single-vendor substitution — it is a layered portfolio in which speed, cost, and regulatory access each demand different answers.
Why does OpenAI use Cerebras hardware instead of Nvidia GPUs for Codex-Spark?
The WSE-3 is architected specifically to reduce inference latency for AI coding workloads. GPU clusters read model weights from off-chip memory and pass intermediate results between multiple chips across high-speed interconnects on every token generation cycle. The WSE-3 instead keeps computation and its 44 GB of on-chip static RAM on a single wafer-scale processor. For models small enough to fit entirely in that on-chip memory, each token arrives significantly faster. OpenAI still uses Nvidia GPUs for training, where massively parallel computation across many chips remains essential.
What is the difference between Codex-Spark's wafer-scale deployment and a potential GPT-5.6 deployment on Cerebras?
Codex-Spark runs in on-chip mode on the WSE-3 because its model weights fit within the 44 GB of on-chip static RAM, enabling the fastest inference times. GPT-5.6 is a larger frontier model that would exceed that capacity, requiring a different operating mode called weight streaming, where model weights are stored in Cerebras' off-chip MemoryX system and loaded layer by layer. Weight streaming still reduces inter-chip latency compared to GPU clusters, but it adds an off-chip data dependency that the on-chip mode avoids.
Why did the U.S. government delay the GPT-5.6 launch?
The Office of the National Cyber Director and the Office of Science and Technology Policy asked OpenAI to stage the GPT-5.6 Sol rollout, limiting initial access to vetted partners whose details were shared with federal agencies. The government's concern, as reported by Axios and Reuters, centers on national security risks tied to the model's cybersecurity capabilities and potential for misuse before guardrails are established. OpenAI is working with the White House on a framework for future frontier model releases under a cyber Executive Order.
What is OpenAI's Jalapeño chip, and does it replace Cerebras?
Jalapeño is OpenAI's first custom-designed AI inference chip, co-developed with Broadcom over 18 months and manufactured by TSMC. Broadcom CEO Hock Tan reported early testing showing roughly 50% lower inference cost per token than current-generation Nvidia GPUs. The Cerebras partnership remains in place. The two arrangements serve different purposes: Cerebras provides maximum inference speed through wafer-scale memory bandwidth for latency-first workloads; Jalapeño targets inference cost efficiency at high volume. OpenAI has already tested Codex-Spark on Jalapeño engineering samples.
