
AMD and Intel have published a full technical specification for ACE (AI Compute Extensions), the most significant overhaul to x86 AI compute in the architecture's history, co-authored by eight AMD and three Intel engineers. The updated specification, version 1.15, gives framework maintainers at PyTorch, TensorFlow, NumPy, and the HPC libraries in the x86 stack a stable instruction set contract to target, even though no compatible silicon has been announced and hardware is not expected before approximately 2028. For developers building AI inference pipelines on x86, the practical message is this: the standard is frozen, the software window is open, and the time to start planning is now.
The co-authorship itself is the first thing technically literate readers should notice. For decades, Intel has been the gatekeeper of new x86 instruction set additions — the company that proposed, defined, and shipped each extension first, while AMD followed and adapted. That pattern ended with ACE. Eight of the whitepaper's eleven named authors are AMD employees; the three Intel contributors joined in a later phase of the specification process. AMD and Intel chose not to standardize Intel's existing Advanced Matrix Extensions (AMX), which Intel has shipped in Xeon server processors since Sapphire Rapids in January 2023. ACE is a separate, new extension — not backward-compatible with AMX — and the choice to build a fresh architecture rather than adopt Intel's existing one signals that AMD's influence over x86's direction has reached a level the industry has never seen before.
The core problem ACE solves is architectural, and understanding it requires a brief detour into how modern CPUs handle math.
Traditional SIMD (Single Instruction, Multiple Data) extensions — including Intel and AMD's AVX10, the current state of the art — are fundamentally one-dimensional. They operate on vectors: arrays of numbers processed in parallel along a single axis. Matrix multiplication, the mathematical backbone of every neural network and large language model, is inherently two-dimensional: two grids of numbers are combined to produce a third. Forcing that 2D problem into a 1D SIMD instruction set is, as Tom's Hardware put it, "technically a hack, as AVX wasn't designed with 2D matrix operations in mind." GPUs solved this problem by building dedicated tensor cores — hardware explicitly designed for 2D matrix operations. CPUs never did.
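To make the 1D limitation concrete, here is a minimal pure-Python sketch of how a matrix multiply is typically expressed with vector operations only. The helper names `vector_fma` and `matmul_1d` are illustrative, not taken from any library, and the list arithmetic merely stands in for a SIMD register. Each emulated instruction updates a single row of the result, so a 2D output has to be built up one row and one step at a time.

```python
def vector_fma(acc, scalar, vec):
    """Emulate one 1D SIMD fused multiply-add: acc[j] += scalar * vec[j]."""
    return [a + scalar * v for a, v in zip(acc, vec)]


def matmul_1d(a, b):
    """Multiply a (M x K) by b (K x N) using only 1D vector operations."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    fma_count = 0
    for i in range(m):          # one output row at a time
        for p in range(k):      # walk the shared dimension
            c[i] = vector_fma(c[i], a[i][p], b[p])
            fma_count += 1
    return c, fma_count


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c, fma_count = matmul_1d(a, b)
print(c)          # [[19, 22], [43, 50]]
print(fma_count)  # 4 vector instructions for a 2 x 2 result
```

The instruction count grows with the number of output rows times the shared dimension, which is the overhead a 2D instruction is meant to remove.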
ACE's solution is to add eight new 2D tile registers to the x86 architecture, each storing a 16×16 matrix of 32-bit values (512 bits per row, 1 KiB per tile). ACE instructions operate on these tiles using an outer-product algorithm: rather than processing one row at a time as SIMD does, the hardware computes the contribution of every row-column intersection simultaneously, consuming two 16×4 input matrices at 8-bit precision in a single pass. The result is a claimed 16× compute density improvement over an equivalent AVX10 multiply-accumulate operation using the same number of input vectors.
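The arithmetic behind the 16× claim can be checked with a short emulation. The sketch below is an illustration under stated assumptions, not the specification's definition: the operand layout (both inputs stored as 16 rows of 4 values) and the name `outer_product_accumulate` are hypothetical, and the AVX10 baseline is assumed to be a 512-bit 8-bit dot-product instruction with 16 lanes of 32 bits, each summing 4 products.

```python
TILE = 16   # accumulator tile is TILE x TILE 32-bit values
K = 4       # each input is a 16 x 4 matrix of 8-bit values


def outer_product_accumulate(tile, a, b):
    """tile[i][j] += sum over k of a[i][k] * b[j][k], for every (i, j)."""
    for i in range(TILE):
        for j in range(TILE):
            tile[i][j] += sum(a[i][k] * b[j][k] for k in range(K))
    return tile


# Two 16 x 4 inputs filled with small values that fit in int8.
a = [[(i + k) % 7 - 3 for k in range(K)] for i in range(TILE)]
b = [[(j * k) % 5 - 2 for k in range(K)] for j in range(TILE)]
tile = outer_product_accumulate([[0] * TILE for _ in range(TILE)], a, b)

# Work per instruction, counted in 8-bit multiply-accumulates (MACs).
tile_macs = TILE * TILE * K     # 1024: every row-column pair, 4 products each
vector_macs = 16 * K            # 64: assumed 16 lanes x 4 products per lane
input_bits = TILE * K * 8       # 512 bits per input, one vector register's worth
tile_bytes = TILE * TILE * 4    # 1024 bytes held in one accumulator tile

print(tile_macs // vector_macs)  # 16
```

Both approaches read the same 512 bits per input; the tile instruction simply reuses every value of one input against every row of the other, which is where the factor of 16 comes from under these assumptions.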
That 16× figure requires careful interpretation. It measures compute density — how much mathematical work is accomplished per instruction — not end-to-end application speedup. Real-world performance gains will depend on how much die area AMD and Intel dedicate to ACE units in their silicon implementations, on memory bandwidth, and on how well compilers can map workloads onto the new instructions. What is not in question is the instruction overhead reduction: each ACE operation performs far more work than the AVX10 loop it replaces, which means less CPU time spent fetching and decoding instructions and better utilization of available memory bandwidth.
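One way to see the gap between compute density and application speedup is Amdahl's law, which is a general rule of thumb rather than anything the ACE specification states. The sketch below assumes, optimistically, that the matrix kernel itself runs a full 16× faster; the workload fractions are hypothetical values chosen for illustration.

```python
def end_to_end_speedup(matmul_fraction, kernel_speedup):
    """Amdahl's law: only the matmul share of the runtime gets faster."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / kernel_speedup)


# Hypothetical shares of total runtime spent in matrix multiplication.
for fraction in (0.5, 0.8, 0.95):
    speedup = end_to_end_speedup(fraction, 16.0)
    print(f"{fraction:.0%} in matmul -> {speedup:.2f}x overall")
```

Even with 95% of the runtime in matrix math, the overall gain under these assumptions is roughly 9×, and memory bandwidth limits would reduce it further.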
On data format support, ACE adds capabilities AVX10 does not have. The specification natively supports INT8, BF16, and OCP FP8, along with the Open Compute Project's MX block-scaled formats MXFP8 and MXINT8, which are designed for the precision tradeoffs that make large language model inference efficient at scale. ACE v1 describes eleven data formats in total, per the ACE whitepaper published April 15, 2026 on the x86 Ecosystem Advisory Group's website.
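For readers unfamiliar with block scaling, the idea is that a small group of values shares one power-of-two scale while each value is stored in a narrow format. The sketch below is a simplified illustration and not the exact OCP MX encoding or anything defined by ACE: it uses a block of 32 values, a shared power-of-two exponent, and plain signed 8-bit integers, and the function names are hypothetical.

```python
import math

BLOCK = 32  # values that share one scale in this sketch


def quantize_block(values):
    """Return (shared_exponent, int8_elements) for one block of floats."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps every element within [-127, 127].
    exp = math.ceil(math.log2(amax / 127))
    scale = 2.0 ** exp
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return exp, q


def dequantize_block(exp, q):
    """Rebuild approximate floats from the shared exponent and elements."""
    scale = 2.0 ** exp
    return [x * scale for x in q]


block = [i * 0.1 - 1.6 for i in range(BLOCK)]
exp, q = quantize_block(block)
restored = dequantize_block(exp, q)
max_error = max(abs(x - y) for x, y in zip(block, restored))
print(exp, max_error <= 2.0 ** exp / 2)
```

Storing one exponent per block costs far less than a full-precision scale per value, which is the tradeoff that makes these formats attractive for large-model inference.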
The question most technically informed readers will ask is why AMD and Intel did not simply standardize Intel's existing AMX technology. AVX10 essentially standardized AVX-512 across both vendors, so the same approach for AMX seemed logical.
The consortium chose differently, and the reasons illuminate something important about x86's future. Intel AMX is available only in Xeon server CPUs — it was never designed for the client market, never intended for laptops or embedded devices. ACE, by contrast, is explicitly designed to span the full x86 range, from data center servers to Ryzen mobile SoCs. The specification states that ACE hardware may be implemented per-core or shared across a cluster of cores, allowing vendors to tune die-area investment to the target market segment. AMX does not offer that flexibility.
There is also an ownership question. AMX is Intel's proprietary design, and standardizing it would have handed Intel architectural control of x86's AI acceleration path at the precise moment AMD has been asserting its own influence. The x86 Ecosystem Advisory Group (EAG) instead chose a new, jointly developed extension, with AMD in the primary authorship role, which establishes that neither company owns this standard. A developer writing ACE-optimized code is writing x86 code, not Intel code.
Jim McGregor, principal analyst at TIRIAS Research, set appropriate expectations for what the extension can and cannot do. "The CPU will never be more efficient than the GPU/AI accelerator," he said. "However, it does allow you to offload some AI workloads to the CPU and/or use the CPU for AI workloads in applications that may not have a GPU or AI accelerator, such as embedded/edge applications."
That framing is precisely the right lens. ACE is not positioning CPUs to displace Nvidia tensor cores or Apple's tightly integrated AMX (a separate, proprietary matrix engine in Apple Silicon unrelated to Intel AMX). It is establishing a consistent, fragmentation-free AI compute baseline across the largest installed base of compute hardware on the planet — every x86 machine from laptops to supercomputers — for the workloads where CPUs already handle the computation, just much less efficiently.
The history of x86 instruction set extensions is littered with fragmentation disasters.
Intel proposed AVX-512 in 2013 and shipped it in 2016 on Xeon Phi. AMD adopted it years later, with a different subset of sub-extensions than Intel's consumer and server parts. When Intel introduced hybrid CPU designs combining performance and efficiency cores in Alder Lake, the problem became acute: efficiency cores lacked AVX-512 hardware entirely, meaning the operating system could crash a workload simply by migrating it from a performance core to an efficiency core. Intel disabled AVX-512 on consumer chips entirely for multiple product generations. Linus Torvalds, the creator of Linux and a luminary of the x86 EAG, wrote in 2020 that he hoped AVX-512 would "die a painful death."
The resulting developer behavior was rational: most ISVs declined to target AVX-512 at all, preferring to write for the lowest common denominator rather than maintain multiple code paths for an instruction set with unpredictable availability.
ACE's co-authorship model is explicitly designed to prevent this outcome. By standardizing the extension jointly — with both AMD and Intel committed before any silicon ships — the EAG has created a target developers can invest in with confidence. "As expected, changes to the instruction set can take a generation or two to filter through the product lines of both companies," McGregor said. "However, working together is a huge advantage for the x86 architecture."
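The "multiple code paths" burden described above usually takes the form of runtime dispatch: software inspects the CPU's advertised features and picks a kernel. The sketch below shows the pattern for Linux-style `/proc/cpuinfo` text. The `avx2` and `avx512f` flag names are real; the `ace` flag name and all kernel names are hypothetical, since no ACE hardware or operating system support exists yet.

```python
def cpu_flags(cpuinfo_text):
    """Extract the feature-flag set from /proc/cpuinfo-style text (Linux)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def pick_kernel(flags):
    """Choose the widest matrix kernel the CPU advertises."""
    if "ace" in flags:        # hypothetical flag name, not taken from the spec
        return "gemm_ace_tiles"
    if "avx512f" in flags:    # real Linux flag for the AVX-512 foundation
        return "gemm_avx512"
    if "avx2" in flags:
        return "gemm_avx2"
    return "gemm_scalar"


sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 avx512f\n"
print(pick_kernel(cpu_flags(sample)))   # gemm_avx512
```

On a real Linux system the text would come from reading `/proc/cpuinfo`. A check like this is only trustworthy when every core a thread can be scheduled on reports the same features, which is the guarantee the hybrid-core AVX-512 episode broke.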
For workload architects thinking practically about where ACE matters, McGregor's frame points to three categories of applications where the absence of a dedicated GPU or AI accelerator is the operating assumption: edge inference, embedded systems, and privacy-sensitive on-device processing.
Embedded computing in automotive systems, industrial robotics, medical devices, and satellite hardware cannot always accommodate a discrete GPU. Small-model inference for latency-sensitive tasks — voice recognition, code completion, local summarization — benefits from CPU-resident execution that avoids the overhead of data transfer to and from a separate accelerator. In all of these contexts, the ability to run matrix multiply operations natively on a CPU, using standardized instructions that work identically across AMD and Intel silicon, changes the economics of what is feasible.
With ACE-capable CPU cores delivering dense matrix throughput, the question of whether discrete NPUs remain necessary for these workloads becomes a genuine design consideration for chip architects. Intel and AMD have both advertised theoretical maximum platform TOPS values combining CPU, GPU, and NPU performance; ACE should substantially raise the CPU component of that figure, potentially changing how architects balance dedicated AI accelerator area against general-purpose compute.
No x86 processor with ACE hardware support has been announced as of the specification's publication. AMD's current roadmap references a "new Matrix Engine" and "AI Data Format Expansion" for the Zen 7 microarchitecture, projected for approximately 2028, which analysts have interpreted as a likely ACE implementation. Neither vendor has mentioned ACE support for its nearer-term parts: AMD's Zen 6, currently ramping in EPYC Venice server CPUs, or Intel's upcoming Nova Lake. The implementation details that matter most for actual performance, namely whether ACE units will be dedicated per-core or shared across a cluster and how much die area each vendor allocates, remain within each company's discretion and have not been disclosed.
What is available now is the instruction set contract itself. The ACE specification, version 1.15, is available for download on the x86 Ecosystem Advisory Group's website. Software enablement is already underway: integrations are in progress for deep learning and HPC libraries, including lower-precision GEMM kernels and LLM-specific primitives, as well as for Python libraries including NumPy and SciPy and for the major ML frameworks PyTorch and TensorFlow.
The engineering sequence is intentional. Publishing the specification before silicon ships gives compiler teams, framework maintainers, and library authors approximately two years to reach maturity before the first hardware arrives. History consistently shows that software readiness — not silicon capability — is the binding constraint on instruction set adoption. The AVX-512 era demonstrated this at painful cost. ACE's co-authors appear to have internalized that lesson.
For ML framework maintainers and HPC library authors, the most productive read of ACE's current state is that the instruction set surface is now stable and co-signed by both vendors, so planning for ACE integration can begin. The roughly two years of lead time are an opportunity to avoid the pattern that doomed AVX-512 adoption: software that was not ready when hardware shipped, followed by developers who skipped the extension entirely because targeting it felt too risky.
For hardware teams and workload architects, the more interesting near-term question is which AI workloads realistically belong on a CPU rather than a GPU, and how a 16× improvement in matrix compute density shifts that calculus. Edge inference, privacy-constrained processing, and embedded applications without GPU options all look meaningfully different once CPU matrix throughput is no longer an architectural afterthought.
ACE's arrival does not resolve the competitive pressure x86 faces from Arm architectures, which have already captured the majority of hyperscaler CPU compute in new data center deployments. But it closes the gap on the specific deficiency, native 2D matrix computation, that has left CPUs running AI inference workloads far less efficiently than dedicated accelerators over the past several years. For x86's future in a world where AI compute is everywhere and GPU access is not, closing that gap is a structural prerequisite, not a marginal improvement.
Will ACE replace Intel's existing AMX extension?
Almost certainly, over time. ACE is a new, jointly standardized extension that is not backward-compatible with Intel's Advanced Matrix Extensions (AMX), which currently exists only in Intel Xeon server processors. AMD never implemented AMX, and the EAG's decision to develop ACE as a fresh standard rather than extend AMX implies that ACE is intended as the long-term standard for x86 matrix acceleration. On that reading, ACE is positioned to do to Intel AMX what AVX10 did to AVX-512: supplant it with a unified, jointly owned standard that neither vendor controls alone.
How does ACE run matrix multiplication on a CPU without a GPU?
ACE adds eight 2D tile registers to the x86 architecture, each storing a 16×16 matrix of 32-bit values. Using an outer-product algorithm, ACE instructions compute the contribution of every row-column intersection in a matrix multiplication in a single hardware pass, rather than iterating row-by-row as conventional SIMD instructions do. This delivers a claimed 16× compute density advantage over an equivalent AVX10 multiply-accumulate operation. The practical consequence is that x86 CPUs will be able to run AI inference locally, particularly for small models, low-latency tasks, and embedded systems without a discrete GPU, avoiding the data-transfer overhead of offloading to a separate accelerator.
When will CPUs with ACE support ship?
No x86 processor with ACE hardware has been announced. Based on AMD's publicly stated roadmap, which references a "new Matrix Engine" for the Zen 7 microarchitecture, the earliest likely arrival is approximately 2028. Intel's corresponding timeline has not been disclosed, and neither vendor has referenced ACE support for Zen 6 or Nova Lake. The ACE specification is available now, and software integration with PyTorch, TensorFlow, NumPy, and HPC libraries is already underway; the standard is designed so that software ecosystems can mature before hardware ships.
Does ACE make x86 CPUs competitive with Nvidia GPUs for AI workloads?
No, and the specification makes no such claim. Jim McGregor of TIRIAS Research was direct: the CPU will never match the efficiency of a GPU or dedicated AI accelerator for the most demanding AI workloads. ACE's purpose is different. It establishes a consistent, high-density matrix compute capability across the entire x86 install base for the categories of AI workloads where CPUs already run the computation but do so inefficiently — edge inference, privacy-sensitive on-device processing, and embedded systems without GPU options.
