After more than a year of work, AMD, in partnership with IBM and Zyphra, has trained ZAYA1, the first large-scale Mixture of Experts (MoE) foundation model built entirely on AMD hardware. Training was carried out on IBM Cloud using AMD Instinct MI300X GPUs.
The three companies jointly built a training cluster of 128 nodes with a total of 1,024 GPUs, which delivered sustained real-world training performance above 750 PFLOPs. Zyphra developed the optimized training framework.
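As a quick back-of-the-envelope check on those figures, the sketch below derives the per-node and per-GPU numbers purely from the cluster size and throughput stated above (the 8-GPU-per-node layout is simply 1,024 / 128):

```python
# Back-of-the-envelope breakdown of the published cluster figures.
nodes = 128
total_gpus = 1_024
sustained_pflops = 750  # reported real-world training throughput

gpus_per_node = total_gpus // nodes             # 1,024 / 128 = 8 GPUs per node
pflops_per_gpu = sustained_pflops / total_gpus  # ~0.73 PFLOPs sustained per GPU

print(f"{gpus_per_node} GPUs per node, ~{pflops_per_gpu:.2f} PFLOPs per GPU")
```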
Pre-training used 14 trillion tokens of data with a phased curriculum learning approach. On benchmarks, ZAYA1's overall performance is comparable to the Qwen3 series and surpasses models such as SmolLM3 and Phi4.
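The article does not describe the curriculum itself. As a purely illustrative sketch of how a phased curriculum can be driven by the token budget, the snippet below switches the data mixture as training progresses; the phase names and boundaries are hypothetical and are not ZAYA1's actual schedule:

```python
# Illustrative phased-curriculum scheduler (hypothetical phases and boundaries;
# the actual ZAYA1 data mixture is not described in this article).
TOTAL_TOKENS = 14_000_000_000_000  # 14 trillion tokens, as reported

# Each phase covers a fraction of the token budget and uses its own data mixture.
PHASES = [
    (0.6, "general web text"),          # hypothetical early phase
    (0.3, "higher-quality filtered"),   # hypothetical mid phase
    (0.1, "math, code, long-context"),  # hypothetical final phase
]

def phase_for(tokens_seen: int) -> str:
    """Return the data-mixture label for the current point in training."""
    boundary = 0.0
    for fraction, label in PHASES:
        boundary += fraction * TOTAL_TOKENS
        if tokens_seen < boundary:
            return label
    return PHASES[-1][1]

print(phase_for(5_000_000_000_000))    # -> "general web text"
print(phase_for(13_500_000_000_000))   # -> "math, code, long-context"
```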
ZAYA1's performance is largely attributable to two key innovations: the CCA attention mechanism and an improved linear router. A preview version of the base model is available now; a fully post-trained version will be released later, together with evaluation results and insights from the development process.
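The article does not detail either mechanism. For readers unfamiliar with MoE routing, the following is a minimal sketch of a standard top-k linear router, i.e. a generic baseline rather than ZAYA1's enhanced design; the class name, dimensions, and the top-k choice are all assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearTopKRouter(nn.Module):
    """Generic MoE linear router: one linear layer scores every expert,
    then each token is dispatched to its top-k experts. This is the common
    baseline design, not ZAYA1's improved router."""

    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (batch, seq, hidden) -> expert logits: (batch, seq, num_experts)
        logits = self.gate(x)
        probs = F.softmax(logits, dim=-1)
        # Keep the k highest-probability experts per token and renormalize.
        weights, expert_ids = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return weights, expert_ids  # combine weights and which experts to call

router = LinearTopKRouter(hidden_dim=512, num_experts=8, top_k=2)
w, ids = router(torch.randn(1, 4, 512))
print(w.shape, ids.shape)  # torch.Size([1, 4, 2]) torch.Size([1, 4, 2])
```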
