Huawei Ascend has made notable advances in inference performance for ultra-large-scale Mixture-of-Experts (MoE) models, with its domestically developed chips outperforming NVIDIA's Hopper architecture across the board. Leading these results are the CloudMatrix 384 super-node and the Atlas 800I A2 inference server, which deliver per-card decode throughputs of 1920 tokens/s and 808 tokens/s, respectively, under different latency constraints. Huawei attributes these gains to mathematical optimization strategies that compensate for hardware limitations and strengthen overall system capability. The company also plans to fully open-source the relevant technologies and will hold a technology disclosure week this week, during which detailed technical reports and blog posts will be released.
