The Tencent Hunyuan AI Infrastructure team has released HPC-Ops, an open-source, production-ready, high-performance operator library for large language model (LLM) inference. In production deployments, the library has increased inference throughput, measured in queries per minute (QPM), by 30% for the Hunyuan model and by 17% for the DeepSeek model. At the level of individual operators, the Attention operator achieves up to 2.22x the performance of FlashInfer and FlashAttention, the GroupGEMM operator outperforms DeepGEMM by up to 1.88x, and the FusedMoE operator surpasses TensorRT-LLM by up to 1.49x.
