Tencent Hunyuan AI Infra New Open-Source Release: Comprehensive Upgrade of HPC-Ops Inference Core Operators
Author: Editor

HPC-Ops has released an open-source upgrade covering five key operators. The upgrade is designed to help inference systems adapt to dynamic workloads and to meet core modules' needs for complex precision handling and high-performance fused operators. It addresses several engineering bottlenecks on mainstream inference platforms, including long-tail latency in Attention, memory transfer overhead, and cross-GPU communication cost, and it surpasses existing open-source baselines on several performance metrics.

Key improvements include the following. The Attention operator uses dynamic workload scheduling to achieve up to a 2.95x speedup in long-text processing and a 17% improvement in end-to-end QPM. Router GEMM combines two BF16 GEMMs to reach FP32-level precision while running 3.22x faster than cuBLAS FP32. FusedMoE builds a pipeline across the full module and performs 1.2x to 1.6x better than vLLM and SGLang. Fused AllReduce+Norm fuses cross-GPU communication with computation, achieving a 1.04x to 1.68x speedup over NCCL and FlashInfer. Sampler consolidates sampling computation into two CUDA kernels, delivering a 4.0x to 7.5x speedup over vLLM and a 1.9x to 4.7x speedup over FlashInfer.
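
The announcement does not spell out how Router GEMM combines two BF16 GEMMs, so the following is only a minimal sketch of one common way such a scheme can work, not the HPC-Ops implementation. It assumes the FP32 router weights are split into a BF16 "high" part and a BF16 "low" residual, and that the two BF16 products are summed in a wider accumulator. The helper names (to_bf16, split_bf16, dot) are hypothetical, BF16 is emulated in pure Python by rounding FP32 bit patterns, and Python floats stand in for the accumulator.

```python
import random
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (ties to even); finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 pattern.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def split_bf16(w: float):
    """Assumed scheme: w is approximated by hi + lo, both bfloat16 values."""
    hi = to_bf16(w)
    lo = to_bf16(w - hi)  # residual that the first rounding dropped
    return hi, lo


def dot(a, b):
    """Dot product accumulated in Python floats (stand-in for a wide accumulator)."""
    return sum(p * q for p, q in zip(a, b))


random.seed(0)
K = 256  # hypothetical hidden dimension
x = [to_bf16(random.uniform(-1.0, 1.0)) for _ in range(K)]  # BF16 activations
w = [to_f32(random.uniform(-1.0, 1.0)) for _ in range(K)]   # FP32 router weights

w_hi, w_lo = zip(*(split_bf16(v) for v in w))

reference = dot(x, w)                # FP32 weights used directly
single = dot(x, w_hi)                # one BF16 product: weights rounded once
dual = dot(x, w_hi) + dot(x, w_lo)   # two BF16 products, summed

err_single = abs(single - reference)
err_dual = abs(dual - reference)
print(f"single BF16 error: {err_single:.3e}")
print(f"dual BF16 error:   {err_dual:.3e}")
```

Under this assumed split, hi + lo recovers roughly 16 of the 24 significand bits of each FP32 weight, so the summed result tracks the FP32 reference much more closely than a single BF16 product, while both products can still run on BF16 matrix hardware. Whether HPC-Ops splits the weights, the activations, or both is not stated in the announcement.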