According to MarkTechPost, Moonshot AI and a research team from Tsinghua University have jointly released the Prefill-as-a-Service (PrfaaS) architecture. By decoupling the prefill and decoding phases of large-model inference, the architecture eases the hardware deployment constraints that normally tie both phases to the same cluster. PrfaaS offloads long-context prefill work to a separate, compute-dense cluster and sends the resulting KVCache over standard Ethernet to the local decoding cluster, which makes collaboration across data centers possible. It introduces dynamic threshold routing and a dual-timescale scheduler, which together allocate resources and optimize transmission according to request length. In reported tests, the architecture raises serving throughput by 54% over a homogeneous baseline and by 32% over a naive heterogeneous configuration, while cutting time to first token by 50%.
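To make the idea of length-based threshold routing concrete, here is a minimal Python sketch. It is not the PrfaaS implementation: the class name `ThresholdRouter`, the token cutoffs, the link-utilization signal, and the doubling/halving rule are all assumptions chosen for illustration, since the report only states that routing depends on request length and that the threshold is dynamic.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRouter:
    """Hypothetical length-based router; illustrative only, not PrfaaS code."""
    threshold_tokens: int = 8192   # assumed starting cutoff
    min_threshold: int = 1024      # assumed lower bound
    max_threshold: int = 131072    # assumed upper bound

    def route(self, prompt_tokens: int) -> str:
        # Long prompts go to the remote prefill cluster; short ones stay local.
        if prompt_tokens >= self.threshold_tokens:
            return "remote_prefill"
        return "local_prefill"

    def update(self, link_utilization: float) -> None:
        # Assumed adjustment rule: a busy KVCache link raises the cutoff
        # (offload less), an idle link lowers it (offload more).
        if link_utilization > 0.9:
            self.threshold_tokens = min(self.threshold_tokens * 2, self.max_threshold)
        elif link_utilization < 0.5:
            self.threshold_tokens = max(self.threshold_tokens // 2, self.min_threshold)


router = ThresholdRouter()
print(router.route(2000))    # local_prefill
print(router.route(32000))   # remote_prefill
router.update(link_utilization=0.95)
print(router.threshold_tokens)  # 16384
```

In this sketch the threshold reacts only to how busy the transfer link is. A real scheduler would presumably also weigh cluster load and queueing delay, and the report's dual-timescale scheduler suggests such adjustments happen at more than one time granularity.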
