Moonshot AI, working with a research team from Tsinghua University, has unveiled an architecture called "Prefill-as-a-Service (PrfaaS)" that targets the inference performance bottlenecks of Large Language Models (LLMs). The architecture separates the prefill and decoding stages so they can cooperate across regions: compute-intensive prefill tasks are delegated to specialized high-performance computing clusters, the KVCache they produce is transmitted to local decoding clusters, and a dual-timescale scheduling mechanism keeps the cross-region data transfer efficient.

In production testing, the architecture raised service throughput by 54%, reduced response latency, and improved resource utilization. Beyond its immediate engineering lessons, the collaboration lays groundwork for cross-regional computing power networks, and the "Prefill-as-a-Service" model may prove a milestone in the industrial deployment of large models.
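To make the division of labor concrete, here is a minimal, illustrative sketch of prefill/decode disaggregation. This is not Moonshot AI's implementation; all class and method names (`PrefillCluster`, `DecodeCluster`, `receive_kvcache`) are hypothetical, and token ids stand in for the per-layer attention tensors a real KVCache would hold. A remote "prefill cluster" processes the prompt, ships the resulting cache to a local "decode cluster", and token generation then proceeds from the transferred cache.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # One (key, value) entry per processed token; real systems store
    # per-layer attention tensors, here token ids are stand-ins.
    entries: list = field(default_factory=list)

class PrefillCluster:
    """High-performance cluster: runs the compute-heavy prompt pass."""
    def prefill(self, prompt_tokens):
        cache = KVCache()
        for tok in prompt_tokens:
            cache.entries.append((tok, tok))  # stand-in for attention K/V
        return cache

class DecodeCluster:
    """Local cluster: receives the KVCache and generates tokens."""
    def __init__(self):
        self.cache = None

    def receive_kvcache(self, cache):
        # Stand-in for the cross-region KVCache transfer.
        self.cache = cache

    def decode(self, max_new_tokens):
        assert self.cache is not None, "KVCache must arrive before decoding"
        out = []
        for _ in range(max_new_tokens):
            nxt = len(self.cache.entries)  # toy next-token rule
            out.append(nxt)
            self.cache.entries.append((nxt, nxt))  # cache grows as we decode
        return out

prefill_svc = PrefillCluster()
decode_svc = DecodeCluster()
cache = prefill_svc.prefill([101, 102, 103])  # done on the remote cluster
decode_svc.receive_kvcache(cache)             # KVCache shipped to local cluster
print(decode_svc.decode(2))                   # → [3, 4]
```

Because the decode cluster never re-runs the prompt, the expensive prefill compute can live wherever capacity is cheapest; the dual-timescale scheduler the article mentions would sit between these two components, deciding (slowly) which clusters pair up and (quickly) when each cache transfer fires.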
