DeepSeek and Peking University Jointly Open-Source DSpark: Boosting High-Concurrency Inference Speed by 60% to 85%
Author: Editor

DeepSeek has partnered with Peking University to release DSpark, an inference acceleration framework designed to address the efficiency challenges of large language models in high-concurrency production environments. It has already been deployed in the DeepSeek-V4-Flash and DeepSeek-V4-Pro preview service engines. Compared with MTP-1, the single-token speculative decoding baseline, DSpark increases per-user generation speed by 60% to 85% while maintaining equivalent throughput. The framework's paper and training code are publicly available on GitHub.

DSpark introduces a semi-autoregressive architecture and a confidence-aware scheduling verification mechanism to tackle two major bottlenecks in speculative decoding: the quality of candidate generation and the compute consumed during the verification phase. The semi-autoregressive architecture generates hidden states and base logits for all candidate positions in one pass through a parallel backbone network, then incorporates prefix dependency information via a lightweight sequential module, significantly improving parameter efficiency. The confidence-aware scheduling verification mechanism uses a hardware-aware prefix scheduler to adjust the verification length dynamically based on candidate confidence scores, optimizing the allocation of compute.

Offline benchmarks show that DSpark outperforms the autoregressive draft model Eagle3 and the parallel draft model DFlash in average acceptance length per round across mathematical reasoning, code generation, and everyday conversation tasks. For production deployment, DSpark's draft model employs a specialized architecture and system optimizations to reduce communication complexity as well as computational and memory overhead, and it addresses engineering constraints via asynchronous scheduling and by decoupling physical execution from logical sequence tracking.
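
To make the idea of confidence-aware verification scheduling more concrete, here is a minimal sketch of how a shared verification budget might be split across concurrent requests. It is not DSpark's actual scheduler, which this article does not describe in detail. The function name `allocate_verify_lengths`, the `budget` parameter, and the greedy rule are all hypothetical, and the sketch assumes each confidence score approximates an independent probability that the token will be accepted.

```python
import heapq


def allocate_verify_lengths(batch_confidences, budget):
    """Split a verification token budget across requests (illustrative only).

    batch_confidences: one list of draft-token confidences per request.
    budget: total number of draft tokens the verifier may check this step
            (a stand-in for a hardware-dependent limit).
    Returns the number of leading draft tokens to verify for each request.
    """
    lengths = [0] * len(batch_confidences)
    # heapq is a min-heap, so survival probabilities are stored negated.
    # Each entry is the chance that a request's next unverified position
    # is reached with every earlier draft token accepted.
    heap = []
    for req, confs in enumerate(batch_confidences):
        if confs:
            heapq.heappush(heap, (-confs[0], req))

    while heap and budget > 0:
        neg_survival, req = heapq.heappop(heap)
        lengths[req] += 1
        budget -= 1
        nxt = lengths[req]
        confs = batch_confidences[req]
        if nxt < len(confs):
            # Extending the prefix multiplies in the next token's confidence.
            heapq.heappush(heap, (neg_survival * confs[nxt], req))
    return lengths


confident = [0.95, 0.9, 0.9, 0.85]   # hypothetical easy request
uncertain = [0.9, 0.4, 0.8, 0.7]     # hypothetical hard request
print(allocate_verify_lengths([confident, uncertain], budget=5))  # [4, 1]
```

Because the survival probability of a prefix can only shrink as the prefix grows, the greedy choice always extends a request from its first unverified position, so every request ends up with a contiguous prefix. With a budget of 5, the confident request gets its full block verified while the uncertain one is cut off after a single token.
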
Online real-world testing shows significant throughput improvements across different engines and SLAs, with the scheduler adaptively allocating its verification budget according to workload. However, the framework currently has one limitation: when processing complex queries, the draft computation spent on the complete initial candidate block cannot be recovered.
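
One way to read this limitation is as a sunk cost: the draft model produces a full candidate block before the scheduler decides how much of it to verify. The sketch below illustrates that reading with a toy calculation. The helper names and the confidence values are hypothetical, and the independence assumption on per-token acceptance is a simplification, not something the article states.

```python
def expected_accepted(confidences, verify_length):
    """Expected number of draft tokens accepted when verifying a prefix.

    Assumes independent per-token acceptance probabilities, so the token at
    position k counts only if every earlier token was accepted as well.
    """
    total, survival = 0.0, 1.0
    for p in confidences[:verify_length]:
        survival *= p
        total += survival
    return total


def wasted_draft_fraction(block_size, verify_length):
    """Share of the drafted block that is never sent to the verifier."""
    return (block_size - verify_length) / block_size


hard_query = [0.5, 0.4, 0.3, 0.2]   # hypothetical low-confidence block
print(expected_accepted(hard_query, 1))           # 0.5
print(wasted_draft_fraction(len(hard_query), 1))  # 0.75
```

In this toy case, truncating verification to one token saves verifier compute, but three quarters of the drafted block is discarded even though the draft model already paid to produce it.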