DeepSeek and Peking University Jointly Open-Source DSpark: Boosting High-Concurrency Inference Speed by 60% to 85%

22 hours ago

Author: Editor

DeepSeek and Peking University have released DSpark, an inference acceleration framework built to serve large language models efficiently in high-concurrency production environments. It is already deployed in the preview service engines for DeepSeek-V4-Flash and DeepSeek-V4-Pro. Compared with MTP-1, a speculative decoding baseline that drafts a single token, DSpark raises per-user generation speed by 60% to 85% at equivalent throughput. The paper and training code are publicly available on GitHub.

DSpark targets two bottlenecks in speculative decoding: the quality of the drafted candidates and the compute consumed by the verification phase. To address them it introduces a semi-autoregressive architecture and a confidence-aware scheduled verification mechanism. In the semi-autoregressive architecture, a parallel backbone network produces the hidden states and base logits for all candidate positions in a single pass, and a lightweight sequential module then adds prefix dependency information, which significantly improves parameter efficiency. The confidence-aware mechanism uses a hardware-aware prefix scheduler to adjust the verification length dynamically according to the confidence scores of the candidates, so that verification compute is spent where it is most likely to pay off.

In offline benchmarks, DSpark achieves a higher average acceptance length per round than the autoregressive draft model Eagle3 and the parallel draft model DFlash on mathematical reasoning, code generation, and everyday conversation tasks. For production deployment, DSpark's draft model uses a specialized architecture and system-level optimizations to reduce communication complexity and compute and memory overhead. Engineering constraints are handled through asynchronous scheduling and by decoupling physical execution from logical sequence tracking.
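The "average acceptance length per round" metric used in the offline comparison can be illustrated with a minimal sketch. It assumes greedy verification, where a drafted token is accepted only if it matches the target model's own choice at that position and everything after the first mismatch is discarded. The article does not state which acceptance rule DSpark uses or whether the metric counts the extra token produced by the target model, so the function names, the token IDs, and the counting convention below are all illustrative assumptions.

```python
def accepted_length(draft_tokens, target_tokens):
    """Count the leading draft tokens that match the target model's tokens.

    Assumption: greedy verification. Acceptance stops at the first mismatch.
    """
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted


def average_acceptance_length(rounds):
    """Average accepted draft tokens over (draft_tokens, target_tokens) rounds."""
    lengths = [accepted_length(draft, target) for draft, target in rounds]
    return sum(lengths) / len(lengths)


# Hypothetical token IDs for two verification rounds.
rounds = [
    ([11, 42, 7], [11, 42, 9]),  # first two candidates accepted
    ([5, 6], [8, 6]),            # first candidate rejected, so none accepted
]
print(average_acceptance_length(rounds))
```

A longer average acceptance length means each verification pass of the large model yields more tokens, which is why draft quality is one of the two bottlenecks the framework targets.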
Online tests on live traffic show significant throughput improvements across different engines and SLAs, and the scheduler adapts its verification budget to the workload. One limitation remains: for complex queries, the cost of drafting the complete initial candidate block cannot be recovered.
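One way to picture confidence-based verification budgeting, and why the drafting cost is a sunk cost, is the toy model below. It is a sketch under stated assumptions, not the scheduler described in the paper: confidence scores are treated as independent acceptance probabilities, the cost model (`verify_fixed`, `verify_per_token`, `draft_cost`) is hypothetical, and the policy simply picks the prefix length that maximizes expected tokens per unit of cost.

```python
from itertools import accumulate
from operator import mul


def tokens_per_cost(confidences, k, verify_fixed=1.0,
                    verify_per_token=0.1, draft_cost=0.0):
    """Expected tokens per unit cost when verifying the first k candidates.

    Assumptions: candidate i survives only if all earlier candidates do, and
    the target model always contributes one token per round. draft_cost is
    paid for the whole drafted block regardless of k.
    """
    # Running products give P(first i candidates all accepted); their sum is
    # the expected number of accepted draft tokens.
    expected_accepted = sum(accumulate(confidences[:k], mul))
    cost = draft_cost + verify_fixed + verify_per_token * k
    return (1.0 + expected_accepted) / cost


def choose_verify_length(confidences, **cost_model):
    """Pick the prefix length in 0..n with the best expected tokens per cost."""
    candidates = range(len(confidences) + 1)
    return max(candidates,
               key=lambda k: tokens_per_cost(confidences, k, **cost_model))


print(choose_verify_length([0.9, 0.8, 0.2, 0.1]))  # verification is truncated
print(choose_verify_length([0.05, 0.05]))          # nothing is worth verifying
print(tokens_per_cost([0.05, 0.05], 0, draft_cost=0.3))
```

In this toy model, a low-confidence block leads the policy to verify nothing, yet the rate with a nonzero `draft_cost` still falls below the 1.0 tokens per unit cost of plain decoding. That mirrors the stated limitation: shortening verification saves verification compute but cannot refund the work already spent drafting the block.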
