DeepSeek Makes Another Major Move, Boosting Inference Speed by Up to 85%: How Did They Do It?
Author: Editor

On June 27, the DeepSeek team, in collaboration with Peking University, released a technical report unveiling the DSpark framework and the DeepSpec full-stack codebase. Building on the existing DeepSeek-V4-Pro and DeepSeek-V4-Flash models, the update adds DSpark, a server-side speculative decoding module focused on engineering deployment efficiency.

DSpark uses a semi-autoregressive generation architecture that combines a parallel backbone network with lightweight serial modules, addressing the tendency of parallel draft models to see their acceptance rates fall during long-sequence generation. It also introduces a confidence-based verification scheduling mechanism that dynamically adjusts the verification length according to hardware status and concurrency pressure, so that compute is not spent verifying draft tokens that are unlikely to be accepted. The framework has been deployed in the DeepSeek-V4 online service system. At equivalent system throughput, single-user generation speed increased by 60%-85% for V4-Flash and by 57%-78% for V4-Pro, with no loss in output quality.

The accompanying open-source DeepSpec codebase is released under the MIT license and provides end-to-end tools for data preparation, model training, and evaluation. It includes three draft model algorithms (DSpark, DFlash, and Eagle3) and is compatible with mainstream foundation models such as Qwen3 and Gemma. The release lowers the barriers to private deployment and online serving of large models, and is expected to accelerate large-scale adoption in areas such as intelligent agents, industrial code generation, and financial sentiment analysis.
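To make the draft-then-verify idea concrete, here is a minimal Python sketch of speculative decoding with a confidence-based verification length. It is an illustration under stated assumptions, not DSpark's actual implementation: the function names, the joint-confidence threshold, and the load rule in `budget_for_load` are all hypothetical, and the target model is called token by token for readability, whereas a real system would score all drafted positions in one parallel forward pass.

```python
from typing import Callable, List, Sequence


def choose_verify_length(confidences: Sequence[float],
                         min_joint_conf: float,
                         budget: int) -> int:
    """Return how many drafted tokens to send to the target model.

    Keeps the longest prefix whose joint confidence (the product of the
    per-token draft probabilities) stays at or above min_joint_conf,
    capped at `budget` tokens. The threshold rule is an assumption.
    """
    joint = 1.0
    length = 0
    for p in confidences[:budget]:
        joint *= p
        if joint < min_joint_conf:
            break
        length += 1
    return length


def budget_for_load(active_requests: int,
                    max_budget: int = 8,
                    min_budget: int = 1) -> int:
    """Hypothetical load rule: shrink the per-request verification
    budget as the number of concurrent requests grows."""
    return max(min_budget, max_budget // max(1, active_requests))


def verify_greedy(prefix: Sequence[int],
                  draft: Sequence[int],
                  target_next: Callable[[List[int]], int]) -> List[int]:
    """Greedy verification: accept drafted tokens while they match the
    target model's own next-token choice. On the first mismatch, emit
    the target's token instead and stop. If every drafted token is
    accepted, append one extra token from the target."""
    context = list(prefix)
    output: List[int] = []
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            output.append(expected)  # correction from the target model
            return output
        output.append(tok)
        context.append(tok)
    output.append(target_next(context))  # bonus token after full acceptance
    return output


if __name__ == "__main__":
    # Toy target model (assumption): next token is len(context) mod 3.
    def toy_target(ctx: List[int]) -> int:
        return len(ctx) % 3

    draft_tokens = [1, 2, 5, 1]
    draft_conf = [0.9, 0.9, 0.5, 0.9]

    budget = budget_for_load(active_requests=2)
    n = choose_verify_length(draft_conf, min_joint_conf=0.6, budget=budget)
    print(n, verify_greedy([0], draft_tokens[:n], toy_target))
```

In this toy run the low-confidence third draft token is never sent for verification, which is the intended effect: under higher concurrency or lower draft confidence, fewer positions are verified per step, so the target model's compute goes to tokens that are more likely to be accepted.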