
DeepSeek on June 27 released DSpark, an inference optimization framework that the company says makes its production V4-Flash model generate responses up to 85 percent faster than the prior single-token baseline — without retraining the model, changing its weights, or adding new hardware. The framework, built on a technique called speculative decoding, is now live across V4-Flash and V4-Pro and is available as open-source code under an MIT license. For any organization currently paying to serve DeepSeek V4 at scale, the speed gain is a direct reduction in compute cost per output token. For developers evaluating whether to self-host V4 weights, DSpark is the serving layer that makes that architecture look substantially more competitive.
What DeepSeek did not release is a new model. The Hugging Face model cards for DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark are explicit: both are the same checkpoint with an additional speculative decoding module attached.
Standard autoregressive generation forces a model to produce tokens one at a time. For a system like V4-Pro, which has 1.6 trillion total parameters with 49 billion activated per forward pass through its Mixture-of-Experts architecture, each token requires a full and expensive forward pass through the model. At high user concurrency, this sequential bottleneck exhausts GPU memory bandwidth before it exhausts compute, leaving processing units partially idle between steps.
Speculative decoding attacks the bottleneck differently. A small, fast draft model proposes a block of candidate tokens simultaneously. The full target model then verifies the entire block in a single forward pass. If the draft tokens are correct, the system gets multiple tokens for approximately the cost of one verification step. Crucially, the verification process uses rejection sampling to guarantee that the output distribution is mathematically identical to standard autoregressive generation — the output quality is unchanged.
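The accept-or-resample rule behind that guarantee can be sketched in a few lines of Python. This is a minimal illustration of textbook speculative sampling over toy probability lists, not DSpark's implementation; the function names (`sample`, `residual`, `verify_block`) and the list-based distributions are assumptions made for the example.

```python
import random

def sample(dist, rng):
    """Draw an index from a list of probabilities."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1  # guard against floating-point rounding

def residual(p, q):
    """Normalized max(0, p - q): the distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff] if total > 0 else list(p)

def verify_block(draft_tokens, draft_dists, target_dists, rng):
    """Return the tokens emitted by one draft-then-verify cycle.

    draft_tokens[i] was sampled from draft_dists[i] (the draft model, q).
    target_dists holds the target model's distributions (p) and has one
    extra entry, used for the bonus token when every draft is accepted.
    """
    out = []
    for tok, q, p in zip(draft_tokens, draft_dists, target_dists):
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # First rejection: emit a corrected token and stop the cycle.
            out.append(sample(residual(p, q), rng))
            return out
    # Every draft token was accepted, so the target adds one more token.
    out.append(sample(target_dists[len(draft_tokens)], rng))
    return out
```

Each cycle emits at least one token and at most one more than the draft length. Because a rejected position is resampled from the residual distribution, the emitted tokens follow the target distribution regardless of how good the draft is; a weak draft only costs speed.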
DSpark's contribution to this established technique comes in three engineering innovations that address real production failure modes.
The first is semi-autoregressive generation. Standard parallel draft models generate an entire block of tokens in one pass, treating each position independently. This speed advantage comes at a cost: each position ignores its neighbors, which causes what researchers call suffix decay — later tokens in the block become progressively less likely to be accepted because the draft had no information about what came before. DSpark addresses this with a hybrid architecture: a fast parallel backbone generates base probability estimates for every position, and a lightweight sequential Markov head then adds a token-conditioned bias before each token is sampled. The Markov head uses a low-rank factorization at rank 256 to stay computationally cheap even with large model vocabularies. Once position one produces a specific token, the Markov head boosts the conditional probabilities for tokens likely to follow it, reducing suffix decay without the full cost of sequential drafting.
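A toy version of that hybrid can make the idea concrete. The sketch below assumes the parallel backbone has already produced one list of logits per block position, and it models the Markov head as two small matrices, `U` and `V`, whose product gives a bias over the next token. Those names, the choice to leave the first position unbiased, and the tiny rank are illustrative assumptions; the article only states that the real head uses rank 256.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def markov_bias(prev_token, U, V):
    """Low-rank bias over the next token: U[prev_token] dotted with each row of V.

    U and V are vocab x rank matrices, so a full vocab x vocab transition
    table is never stored.
    """
    u = U[prev_token]
    return [sum(a * b for a, b in zip(u, v)) for v in V]

def draft_block(base_logits, U, V, rng):
    """Sample a draft block position by position.

    base_logits comes from the parallel backbone (one list per position).
    From the second position on, the Markov head adds a bias conditioned
    on the token that was just sampled.
    """
    tokens = []
    for logits in base_logits:
        if tokens:
            bias = markov_bias(tokens[-1], U, V)
            logits = [l + b for l, b in zip(logits, bias)]
        probs = softmax(logits)
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens
```

The sequential part is only a rank-sized dot product per vocabulary entry, which is why it stays cheap next to a full sequential draft pass.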
The second is confidence-scheduled verification. Standard speculative decoding sends all draft tokens to the target model for verification — even tokens the system already suspects will be rejected, wasting the verification budget on certain losses. DSpark adds a confidence head that scores each draft token's probability of acceptance. A hardware-aware prefix scheduler then uses those confidence scores to dynamically trim the verification length: when GPUs are idle, the scheduler verifies longer prefixes; when GPUs are under load, it cuts aggressively to the highest-confidence tokens. This adaptive trimming eliminates wasted verification cycles without compromising the lossless output guarantee.
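One way to picture the trimming step is a function that multiplies per-token confidences into a running "prefix still alive" probability and stops once it drops below a load-dependent bar. The sketch below is hypothetical: the linear threshold rule, the `gpu_load` scale, and the function name are assumptions for illustration, since the article does not describe DSpark's actual scheduling policy.

```python
def verification_length(confidences, gpu_load, min_len=1):
    """Choose how many draft tokens to send to the target model.

    confidences[i]: estimated probability that draft token i is accepted,
        given that all earlier draft tokens were accepted.
    gpu_load: 0.0 (idle) to 1.0 (saturated). Higher load raises the bar,
        so only high-confidence prefixes are verified.
    """
    # Hypothetical rule: the bar rises linearly from 0.1 (idle) to 0.9 (saturated).
    threshold = 0.1 + 0.8 * gpu_load
    survive = 1.0  # probability that the whole prefix so far is accepted
    length = 0
    for c in confidences:
        survive *= c
        if survive < threshold:
            break
        length += 1
    # Always verify at least min_len tokens, never more than were drafted.
    return min(len(confidences), max(min_len, length))
```

With confidences of 0.95, 0.9, 0.8, 0.5 and 0.3, this rule verifies all five tokens on an idle GPU, three at half load, and one at full load. Trimming only changes how many tokens are checked; each checked token still goes through the same verification, which is why the lossless guarantee is unaffected.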
The third is Zero-Overhead Scheduling, or ZOS. In a high-concurrency production environment, scheduling decisions themselves consume time that adds to latency. DSpark's scheduler operates asynchronously, using predictions from the prior two verification steps to determine the current optimal truncation length before the decision is needed. This allows continuous CUDA graph replay without stalls, hiding scheduling latency entirely.
The deployed configuration, called DSpark-5, uses a five-token draft block with the Markov head. In DeepSeek's internal production data, DSpark-5 improved per-user generation speed by 60 to 85 percent on V4-Flash and 57 to 78 percent on V4-Pro compared to the prior MTP-1 baseline, while keeping overall system throughput constant. The per-token latency equation from the paper is direct: L equals draft time plus verification time, divided by the number of accepted tokens per cycle. Because L is inversely proportional to that count, every additional accepted token lowers L as long as the draft and verification costs stay fixed.
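The latency relationship is simple enough to work through directly. In the sketch below, the function names and the millisecond figures are made up for illustration and are not taken from DeepSeek's paper.

```python
def per_token_latency(draft_ms, verify_ms, accepted_tokens):
    """L = (draft time + verification time) / accepted tokens per cycle."""
    if accepted_tokens <= 0:
        raise ValueError("accepted_tokens must be positive")
    return (draft_ms + verify_ms) / accepted_tokens

def speedup_percent(baseline_latency, new_latency):
    """Percentage gain in tokens per second when latency falls."""
    return (baseline_latency / new_latency - 1.0) * 100.0

# Hypothetical cycle: 2 ms to draft, 20 ms to verify.
two_accepted = per_token_latency(2.0, 20.0, 2)   # 11.0 ms per token
four_accepted = per_token_latency(2.0, 20.0, 4)  # 5.5 ms per token
gain = speedup_percent(two_accepted, four_accepted)  # 100.0 percent faster
```

Under these assumed costs, doubling the accepted tokens per cycle halves the per-token latency. In practice a longer draft block also raises draft and verification time, so the real gain depends on how the acceptance length grows relative to those costs.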
DeepSeek's paper — formally titled "DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation," co-authored by DeepSeek founder Liang Wenfeng with researchers at Peking University — reports offline benchmark results across mathematical reasoning, code generation, and dialogue on the Qwen3 model series. On Qwen3-4B, DSpark improved the macro-average accepted token length by 30.9 percent over Eagle3, the prior state-of-the-art autoregressive speculative drafter. On Qwen3-8B and Qwen3-14B, the improvements over Eagle3 were 26.7 percent and 30.0 percent, respectively. Against DFlash, a competing parallel draft framework, DSpark achieved improvements of 16.3 to 18.4 percent across the same model sizes. A 2-layer DSpark configuration outperformed a 5-layer DFlash — achieving a better acceptance rate with a smaller and computationally cheaper draft model.
All benchmark figures in these comparisons are self-reported by DeepSeek in the paper. No independent third-party verification of DSpark's acceptance-length or production speed claims has been published as of June 28, 2026. The accepted-token-length gains over Eagle3 are described by AI Weekly as worth testing, while that outlet notes that every speedup figure is benchmarked against DeepSeek's own prior technique on DeepSeek's own infrastructure.
One caveat on the throughput figures: the most dramatic aggregate numbers — gains of several hundred percent in peak-load scenarios — occur when the prior MTP-1 baseline is operating near its service-level agreement boundary and can sustain only a small concurrent batch. These figures reflect where DSpark extends the feasible serving frontier; they are not a universal deployment multiplier for every team running V4.
Alongside the DSpark framework, DeepSeek released DeepSpec — a full-stack codebase for training and evaluating speculative decoding draft models — under an MIT license on GitHub. DeepSpec is not an inference-only tool: it includes a three-stage pipeline covering data preparation, multi-GPU draft model training, and evaluation across nine benchmarks including GSM8K, MATH500, HumanEval, and LiveCodeBench. The repository currently bundles three draft model algorithms — DSpark, DFlash, and Eagle3 — and targets the Qwen3 and Gemma model families.
The significance of open-sourcing the full training stack, rather than just releasing pretrained weights, is that infrastructure teams can train custom draft models tuned to their own prompt distributions and hardware configurations, rather than relying on what DeepSeek releases. Community testing, confirmed by Daniel Han at Unsloth, found that DSpark trains cleanly on Qwen3 and Gemma targets, meaning teams not using DeepSeek models at all can apply the method.
There is a practical constraint to note. The default Qwen3-4B configuration in DeepSpec requires a single node with eight GPUs and approximately 38 terabytes of storage for the target cache. That requirement sets a real floor on who can run the full training pipeline today. Using the already-trained DSpark checkpoints via vLLM or SGLang is a separate and more accessible path that does not require the full training infrastructure.
The efficiency gains from DSpark carry a specific policy implication that the performance numbers do not make explicit. Since October 2022, the United States has relied on export controls restricting China's access to advanced Nvidia graphics processing units, including the H100 and later the H800, as a primary mechanism for limiting the pace of Chinese AI development. DeepSeek's history of training frontier-competitive models on export-restricted hardware at a fraction of Western costs already challenged that assumption. DSpark extends the challenge further: if a Chinese AI lab can achieve 60 to 85 percent faster per-user generation from existing chips through software optimization alone, the relationship between chip count and AI capability that the export-control strategy assumes becomes significantly less predictable.
The export-control approach assumes that constraining hardware access constrains capability at a roughly proportional rate. DSpark, like DeepSeek's earlier Mixture-of-Experts efficiency work, demonstrates that the rate is not fixed. This does not make export controls ineffective, but it does make them less reliably sufficient as a standalone policy tool — a distinction policymakers and legislators actively evaluating DeepSeek restrictions should weigh.
At the same time, DeepSeek's DSpark paper was co-authored with Peking University researchers. According to reporting by the New York Times, dozens of DeepSeek researchers have had affiliations with People's Liberation Army laboratories and universities known as the Seven Sons of National Defence. That research relationship does not make DSpark a state security tool, but it is context that enterprise buyers and policymakers should weigh when evaluating DeepSeek's open-source releases.
The speed story and the legal story are separable — and both apply to enterprise decision-making.
The speed story applies regardless of deployment model. DSpark's inference gains come from the serving layer, so they are available whether a team uses DeepSeek's hosted API or self-hosts the open weights, though the size of the gain remains self-reported. Teams self-hosting DeepSeek V4 weights on non-Chinese infrastructure remove the direct data-routing concern entirely, because user prompts do not reach DeepSeek's servers.
The legal story applies directly to the hosted API and cloud service. DeepSeek's privacy policy, as updated on February 10, 2026, explicitly states that personal data is collected, processed, and stored in the People's Republic of China. China's National Intelligence Law (2017), Article 7, requires all organizations and citizens to support, assist, and cooperate with national intelligence work — an obligation that applies to DeepSeek as a PRC-incorporated company. China's Cybersecurity Law (2017) additionally requires network operators to allow government spot-checks of their infrastructure. The PRC Data Security Law (2021) creates additional data localization and government-access obligations. These are legal conditions imposed by Chinese law, not claims about DeepSeek's corporate intentions. DeepSeek has denied intentional government data sharing, but that denial does not change what Chinese law requires.
In January 2025, cybersecurity firm Wiz Research disclosed that a publicly accessible DeepSeek database was exposing more than one million log entries, including plaintext chat histories and API authentication keys, with no authentication required. DeepSeek secured the database after notification. Feroot Security separately found code in DeepSeek's web infrastructure with backend links to a portal operated by China Mobile, a state-owned carrier that the Federal Communications Commission barred from US operations in 2019. South Korea's Personal Information Protection Commission confirmed in April 2025 that DeepSeek transmitted user prompt data to Chinese entities without obtaining prior consent from South Korean users.
Italy's Garante banned DeepSeek from distributing its app in January 2025; the ban remains in force. Australia, Taiwan, South Korea, and at least 17 US states have issued restrictions on government-device use. The US Navy, NASA, and the Department of Commerce have prohibited the service on official systems. Bipartisan federal legislation — the No DeepSeek on Government Devices Act (H.R.1121 / S.765) — was introduced in Congress in February 2025 and remains pending. On June 18, 2026, Representatives Josh Gottheimer and Darin LaHood introduced new legislation specifically naming DeepSeek's code connection to China Mobile as grounds for a federal ban.
It is worth being precise about scope. The data exposure and legal-access concerns described above apply to the hosted service — the web app and the API as served by DeepSeek from Chinese infrastructure. Developers who self-host the open model weights on infrastructure outside China avoid the direct data-routing risk. The National Intelligence Law obligations remain attached to DeepSeek as a company regardless of where its weights run, but users of self-hosted instances are not routing data through DeepSeek's servers.
No independent, named security audit of DSpark's weights or the DeepSpec codebase has been published as of the date of this article.
DSpark's release reflects a structural shift in how AI performance competition works. For most of 2024 and 2025, the primary metric was benchmark performance on reasoning and coding tasks. That competition has partially homogenized: multiple models from multiple providers now pass the same benchmarks at comparable quality levels. The differentiator is increasingly the infrastructure layer around the model — the API routing systems, inference schedulers, speculative decoding frameworks, and serving architectures that determine how quickly, cheaply, and reliably a model delivers its outputs at production scale.
Deloitte projected in November 2025 that inference workloads would account for roughly two-thirds of all AI compute in 2026, up from one-third in 2023. DSpark positions DeepSeek to capture a portion of that economic shift through software optimization rather than hardware acquisition, a pattern the company has now repeated across both the training side (efficiency through Mixture-of-Experts architecture) and the serving side (efficiency through speculative decoding).
OpenAI and Anthropic deploy speculative decoding or similar acceleration architectures in their own production serving stacks but have not open-sourced their full-stack inference toolchains. DeepSpec's MIT release puts a reproducible, auditable version of this engineering in the hands of the broader research community — and specifically in the hands of teams building on Qwen3 and Gemma who are not part of DeepSeek's model ecosystem at all. Whether the broader industry ultimately converges on DSpark's semi-autoregressive approach, or uses DeepSpec as the benchmark against which competing methods are measured, the release raises the technical floor for open-source inference optimization.
What is speculative decoding, and why does it make AI faster?
In standard large language model generation, the model produces one token at a time, each requiring a full forward pass through the model's billions of parameters. Speculative decoding breaks this sequential chain by using a small, fast draft model to propose several candidate tokens at once. The full target model then verifies the entire block in a single forward pass. Because the verification step uses rejection sampling to preserve exactly the same output distribution as standard generation, the output is statistically indistinguishable from what the model would have produced without speculative decoding. It is simply delivered faster when the draft model guesses correctly.
Does DSpark change DeepSeek V4's intelligence or output quality?
No. DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark use the same model weights as the standard V4 checkpoints. DSpark adds a speculative decoding module to accelerate the serving layer; it does not alter the model's parameters, training data, or output distribution. Losslessness is a mathematical property of the rejection sampling technique DSpark uses — not a configuration choice that could degrade quality.
What is the data security risk of using DeepSeek's hosted API?
When user data passes through DeepSeek's hosted cloud service — the API or the web app — it transits to servers in China and becomes subject to Chinese national law. China's National Intelligence Law (2017), Article 7, legally obligates all organizations to cooperate with state intelligence work on request. DeepSeek's own privacy policy confirms data is stored in the PRC. These are structural legal conditions, not claims about DeepSeek's corporate intentions. Organizations handling sensitive data, protected health information, financial records, or defense-adjacent information should treat the hosted API as incompatible with their data governance obligations. Self-hosting the open model weights on infrastructure outside China eliminates the direct data-routing exposure, though it does not change the legal obligations of the company that produced the model.
Can DSpark be used with AI models other than DeepSeek V4?
Yes. Community testing, confirmed by Daniel Han at Unsloth, found that DSpark trains cleanly on the Qwen3 and Gemma model families, and the DeepSpec repository ships configuration files for both. Teams looking to apply speculative decoding to their own fine-tuned variants of Qwen3 or Gemma can use DeepSpec's training pipeline to build custom draft models tuned to their own prompt distribution. The main practical constraint is the resource requirement: the default Qwen3-4B training configuration requires approximately 38 terabytes of storage for the target cache, plus an 8-GPU training node.
