DeepSeek and Peking University Release DSpark: 60%-85% Faster Inference for V4 Models
On June 27, DeepSeek, in collaboration with Peking University, released DSpark, a speculative decoding framework that accelerates inference for the DeepSeek-V4 series. It is not a new model but an engineering optimization layered on top of the existing V4-Pro (1.6T parameters) and V4-Flash (284B parameters) checkpoints.
What It Is (and Isn't)
The Hugging Face model card is explicit: "DeepSeek-V4-Pro-DSpark is not a new model. It is the same checkpoint with an additional speculative decoding module attached." The API model IDs remain unchanged — the official API still serves deepseek-v4-pro and deepseek-v4-flash. DSpark is primarily for self-hosted deployments.
Performance Numbers
DeepSeek reports the following improvements from production traffic (not lab benchmarks):
- Per-user generation speed: 60%-85% faster on Flash, 57%-78% faster on Pro
- System throughput: 51% to 400% improvement depending on concurrency level
- Outperforms prior methods including Eagle-3 and DFlash
The framework already serves DeepSeek's live production traffic.
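Percent "faster" figures are easy to misread as percent time saved. The sketch below converts between the two, assuming that "X% faster" means tokens per second rise by X% (the article does not define the metric more precisely). The function names are illustrative only.

```python
def speed_multiplier(percent_gain: float) -> float:
    """Turn an 'X% faster' figure into a throughput multiplier."""
    return 1.0 + percent_gain / 100.0


def time_saved_fraction(percent_gain: float) -> float:
    """Fraction of time per token saved at the given speed gain."""
    return 1.0 - 1.0 / speed_multiplier(percent_gain)


if __name__ == "__main__":
    # Figures quoted in the list above; the interpretation is an assumption.
    for label, gain in [("Flash, low end", 60), ("Flash, high end", 85),
                        ("Pro, low end", 57), ("Pro, high end", 78),
                        ("Throughput, low end", 51), ("Throughput, high end", 400)]:
        print(f"{label}: {speed_multiplier(gain):.2f}x, "
              f"{time_saved_fraction(gain):.1%} less time per token")
```

Under that reading, a 60% speed gain means 37.5% less time per token, and a 400% throughput gain means a 5x multiplier rather than 4x.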
How It Works
Autoregressive LLMs generate one token at a time. Each step requires a full forward pass through the target model — expensive for a 1.6T-parameter MoE architecture with 49B activated parameters per token.
DSpark adds a small "draft model" that generates a block of candidate tokens in parallel. The target model then verifies the whole block in a single pass. When several candidates are accepted, one target forward pass yields multiple output tokens; rejected candidates are discarded and generation falls back to the target model's own predictions. DeepSeek calls this approach "semi-parallel" generation with adaptive verification.
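The draft-then-verify loop can be sketched with toy models. This is a minimal illustration of generic greedy speculative decoding, not DSpark's actual algorithm: target_next and draft_next are hypothetical stand-ins, the toy draft proposes tokens one at a time rather than in parallel, and the inner verification loop stands in for what a real engine would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the large target model (greedy next token)."""
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical draft model: matches the target except when the
    context length is a multiple of 5."""
    guess = target_next(ctx)
    return (guess + 1) % 11 if len(ctx) % 5 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       n_new: int, block_size: int = 4) -> Tuple[List[Token], int]:
    """Return (tokens, number of verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1) The draft proposes a block of candidate tokens.
        proposal: List[Token] = []
        for _ in range(block_size):
            proposal.append(draft(out + proposal))
        # 2) One verification pass (counted once; batched in a real engine).
        passes += 1
        accepted: List[Token] = []
        for cand in proposal:
            expected = target(out + accepted)
            if cand != expected:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(cand)
        else:
            # Every candidate matched: the same pass also yields one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode(target_next, draft_next, [1, 2, 3], 20)
    print(tokens == greedy_decode(target_next, [1, 2, 3], 20), passes)
```

Because every kept token is one the target itself would have chosen, the output matches plain greedy decoding; the saving comes from needing fewer verification passes than output tokens.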
Open Source: DeepSpec
Alongside the research paper (arXiv:2606.19348), DeepSeek open-sourced the full-stack training and evaluation codebase, named DeepSpec, under the MIT license.
DeepSpec supports three draft algorithms:
- DSpark: the new release
- DFlash: a block-diffusion-style draft approach
- Eagle3: an existing draft method from a third-party line of work
The codebase covers the full pipeline from data preparation to training to evaluation, including configuration files for Qwen3-4B and Gemma4-12B. DeepSeek has tested DSpark on Qwen and Gemma, suggesting the technique generalizes beyond the V4 family.
The V4-Pro-DSpark weights are available on Hugging Face, along with inference examples.
Deployment
The Hugging Face card documents integration paths with vLLM and SGLang:
# vLLM
vllm serve "deepseek-ai/DeepSeek-V4-Pro-DSpark"

# SGLang
python3 -m sglang.launch_server \
    --model-path "deepseek-ai/DeepSeek-V4-Pro-DSpark" \
    --host 0.0.0.0 --port 30000
Note: the DSpark weights do not ship with a Jinja chat template. Format messages with DeepSeek's encoding_dsv4 Python helpers instead; apply_chat_template() may mis-tokenize reasoning-mode prompts.
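One way to work around the missing chat template is to send an already-formatted prompt to the server's raw completions endpoint instead of its chat endpoint. The sketch below assumes the server exposes the OpenAI-compatible /v1/completions route that vLLM and SGLang both provide, and that the prompt string was produced beforehand with the encoding_dsv4 helpers (their API is not shown in this article, so the prompt is treated as an opaque input). build_completion_request and send are illustrative names, not part of any library.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible completions route.

    `prompt` is assumed to be pre-formatted; no chat template is applied here.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Port 30000 matches the SGLang command above; vLLM defaults to port 8000.
    req = build_completion_request(
        "http://localhost:30000",
        "deepseek-ai/DeepSeek-V4-Pro-DSpark",
        prompt="<pre-formatted prompt goes here>",
    )
    print(req.full_url)
```

Calling send(req) requires a running server; building the request separately makes the payload easy to inspect first.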
Caveats
- The 51%-400% throughput range is wide. Actual gains depend on your prompt distribution, hardware, and concurrency, so measure on your own workload before planning capacity around the top of the range.
- This is a speed optimization, not a quality upgrade. When drafts are rejected, you still pay the verification overhead, and in the worst case decoding can be slower than the baseline.
- V4-Pro is an MoE model with 1.6T total parameters, 49B activated per token, and a 1M-token context. The draft module adds to VRAM requirements that are already significant.
- DeepSpec's data preparation is datacenter-scale. The default cache for a Qwen3-4B target requires ~38 TB of storage.
- Engine support for new draft modules often takes time. Pin your vLLM/SGLang versions and read release notes before cutting over.
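The rejection caveat above can be made concrete with a back-of-the-envelope model. The sketch assumes each draft token is accepted independently with probability alpha, which is a textbook simplification for speculative decoding and not something DeepSeek reports; the cost values are made-up illustrations, expressed in units of one baseline decode step.

```python
def expected_tokens_per_pass(alpha: float, block_size: int) -> float:
    """Expected output tokens from one verification pass.

    Assumes independent per-token acceptance with probability alpha; the
    pass always yields at least one token (the target's own prediction).
    """
    if alpha >= 1.0:
        return float(block_size + 1)
    return (1.0 - alpha ** (block_size + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, block_size: int,
                      verify_cost: float, draft_cost: float) -> float:
    """Tokens per unit cost relative to baseline decoding (1 token per unit).

    verify_cost and draft_cost are hypothetical per-block costs, measured
    in baseline decode steps.
    """
    return expected_tokens_per_pass(alpha, block_size) / (verify_cost + draft_cost)


if __name__ == "__main__":
    for alpha in (0.1, 0.5, 0.8, 0.95):
        s = estimated_speedup(alpha, block_size=4, verify_cost=1.2, draft_cost=0.1)
        print(f"acceptance={alpha:.2f} -> estimated speedup {s:.2f}x")
```

With these illustrative costs, a low acceptance rate pushes the estimate below 1x, which is the "slower than baseline" case; measuring acceptance on your own prompts is the only way to know where you land.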
Who Should Care
If you self-host V4-Pro or V4-Flash and pay per GPU-hour, DSpark lets you serve more requests on the same hardware — especially for high-throughput workloads like batch evals, synthetic data generation, or concurrent chat.
If you only use DeepSeek's hosted API, no action is needed. Any speedup arrives through DeepSeek's own backend, and the API model IDs haven't changed, so there's nothing to reconfigure on your end.




