DeepSeek releases DSpark: speculative decoding system for LLM inference acceleration
DeepSeek published DSpark, a paper describing a speculative decoding system designed to accelerate LLM inference. The paper is hosted on DeepSeek's GitHub and drew substantial Hacker News engagement (598 points, 228 comments), which points to real community interest. Speculative decoding is an active area of inference optimization, and a release from DeepSeek carries weight given the company's track record on inference efficiency.
Related events (8)
DeepSeek releases DeepSeek-V4-Pro-DSpark on Hugging Face
DeepSeek has published a new model checkpoint, DeepSeek-V4-Pro-DSpark, on Hugging Face under the text-generation category. The model uses the deepseek_v4 architecture and supports FP8 and 8-bit quantization formats. The 'DSpark' suffix suggests a variant or specialized version of the DeepSeek V4 Pro line, though no accompanying technical documentation is visible in this listing.
DeepSeek releases DeepSeek-V4-Flash-DSpark on Hugging Face
DeepSeek has published a new model checkpoint, DeepSeek-V4-Flash-DSpark, on Hugging Face under the deepseek_v4 model family. The release is tagged as a text-generation model with FP8 and 8-bit support, suggesting an efficiency-optimized variant. The 'Flash' and 'DSpark' naming implies a faster or distilled derivative of the DeepSeek V4 flagship. Download counts are near zero, indicating a very recent upload.
DeepSeek releases EAGLE3 speculative decoding draft model for Qwen3-4B
DeepSeek published eagle3_qwen3_4b_ttt7 on Hugging Face, a draft model for EAGLE3 speculative decoding targeting the Qwen3-4B base model. EAGLE3 is the third generation of the EAGLE speculative decoding framework, which accelerates inference by having a lightweight draft model propose future tokens for the target model to verify. The release is a narrow inference-optimization artifact with zero downloads and likes at time of indexing, suggesting it is very fresh or experimental.
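The draft-then-verify idea behind speculative decoding can be sketched in a few lines. This is a minimal greedy illustration, not the EAGLE3 or DSpark algorithm: `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical names, and the "models" are toy functions that map a token sequence to the next token.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: token sequence in, next token out (greedy).
NextToken = Callable[[List[int]], int]

def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify loop; output matches greedy_decode(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position. A real system would score
        #    all positions in one batched forward pass; this loop is for clarity.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if guess != expected:
                break  # first mismatch ends this round
        tokens.extend(accepted)
    return tokens

def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 17

def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except after tokens divisible by 5.
    t = (seq[-1] * 3 + 1) % 17
    return t if seq[-1] % 5 else (t + 1) % 17

if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [2], 12))
```

Because every kept token is the target's own choice, the output is identical to plain greedy decoding with the target; the speedup in practice comes from verifying several drafted positions per target pass. The sketch omits the extra "bonus" token that many implementations emit when a whole proposal is accepted, as well as sampling-based acceptance rules.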
DeepSeek releases EAGLE3 speculative decoding draft model for Qwen3-8B
DeepSeek published eagle3_qwen3_8b_ttt7 on Hugging Face, a draft model for EAGLE3 speculative decoding targeting the Qwen3-8B base model. EAGLE3 is the third generation of the EAGLE speculative decoding framework, which accelerates inference by having a lightweight draft head propose future tokens for the target model to verify. The release is a narrow inference-optimization artifact with minimal engagement at time of indexing.
DeepSeek Releases V3.2-Exp with Sparse Attention Architecture and 50%+ API Price Cut
DeepSeek has released DeepSeek-V3.2-Exp, an experimental model built on V3.1-Terminus that introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism designed to improve long-context performance and reduce compute costs during training and inference. Benchmarks indicate V3.2-Exp performs on par with V3.1-Terminus while achieving efficiency gains. The release is accompanied by a 50%+ API price reduction effective immediately, open-weights release on Hugging Face, a technical report, and GPU kernel code in TileLang and CUDA.
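To make the general idea of sparse attention concrete, here is a minimal single-query sketch in which the query attends only to its k highest-scoring keys. It is a generic top-k illustration under stated assumptions, not DeepSeek Sparse Attention itself: the summary above does not describe how DSA selects positions, and `topk_sparse_attention` is a hypothetical name.

```python
import math
from typing import List

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def topk_sparse_attention(q: List[float], keys: List[List[float]],
                          values: List[List[float]], k: int) -> List[float]:
    """Attend from one query to only the k best-matching key positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, key) * scale for key in keys]
    # Keep the indices of the k largest scores (assumed selection rule).
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the kept positions.
    m = max(scores[i] for i in keep)
    weights = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(weights.values())
    dim = len(values[0])
    return [sum(weights[i] / total * values[i][d] for i in keep)
            for d in range(dim)]

if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(topk_sparse_attention(q, keys, values, k=2))
```

Note that this sketch still scores every key before selecting, so it saves work only in the softmax and value mixing; a practical system would need a cheaper way to choose which positions to keep.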
DeepSeek releases EAGLE3 speculative decoding draft model for Qwen3-14B
DeepSeek published eagle3_qwen3_14b_ttt7 on Hugging Face, a draft model for EAGLE3 speculative decoding targeting the Qwen3-14B base model. EAGLE3 is the third generation of the EAGLE speculative decoding framework, designed to accelerate inference. The release is a narrow infrastructure artifact with zero downloads and likes at time of indexing, suggesting it is very early or experimental.
DeepSeek Reasonix: Native Coding Agent with High Caching and Low Cost
DeepSeek Reasonix is a coding agent built natively on DeepSeek models, emphasizing high prompt-cache hit rates and low inference cost. The project drew substantial Hacker News engagement (349 points, 171 comments), suggesting community interest in cost-efficient agentic coding workflows. It appears to be an open-source or community-developed tool rather than an official DeepSeek release.
SimSD: Speculative Decoding Adapted for Diffusion Language Models
SimSD introduces a training-free speculative decoding algorithm for diffusion large language models (dLLMs), which previously could not use standard token-level speculative decoding due to their bidirectional attention and masked language modeling formulation. The method uses a plug-and-play masking strategy that introduces reference tokens from a draft model and a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs across four benchmarks, SimSD achieves up to 7.46x decoding throughput improvement while maintaining or improving generation quality. The approach is compatible with other acceleration techniques such as KV cache and blockwise decoding.


