Almanac
Guide · In-depth

DeepSeek V4: Open-Weights Frontier MoE with 1M-Token Context and Aggressive Pricing

DeepSeek V4In-depthactive·v1 · live·generated 6d ago

Part of these paths

TL;DRDeepSeek V4 is the latest generation of DeepSeek's open-weights model family, arriving as two Mixture-of-Experts variants — a flagship Pro and a cost-optimized Flash — both built around a novel sparse attention architecture that makes one-million-token context windows practical for agentic workloads. The release continues DeepSeek's pattern of pairing frontier-class capability with aggressive, downward-pressuring pricing, while landing in a geopolitically charged environment where the lab's distillation practices and hardware partnerships have drawn scrutiny from U.S. regulators and competitors alike.

Key takeaways

  • V4-Pro packs 1.6T total / 49B active parameters; V4-Flash runs 284B total / 13B active — both default to 1M-token context via DeepSeek Sparse Attention (DSA) and Token-wise compression.
  • V4-Pro claims open-source SOTA on agentic coding benchmarks; an independent industry analysis noted it trails leading open and closed models on aggregate benchmarks.
  • DeepSeek made its V4 Pro price cut permanent at 75% off, continuing a multi-generation pattern of halving inference costs with each major release.
  • All four V4 variants (Pro, Pro-Base, Flash, Flash-Base) were open-sourced on Hugging Face with FP8 and 8-bit quantization support on April 22, 2026.
  • DeepSeek gave Huawei weeks of pre-release hardware-optimization access to V4 while denying the same to Nvidia and AMD — a signal of deepening geopolitical supply-chain fragmentation.
  • Anthropic publicly accused DeepSeek of conducting industrial-scale distillation attacks against Claude via ~24,000 fraudulent accounts, framing the practice as a national security concern.

What DeepSeek V4 is

DeepSeek V4 is the fourth major generation of DeepSeek's open-weights large language model series, released as a preview in two Mixture-of-Experts (MoE) configurations: V4-Pro (1.6T total parameters, 49B active per token) and V4-Flash (284B total, 13B active). Both variants ship with a one-million-token context window enabled by default — a capability made practical by two architectural innovations introduced in the V3.x line and matured here: DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism, and Token-wise compression. The API is live with OpenAI and Anthropic format compatibility, and all four weight variants (Pro, Pro-Base, Flash, Flash-Base) were released on Hugging Face on April 22, 2026, with FP8 and 8-bit quantization support.

Lineage and architectural evolution

V4 is the culmination of a rapid iterative sequence. DeepSeek-V3 (671B / 37B active, trained on 14.8T tokens) established the lab's MoE baseline and introduced Multi-head Latent Attention (MLA), which compresses the KV cache enough to enable disk-based context caching — a 90% cost reduction on cache hits. V3.1 added hybrid think/non-think inference and agent tool-use improvements. V3.2-Exp introduced DSA experimentally alongside a 50%+ API price cut. V3.2 and V3.2-Speciale integrated chain-of-thought reasoning directly into tool-use workflows, trained on a new agent data synthesis pipeline covering 1,800+ environments and 85k+ complex instructions. V4 consolidates these advances and scales them into the 1M-token regime.

The parallel R1 reasoning line — which achieved parity with OpenAI o1 on math, code, and reasoning benchmarks under a permissive MIT license — fed back into the V-series through the V3.1 hybrid thinking mode and informs V4's agentic reasoning posture.

Capability claims and independent assessment

DeepSeek claims V4-Pro achieves open-source state-of-the-art on agentic coding benchmarks and "world-class" math/STEM/coding performance rivaling top closed-source models. V4-Flash is positioned as near-parity reasoning at lower cost and latency. However, an independent industry analysis (The Batch, April 2026) noted that V4 "trails leading open and closed models on aggregate benchmarks" — a reminder that benchmark selection matters significantly when evaluating these claims. Practitioners should treat the agentic coding claims as domain-specific rather than general-purpose superiority.

For context, contemporaneous open-weights competitors include Kimi K2.6 (1T params / 32B active, 256K context, scoring 54 vs. V4-Pro's 52 on the Artificial Analysis Intelligence Index) and Qwen3-235B-A22B, which claims competitive performance against DeepSeek-R1 on coding and math.

Pricing strategy

DeepSeek's pricing trajectory is as notable as its architecture. V3 launched at $0.27/$1.10 per million input/output tokens. V3.2-Exp came with a 50%+ cut. V4 Pro received a 75% permanent price reduction — confirmed in May 2026 after initially appearing temporary. This sustained downward pressure on inference pricing has forced competitive responses across the market and is a defining characteristic of DeepSeek's go-to-market approach.

Geopolitical and supply-chain context

V4 arrived in a charged environment. Before public release, DeepSeek gave Huawei several weeks of pre-release hardware-optimization access for V4 while denying equivalent access to Nvidia and AMD — a deliberate signal of alignment with China's domestic chip ecosystem. Reuters reported (with unverified sourcing) that a Trump administration official claimed V4 was trained on Nvidia's most advanced chips despite U.S. export controls.

More consequentially for practitioners, Anthropic publicly accused DeepSeek of conducting industrial-scale distillation attacks against Claude: generating over 16 million exchanges through approximately 24,000 fraudulent accounts to harvest Claude's outputs — targeting agentic reasoning, tool use, coding, and chain-of-thought generation specifically. A separate ChinaTalk/CISPA report documented a broader gray-market API proxy ecosystem that feeds such training pipelines. The White House acknowledged the distillation threat in an April 2026 memo. These accusations do not change V4's technical properties, but they are material context for organizations evaluating supply-chain and compliance risk when deploying or fine-tuning open-weights models derived from this lineage.

Safety and alignment considerations

Research published in June 2026 found that fine-tuning models (including DeepSeek-V3.1) on verbatim-generation tasks can re-enable memorized text strings suppressed by alignment training, achieving up to 91.9% verbatim book reproduction — a finding with direct implications for organizations offering fine-tuning APIs on top of V4 weights. Separately, a cross-lingual behavioral audit found DeepSeek-R1 becomes less coercive when operating in Turkish versus English in adversarial geopolitical simulations, suggesting language-dependent behavioral variation that practitioners deploying in multilingual contexts should probe.

Ecosystem and deployment

The V4 API maintains OpenAI and Anthropic format compatibility, lowering migration friction. Legacy V3-series endpoints are scheduled for retirement in July 2026. The open-weights release supports self-hosted deployment with FP8 precision and standard inference frameworks. Community uptake was rapid: V4-Pro accumulated over 4.3 million Hugging Face downloads and V4-Flash-Base over 66,000 downloads shortly after release.

The broader DeepSeek ecosystem — including the R1 reasoning line, disk-based context caching, and the V3.x agent data synthesis pipeline — positions V4 as infrastructure for agentic workloads rather than a chat-first model, consistent with the lab's stated framing of V3.1 as "the first step toward the agent era."

Where it's heading

The retirement of legacy endpoints in July 2026 and the permanent pricing cuts suggest DeepSeek is consolidating around V4 as its production baseline. The DSA architecture introduced in V3.2-Exp and carried into V4 is the technical foundation for whatever comes next in long-context scaling. The geopolitical trajectory — Huawei optimization, export control scrutiny, distillation accusations — will likely constrain or shape how V4 and its successors are received in enterprise and government procurement contexts outside China, regardless of raw benchmark performance.

DeepSeek V-series architectural evolution toward V4

DeepSeek V-series lineage at a glance

ModelParams (total / active)ContextKey capabilityPricing signal
V3671B / 37B60 tok/s; open-source frontier baseline$0.27/$1.10 per M tok
V3.1128KHybrid think/non-think; agent tool-use
V3.2-ExpIntroduces DSA; 50%+ API price cut50%+ cut vs V3.1
V3.2 / V3.2-SpecialeCoT in tool-use; gold-medal math (Speciale)
V4-Flash284B / 13B1MNear-parity reasoning, lower cost/latency
V4-Pro1.6T / 49B1MAgentic coding SOTA (claimed); DSA75% permanent cut

Cells marked — indicate the events bundle does not disclose that value. Active-parameter counts reflect MoE sparse activation.

Timeline

  1. V4-Pro, V4-Pro-Base, V4-Flash, V4-Flash-Base released on Hugging Face

  2. Industry analysis notes V4 trails leading models on aggregate benchmarks

  3. DeepSeek makes V4 Pro 75% price cut permanent

  4. Anthropic accuses DeepSeek of industrial-scale distillation attacks via ~24,000 fraudulent accounts

Related topics

Hugging FaceAnthropicOpenAINVIDIAMoonshot AIOpen R1DeepSeek-R1-0528

FAQ

What are the two V4 variants and when should I choose each?

V4-Pro (1.6T total / 49B active parameters) targets maximum capability on agentic coding and complex reasoning; V4-Flash (284B total / 13B active) offers near-parity performance at lower cost and latency, suitable for high-throughput or latency-sensitive workloads.

How does DeepSeek Sparse Attention (DSA) enable 1M-token context?

DSA is a fine-grained sparse attention mechanism introduced experimentally in V3.2-Exp and carried into V4; combined with Token-wise compression, it reduces the compute and memory cost of attending over very long sequences, making 1M-token context the default rather than an expensive option.

Is DeepSeek V4 truly open-source?

The weights for all four V4 variants (Pro, Pro-Base, Flash, Flash-Base) are publicly available on Hugging Face with FP8 and 8-bit quantization support, though 'open-source' licensing terms should be verified against the model cards directly.

How does V4 benchmark against closed-source frontier models?

DeepSeek claims open-source SOTA on agentic coding benchmarks for V4-Pro, but an independent industry analysis in The Batch noted V4 trails leading open and closed models on aggregate benchmarks — the picture is benchmark-dependent.

What is the distillation controversy around DeepSeek?

Anthropic publicly accused DeepSeek (along with Moonshot AI and MiniMax) of generating over 16 million exchanges through ~24,000 fraudulent accounts to harvest Claude's outputs for training data — a practice Anthropic frames as a national security concern that strips safety safeguards from the resulting models.

Stay current

Call Me Almanac pairs the week's AI news with guides like this one — Midweek & Sunday.

Versions

  • v1live6d ago

Related guides (4)

More on DeepSeek V4 (6)

6Hugging Face Blog·1mo ago·source ↗

DeepSeek-V4: a million-token context that agents can actually use

A Hugging Face blog post discusses DeepSeek-V4, highlighting its million-token context window as a practically usable capability for agentic applications. The post appears to analyze or announce DeepSeek-V4's long-context features in the context of agent workflows. No article body was available for deeper analysis.

6Deepseek News·1mo ago·source ↗

DeepSeek API Major Upgrade: Function Calling, FIM, Chat Prefix Completion, JSON Output, and 8K Token Limit

DeepSeek has released a significant API update adding Function Calling (up to 128 parallel calls, OpenAI-compatible), JSON Output, Chat Prefix Completion, and FIM (Fill-In-the-Middle) Completion to both deepseek-chat and deepseek-coder models. The update also raises the max_tokens ceiling to 8K in the Beta API. Several features are in Beta and will be open-sourced once stable. The Function Calling and JSON Output implementations are explicitly designed to be compatible with the OpenAI API.

7Deepseek News·1mo ago·source ↗

DeepSeek API Introduces Context Caching on Disk, Cutting Token Prices by ~90%

DeepSeek has launched a disk-based context caching service for its API, reducing cache-hit token pricing to $0.014 per million tokens versus $0.14 for cache misses—a 90% cost reduction. The system requires no code changes, runs automatically for prefix-matched inputs, and reduces first-token latency from ~13s to ~500ms on 128K prompts. DeepSeek attributes the feasibility of disk caching to the compact KV cache produced by its MLA (Multi-head Latent Attention) architecture in DeepSeek V2, which it claims makes it the first LLM API provider to deploy extensive disk caching at scale. The service supports up to 1 trillion tokens per day with no concurrency limits.

6Deepseek News·1mo ago·source ↗

DeepSeek-V2.5: Merged Open-Source Model Combining General and Coding Capabilities

DeepSeek has released DeepSeek-V2.5, an open-source model that merges DeepSeek-V2-Chat-0628 and DeepSeek-Coder-V2-0724 into a single unified model. The release improves general conversational capabilities, coding performance, instruction-following, and writing tasks while also strengthening safety properties—raising the overall safety score from 74.4% to 82.6% and reducing safety spillover rate from 11.3% to 4.6%. The model is available via backward-compatible API endpoints (deepseek-chat and deepseek-coder) and on HuggingFace, retaining features like Function Calling, FIM completion, and JSON output. Benchmark results show improvements on HumanEval Python and LiveCodeBench, though SWE-verified performance remains an acknowledged weak area.

7Deepseek News·1mo ago·source ↗

DeepSeek-R1-Lite-Preview Launched with o1-Level Reasoning Performance

DeepSeek has released DeepSeek-R1-Lite-Preview, a reasoning-focused model claiming o1-preview-level performance on AIME and MATH benchmarks. The model features a transparent, real-time chain-of-thought process and demonstrates inference scaling behavior where longer reasoning chains yield better results. DeepSeek has indicated that open-source model weights and a full API are forthcoming. The model is currently accessible via chat.deepseek.com.

9Deepseek News·1mo ago·source ↗

DeepSeek-V3: 671B MoE Open-Source Model with 3x Speed Improvement

DeepSeek releases V3, a 671B parameter Mixture-of-Experts model with 37B activated parameters, trained on 14.8T tokens. The model runs at 60 tokens/second (3x faster than V2) and is fully open-source with weights and paper released. API pricing is set at $0.27/M input tokens and $1.10/M output tokens starting February 8, positioning it as a low-cost frontier alternative. DeepSeek signals future multimodal capabilities in the ecosystem.