Entity · model

DeepSeek-R1-Distill-Qwen

model · active · deepseek-r1-distill-qwen-1a47d1b0 · 5 events · first seen Jun 10, 2026

Aliases: DeepSeek-R1-Distill-Qwen, DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B

Co-occurring entities

AIME · Token Budget Saturation and Mechanistic Early Detection of Reasoning Non-Convergence in Chain-of-Thought Models · STILL-3 · LoRA · MADA-RL · DeepScaleR · Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization · Qwen3-14B · RACES · AceReason-14B · AIME 2026 · SnapKV · ReasonAlloc · DeepSeek-R1-Distill-Llama-8B · MATH-500 · N-GRPO · Semantic Neighbor Mixing · GRPO (Group Relative Policy Optimization)

More like this (12)

DeepSeek-R1-0528-Qwen3-8B · DeepSeek-R1-Distill-Llama-8B · DeepSeek-Prover-V2-7B · DeepSeek Reasonix · DeepSeek-V4-Pro Preview · DeepSeek-R1-0528 · DeepSeek-R1-Lite-Preview · DeepSeek-V2.5-1210 · DeepSeek-V3-0324 · DeepSeek-Math-V2 · DeepSeek V4 · DeepSeek flagship model

Recent events (5)

5 · arXiv · cs.CL · Jul 24, 2026

Linear probes on chain-of-thought hidden states can predict reasoning non-convergence early

A new arXiv preprint studies the bimodal convergence pattern in chain-of-thought models such as DeepSeek-R1-Distill-Qwen-7B: generations either complete within the token budget (90.3% accuracy on AIME) or exhaust it without concluding (6.6% accuracy), and 62% of generations converge overall. The authors train linear probes on hidden-state activations at early token positions (50–300) and find that layer-20 activations at token 150 achieve an AUC of 0.608, above the 0.5 chance level and ahead of behavioral baselines built from token entropy and repetition statistics. The results suggest that convergence fate is partially encoded in intermediate representations early in generation, which points toward early-exit inference and adaptive compute allocation. The statistical evidence is modest (permutation test p=0.063), which limits strong conclusions.

Evaluation and Benchmarking · Inference Economics · AIME · DeepSeek-R1-Distill-Qwen · Token Budget Saturation and Mechanistic Early Detection of Reasoning Non-Convergence in Chain-of-Thought Models
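
The evaluation quantities in the summary above (AUC, a permutation test, and the split between converged and budget-exhausted generations) can be illustrated with a small standard-library sketch. Everything here is an assumption for illustration: the probe scores are synthetic, the function names are hypothetical, and the preprint's actual probe training and test procedure are not described in the summary. The blended-accuracy helper assumes the two reported accuracies are conditional on the convergence outcome.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(scores, labels, n_perm=2000, seed=0):
    """One-sided permutation test: how often shuffled labels match or beat the observed AUC."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)

def blended_accuracy(conv_rate, acc_converged, acc_exhausted):
    """Overall accuracy implied by a two-mode mixture (assumes conditional accuracies)."""
    return conv_rate * acc_converged + (1 - conv_rate) * acc_exhausted

# Synthetic probe scores with a weak signal; 1 = converged, 0 = exhausted the budget.
rng = random.Random(1)
labels = [1] * 20 + [0] * 20
scores = [rng.gauss(0.3 * y, 1.0) for y in labels]
observed_auc, p_value = permutation_p_value(scores, labels)
print(f"AUC={observed_auc:.3f}  permutation p={p_value:.3f}")

# Mixture implied by the summary's figures, under the stated assumption.
print(f"implied overall accuracy: {blended_accuracy(0.62, 0.903, 0.066):.3f}")
```

With a weak signal and few samples, an AUC only slightly above 0.5 can easily yield a p-value above 0.05, which is the situation the summary describes.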

4 · arXiv · cs.CL · Jul 21, 2026

MADA-RL: Multi-agent debate with counterfactual critic advantage for parameter-efficient reasoning in compact models

MADA-RL is a post-training framework that specializes compact LLMs (≤4B parameters) into generator and critic roles, training them with a debate-aware RL signal using LoRA adapters on only a small fraction of parameters. The central contribution is a counterfactual critic advantage that conditions the critic's reward on the generator ensemble's per-instance accuracy, explicitly incentivizing critics to correct generator errors rather than reproduce correct answers. Applied to DeepSeek-R1-Distill-Qwen-1.5B, the method achieves +2.0 percentage points on mathematical reasoning benchmarks using 16× fewer trainable parameters than fully fine-tuned baselines, placing it on the accuracy-vs-trainable-parameter Pareto front. The approach does not surpass the strongest baselines (DeepScaleR, STILL-3) trained on larger datasets, and the paper analyzes this gap directly.

Open Weights Progress · Alignment and RLHF · STILL-3 · LoRA · DeepSeek-R1-Distill-Qwen · +2 more
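
The summary does not give the formula for the counterfactual critic advantage. As one hypothetical reading, the critic's reward can be baselined against the generator ensemble's per-instance accuracy, so that a critic earns credit only for improving on what the generators already achieved. The function name and the subtraction form below are assumptions for illustration, not the paper's definition.

```python
def counterfactual_critic_advantage(critic_correct, generator_correct):
    """Hypothetical advantage: critic outcome minus the generator ensemble's accuracy.

    critic_correct: whether the critic's final answer is right on this instance.
    generator_correct: one boolean per generator in the ensemble, same instance.
    """
    if not generator_correct:
        raise ValueError("need at least one generator outcome")
    baseline = sum(generator_correct) / len(generator_correct)
    return float(critic_correct) - baseline

# Fixing an instance every generator got wrong earns full credit.
print(counterfactual_critic_advantage(True, [False, False, False, False]))
# Repeating an answer every generator already had earns nothing.
print(counterfactual_critic_advantage(True, [True, True, True, True]))
# Breaking an answer the generators had right is penalized.
print(counterfactual_critic_advantage(False, [True, True, True, True]))
```

Under this form, the incentive described in the summary falls out directly: the reward for a correct critic shrinks as the generators' own accuracy on that instance rises.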

6 · arXiv · cs.CL · Jun 11, 2026

RACES framework enables recursive composition of verifiable RL environments for LLM reasoning generalization

RACES (Recursive Automated Composition for Environment Scaling) is a new framework that treats verifiable RL training environments as composable building blocks, automatically fusing them when input/output types match. The system implements 300 base environments and four composition operators (SEQUENTIAL, PARALLEL, SORT, SELECT) to generate diverse reasoning patterns at scale. Experiments show consistent gains on unseen benchmarks: averaged across six benchmarks, DeepSeek-R1-Distill-Qwen-14B improves from 48.2 to 51.3 (+3.1) and Qwen3-14B from 58.8 to 61.1 (+2.3). Notably, RACES achieves parity with 300 individual environments using only 50 base environments, suggesting strong efficiency gains over linear environment scaling.

Evaluation and Benchmarking · Alignment and RLHF · Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization · DeepSeek-R1-Distill-Qwen · Qwen3-14B · +1 more
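
Type-matched composition of verifiable environments, as described in the summary above, can be sketched in a few lines. This is a toy model under stated assumptions: the `Env` class, the `sequential` and `parallel` helpers, and the three base environments are hypothetical, and only two of the four operators are shown. The actual RACES interfaces are not given in the summary.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Env:
    """Toy verifiable environment: a typed task with a reference solver."""
    name: str
    in_type: type
    out_type: type
    solve: Callable[[Any], Any]  # reference solution used to check answers

def sequential(a: Env, b: Env) -> Env:
    """Feed a's output into b; allowed only when the types line up."""
    if a.out_type is not b.in_type:
        raise TypeError(f"{a.name} -> {b.name}: {a.out_type} != {b.in_type}")
    return Env(f"SEQUENTIAL({a.name}, {b.name})", a.in_type, b.out_type,
               lambda x: b.solve(a.solve(x)))

def parallel(a: Env, b: Env) -> Env:
    """Run both environments on the same input and pair the results."""
    if a.in_type is not b.in_type:
        raise TypeError(f"{a.name} || {b.name}: input types differ")
    return Env(f"PARALLEL({a.name}, {b.name})", a.in_type, tuple,
               lambda x: (a.solve(x), b.solve(x)))

def verify(env: Env, x: Any, answer: Any) -> bool:
    """Verifiable reward: exact match against the reference solver."""
    return env.solve(x) == answer

# Three hypothetical base environments.
digits = Env("digits", int, list, lambda n: [int(c) for c in str(n)])
total = Env("total", list, int, lambda xs: sum(xs))
square = Env("square", int, int, lambda n: n * n)

# Composition is recursive: a composed environment is itself an Env.
digit_sum_squared = sequential(sequential(digits, total), square)
print(digit_sum_squared.name, digit_sum_squared.solve(123))
print(verify(parallel(square, digits), 12, (144, [1, 2])))
```

Because a composite is just another `Env`, a small base set can produce many distinct tasks, which is one way to read the reported result that 50 base environments matched 300 individual ones.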

5 · arXiv · cs.AI · Jun 10, 2026

ReasonAlloc: Hierarchical KV Cache Budget Allocation for Long-CoT Reasoning Models

ReasonAlloc is a training-free framework that reframes decoding-time KV cache compression as a hierarchical budget allocation problem, operating at both layer-wise (offline) and head-wise (online) levels. The method identifies an architecture-driven pattern called the 'Reasoning Wave' to guide layer preallocation, then dynamically reallocates to information-rich heads during decoding. Evaluated on MATH-500 and AIME 2024 using DeepSeek-R1-Distill and AceReason models, it outperforms uniform-budget baselines (R-KV, SnapKV, Pyramid-RKV) especially at small budgets of 128–512 tokens, with negligible overhead.

Frontier Model Releases · Inference Economics · AceReason-14B · AIME 2026 · SnapKV · +4 more
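
The two-level allocation described above (a layer-wise split fixed offline, then a head-wise split updated during decoding) can be sketched as two rounds of proportional integer splitting. The weights and scores below are made-up numbers, the function names are hypothetical, and the 'Reasoning Wave' pattern itself is not reproduced. This only shows the budget bookkeeping under those assumptions.

```python
def proportional_split(total, weights):
    """Split an integer budget in proportion to weights (largest-remainder rounding)."""
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    raw = [total * w / weight_sum for w in weights]
    alloc = [int(r) for r in raw]
    leftover = total - sum(alloc)
    # Hand the remaining slots to the entries with the largest fractional parts.
    by_fraction = sorted(range(len(weights)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc

def hierarchical_allocation(total_budget, layer_weights, head_scores):
    """Layer-wise split first, then a head-wise split inside each layer.

    layer_weights: one offline weight per layer (assumed given).
    head_scores: per layer, one online importance score per head (assumed given).
    """
    layer_budgets = proportional_split(total_budget, layer_weights)
    return [proportional_split(b, scores) for b, scores in zip(layer_budgets, head_scores)]

# Toy setup: 3 layers with 4 heads each and 512 cache slots overall.
layer_weights = [1.0, 2.0, 1.0]
head_scores = [
    [0.1, 0.4, 0.4, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
plan = hierarchical_allocation(512, layer_weights, head_scores)
for layer, heads in enumerate(plan):
    print(f"layer {layer}: {heads} (sum={sum(heads)})")
```

Largest-remainder rounding is used so that the per-head budgets always add up to the total, which matters most at the small budgets the summary highlights.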

4 · arXiv · cs.CL · Jun 10, 2026

N-GRPO: Semantic Neighbor Mixing for Improved Policy Optimization in LLM Reasoning

A new arXiv preprint introduces N-GRPO, an exploration strategy for the GRPO reinforcement learning framework that improves solution diversity during rollout by mixing embeddings of anchor tokens with their nearest semantic neighbors rather than using token-level sampling or random noise. The method is evaluated on DeepSeek-R1-Distill-Qwen models of various sizes and shows consistent improvements on math reasoning benchmarks plus out-of-distribution generalization. The work targets a known limitation in RLHF-style training: redundant rollout trajectories that reduce effective learning signal.

Evaluation and Benchmarking · Alignment and RLHF · N-GRPO · DeepSeek-R1-Distill-Qwen · Semantic Neighbor Mixing · +1 more
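
Mixing an anchor token's embedding with its nearest semantic neighbors, as the summary describes, might look like the sketch below. The summary does not specify the mixing rule, so the convex combination with the neighbor centroid, the cosine-similarity neighbor search, and the parameters `k` and `alpha` are all assumptions. The embedding table is a toy example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_neighbors(table, anchor, k):
    """Token ids of the k embeddings most similar to the anchor, excluding the anchor."""
    sims = [(cosine(table[anchor], vec), tok)
            for tok, vec in enumerate(table) if tok != anchor]
    sims.sort(reverse=True)
    return [tok for _, tok in sims[:k]]

def mix_with_neighbors(table, anchor, k=2, alpha=0.3):
    """Assumed rule: (1 - alpha) * anchor + alpha * mean of the k nearest neighbors."""
    neighbors = nearest_neighbors(table, anchor, k)
    dim = len(table[anchor])
    centroid = [sum(table[t][d] for t in neighbors) / len(neighbors) for d in range(dim)]
    return [(1 - alpha) * a + alpha * c for a, c in zip(table[anchor], centroid)]

# Toy 2-D embedding table indexed by token id.
table = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_neighbors(table, anchor=0, k=2))
print(mix_with_neighbors(table, anchor=0, k=1, alpha=0.5))
```

Compared with adding random noise, a perturbation of this kind stays inside the region of embedding space occupied by related tokens, which is the intuition the summary gives for more diverse yet meaningful rollouts.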