Entity · technique

Self-Distillation

technique · active · self-distillation-e6a4f103 · 2 events · first seen May 19, 2026

Aliases: Self-Distillation

Co-occurring entities

RLVR · GRPO · Qwen2.5-7B-Instruct-1M · Freebase · Complex WebQuestions · ZEDA (Zero-Expert Self-Distillation Adaptation) · Qwen3-30B-A3B · GLM-4.7-Flash · Mixture of Experts

More like this (12)

distillation · on-policy self-distillation · Generalized Distillation · ensemble distillation · Model Distillation · Weak-to-Strong Distillation · distillation attacks · distilabel · Rubric-Conditioned Self-Distillation · ZEDA (Zero-Expert Self-Distillation Adaptation) · Skill-Conditioned Gated Self-Distillation (SGSD) · Distill to Detect

Recent events (2)

6 · arXiv · cs.CL · May 26, 2026

Peak-Then-Collapse: RLVR Tool-Use Failures on Knowledge-Graph APIs

This paper investigates RLVR-based tool-use training (GRPO on Qwen2.5-7B-Instruct) on a minimal knowledge-graph API (Freebase over Complex WebQuestions) and documents a 'peak-then-collapse' pattern in which tool-grounded answer rates rise and then fall to zero within 50 steps, replicated across four seeds and seven reward designs. The authors identify a key structural difference between knowledge-graph APIs and other tool types (Python, web search, JSON): sparse, non-natural-language feedback signals (e.g., empty brackets '[]') prevent the model from recovering via pretraining-familiar error signals. A direct oracle ablation shows that relation selection is not the bottleneck: 95.4% of errors are retrieval-composition failures. Self-distillation reaches 40% EM at 7B, and scaling capacity to 14B yields only marginal gains, suggesting an interface-bound ceiling.

Evaluation and Benchmarking · Agent and Tool Ecosystem · RLVR · Self-Distillation · GRPO · +4 more

6 · arXiv · cs.CL · May 19, 2026

ZEDA: Post-Trained MoE Models Can Skip Half Their Experts via Self-Distillation

This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a framework that converts static post-trained Mixture-of-Experts (MoE) language models into dynamic ones without pre-training from scratch. ZEDA injects parameter-free zero-output experts into each MoE layer and uses two-stage self-distillation with the original model as a frozen teacher. Applied to Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, ZEDA eliminates over 50% of expert FLOPs with marginal accuracy loss and achieves approximately 1.20× end-to-end inference speedup, outperforming the strongest dynamic MoE baseline by 4–6 points.

Training Infrastructure · Frontier Model Releases · Self-Distillation · ZEDA (Zero-Expert Self-Distillation Adaptation) · Qwen3-30B-A3B · +3 more
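
To make the ZEDA summary concrete, the toy sketch below shows the two ingredients it names: zero-output experts added to an MoE layer's routing table, and a distillation loss against a frozen teacher. It is an illustration under stated assumptions, not the paper's implementation. The function names (`route_top_k`, `moe_forward`, `kl_divergence`), the plain top-k softmax router, the unrenormalized routing weights, and the use of KL divergence as the distillation objective are all hypothetical choices made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


def moe_forward(x, experts, num_zero_experts, router_logits, k):
    """Toy MoE layer for a single token vector x.

    Router slots 0..len(experts)-1 map to real experts; the remaining
    num_zero_experts slots are parameter-free zero-output experts.
    Returns (output vector, number of real experts actually evaluated).
    """
    if len(router_logits) != len(experts) + num_zero_experts:
        raise ValueError("router must score every real and zero expert")
    out = [0.0] * len(x)
    real_calls = 0
    for idx, weight in route_top_k(router_logits, k):
        if idx >= len(experts):
            continue  # zero expert: adds nothing to the output, runs no compute
        real_calls += 1
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, real_calls


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over output distributions.

    Assumed stand-in for a self-distillation loss: the teacher is the
    original (frozen) model, the student is the model with zero experts.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Two hypothetical real experts (simple scalings) plus two zero experts.
    experts = [lambda v: [2.0 * a for a in v], lambda v: [3.0 * a for a in v]]
    out, calls = moe_forward([1.0, 1.0], experts, 2, [3.0, 0.0, 2.0, -1.0], k=2)
    print("real experts evaluated:", calls, "of k=2")
    print("distillation loss:", kl_divergence([1.0, 2.0], [1.2, 1.9]))
```

In this sketch, every top-k slot that lands on a zero expert is a real expert that never runs, which is the mechanism by which a layer can spend fewer expert FLOPs on some tokens. The distillation term is what would keep the student's outputs close to the original model's while the router learns to use those slots.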