HumP-KD: Uncertainty-aware multi-stage knowledge distillation for efficient fire classification
Researchers propose HumP-KD, a knowledge distillation framework that compresses two heterogeneous transformer teachers (Swin-Tiny and ViT-Base) into a lightweight MobileViT-S student for real-time fire classification. The student reaches a mean F1 of 0.9876 on a 31K-image dataset with only 4.94M parameters (a 5.7× reduction relative to Swin-Tiny) and runs at 37.72 FPS on CPU. The framework combines hierarchical feature alignment, spatial attention masking, and progressive multi-stage distillation to maintain accuracy under degraded visual conditions.
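To make the two-teacher setup concrete, here is a minimal sketch of a generic logit-level distillation loss in plain Python. It is not HumP-KD's actual objective: the summary does not give the loss, so the equal-weight averaging of the two teachers, the temperature, the mixing weight `alpha`, and all function names are assumptions chosen for illustration. The feature alignment, attention masking, and multi-stage schedule are not modelled.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, stabilised by subtracting the max."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def two_teacher_kd_loss(student_logits, teacher_a_logits, teacher_b_logits,
                        label, temperature=4.0, alpha=0.5):
    """Mix a soft-target KL term with hard-label cross-entropy.

    Assumption: the soft target is the equal-weight average of the two
    teachers' softened distributions. The source does not specify this.
    """
    p_a = softmax(teacher_a_logits, temperature)
    p_b = softmax(teacher_b_logits, temperature)
    target = [(a + b) / 2 for a, b in zip(p_a, p_b)]

    # Soft term: KL between averaged teachers and the softened student,
    # scaled by T^2 as in standard logit distillation.
    q_soft = softmax(student_logits, temperature)
    distill = temperature ** 2 * kl_divergence(target, q_soft)

    # Hard term: cross-entropy against the ground-truth class index.
    q_hard = softmax(student_logits)
    cross_entropy = -math.log(q_hard[label] + 1e-12)

    return alpha * distill + (1 - alpha) * cross_entropy
```

Calling `two_teacher_kd_loss([0.5, 2.0], [2.0, 0.5], [2.0, 0.5], label=0)` yields a larger loss than passing student logits that agree with the teachers, which is the behaviour a distillation term is meant to encode.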
Related events (8)
Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny
Segmind, in a post on the Hugging Face blog, has open-sourced knowledge distillation code and model weights for two compressed variants of Stable Diffusion: SD-Small and SD-Tiny. These distilled models are smaller and faster than the original Stable Diffusion, targeting inference efficiency. The release includes both the trained weights and the distillation training code, enabling the community to reproduce or extend the work.
Distilling Tabular Foundation Models for Structured Health Data
This paper investigates knowledge distillation from tabular foundation models (TFMs) to lightweight student models for healthcare applications. The authors address context leakage in in-context TFMs via stratified out-of-fold teacher labeling, evaluating across 19 healthcare datasets, 6 TFM teachers, and 4 student families. Distilled students retain at least 90% of teacher AUC while running 26× faster on CPU, with preserved calibration and fairness properties. Multi-teacher ensembles do not consistently outperform the best single teacher.
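The out-of-fold idea can be sketched as follows: each sample receives its teacher label from a teacher whose context excludes that sample's fold, so the teacher never sees the row it is labeling. This is an illustrative sketch only. The 1-nearest-neighbour `nearest_neighbour_teacher` is a hypothetical stand-in for an in-context TFM, and the function names, scalar features, and fold count are assumptions rather than the paper's setup.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed=0):
    """Assign each sample index to a fold while preserving the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a share of it.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of

def nearest_neighbour_teacher(context_x, context_y, query_x):
    """Hypothetical stand-in for an in-context teacher: 1-NN on a scalar."""
    predictions = []
    for q in query_x:
        nearest = min(range(len(context_x)),
                      key=lambda i: abs(context_x[i] - q))
        predictions.append(context_y[nearest])
    return predictions

def out_of_fold_teacher_labels(features, labels, teacher, n_folds=3, seed=0):
    """Label every sample with a teacher that never saw it in context."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    fold_of = stratified_folds(labels, n_folds, seed)
    teacher_labels = [None] * len(labels)
    for fold in range(n_folds):
        held_out = [i for i, f in enumerate(fold_of) if f == fold]
        context = [i for i, f in enumerate(fold_of) if f != fold]
        predictions = teacher([features[i] for i in context],
                              [labels[i] for i in context],
                              [features[i] for i in held_out])
        for i, prediction in zip(held_out, predictions):
            teacher_labels[i] = prediction
    return teacher_labels
```

The resulting `teacher_labels` list would then serve as the training target for the lightweight student, in place of labels produced by a teacher that had the same rows in its context.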
Strong Teacher Not Needed? On Distillation in LLM Pretraining
This paper challenges the conventional assumption that knowledge distillation requires a stronger teacher to produce better students. Through systematic variation of architecture sizes and training token budgets, the authors find that even small, undertrained teachers can improve larger student models when language modeling and distillation losses are properly mixed. Counterintuitively, stronger teachers can saturate or reverse distillation gains, and distillation benefits generalization more than in-domain fitting.
Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates
A new arXiv paper analyzes on-policy distillation (OPD) — a post-training method combining on-policy student trajectories with dense teacher supervision — across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and distributed across layers (FFN-heavy), and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, updates are numerically full-rank but spectrally concentrated, falling disproportionately on near-zero weight coordinates, suggesting OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.
Semi-supervised knowledge transfer for deep learning from private training data
OpenAI published research on semi-supervised knowledge transfer techniques for training deep learning models on private data, an early contribution to privacy-preserving machine learning. The work addresses how to leverage private training data without exposing sensitive information, using knowledge distillation-style approaches. This is a 2016 archival post surfaced from OpenAI's blog.
Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs
Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
UNIEGO: Hierarchical multi-teacher distillation for unified egocentric video representation
Researchers introduce UNIEGO, an egocentric video encoder trained via a hierarchical multi-teacher distillation framework using nine teachers spanning ego-exo viewpoints, RGB/depth/skeleton modalities, and four foundation models. A key contribution is the interposition of Proxy models that translate heterogeneous teacher knowledge into a homogeneous space, followed by Selective Proxy Distillation (SPD) which adaptively selects reliable supervision signals per training sample. UNIEGO achieves state-of-the-art results on action recognition, video retrieval, and action segmentation across three ego-exo benchmarks. The work addresses a practical deployment constraint: the unified model runs from egocentric video alone despite being trained with multi-modal, multi-viewpoint supervision.
CARV: Compute-Aware Variance Reduction for Diffusion Teacher Gradient Estimation
CARV is a hierarchical Monte Carlo estimation framework that reduces gradient variance when using frozen pretrained diffusion models as teachers in downstream pipelines such as text-to-3D distillation and data attribution. The approach amortizes expensive upstream computation (rendering, simulation, encoding) over cheap diffusion-noise resamples, augmented by timestep importance sampling and stratified-inverse-CDF construction. In text-to-3D experiments, CARV delivers 2–3× effective compute multipliers; in single-step distillation, it cuts gradient variance by an order of magnitude but does not improve FID, revealing that MC variance is not the bottleneck in that regime.
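One ingredient named above, timestep importance sampling with a stratified inverse-CDF construction, can be illustrated in isolation. The sketch below is not CARV's implementation: the proposal `weights`, the function names, and the choice to target a uniform average over timesteps are assumptions made for the example. It splits [0, 1) into equal strata, draws one uniform per stratum, maps each through the inverse CDF of the proposal, and attaches the importance weight needed to keep the estimate of the uniform average unbiased.

```python
import bisect
import random

def stratified_timesteps(weights, n_samples, rng):
    """Return (timestep, importance_weight) pairs.

    weights: positive, unnormalised proposal mass per timestep (assumed).
    The importance weight corrects from the proposal back to a uniform
    distribution over timesteps.
    """
    if n_samples < 1 or any(w <= 0 for w in weights):
        raise ValueError("need n_samples >= 1 and strictly positive weights")
    total = sum(weights)
    probs = [w / total for w in weights]

    # Cumulative distribution over timesteps.
    cdf, running = [], 0.0
    for p in probs:
        running += p
        cdf.append(running)
    cdf[-1] = 1.0  # guard against floating-point drift

    n_steps = len(weights)
    draws = []
    for k in range(n_samples):
        # One uniform draw inside the k-th stratum [k/n, (k+1)/n).
        u = (k + rng.random()) / n_samples
        # Inverse CDF: first timestep whose cumulative mass exceeds u.
        t = min(bisect.bisect_right(cdf, u), n_steps - 1)
        draws.append((t, (1.0 / n_steps) / probs[t]))
    return draws

def importance_weighted_mean(draws, per_timestep_value):
    """Estimate the uniform average of per_timestep_value over timesteps."""
    return sum(w * per_timestep_value(t) for t, w in draws) / len(draws)
```

Stratification guarantees that the draws cover the proposal's mass evenly rather than clustering by chance, which is the source of the variance reduction relative to independent sampling from the same proposal.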
