3 · arXiv cs.LG (Machine Learning) · 5d ago

HumP-KD: Uncertainty-aware multi-stage knowledge distillation for efficient fire classification

Researchers propose HumP-KD, a knowledge distillation framework that compresses two heterogeneous transformer teachers (Swin-Tiny and ViT-Base) into a lightweight MobileViT-S student for real-time fire classification. The student reaches a mean F1 of 0.9876 on a 31K-image dataset with only 4.94M parameters (5.7× fewer than Swin-Tiny) and runs at 37.72 FPS on CPU. The framework combines hierarchical feature alignment, spatial attention masking, and progressive multi-stage distillation to maintain accuracy under degraded visual conditions.
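
As a rough illustration of the distillation idea in the summary above, the sketch below blends the softened outputs of two teachers into one target and mixes a KL term with the usual hard-label loss. It is a generic soft-label recipe, not the HumP-KD objective: hierarchical feature alignment, spatial attention masking, and the multi-stage schedule are omitted, and the function names, `temperature`, `alpha`, and `teacher_weight` are assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def two_teacher_kd_loss(student_logits, teacher_a_logits, teacher_b_logits,
                        label, temperature=4.0, alpha=0.5, teacher_weight=0.5):
    """Hard-label cross-entropy mixed with KL to a weighted teacher average."""
    p_a = softmax(teacher_a_logits, temperature)
    p_b = softmax(teacher_b_logits, temperature)
    # Assumed fusion rule: a fixed convex combination of the two teachers.
    target = [teacher_weight * a + (1 - teacher_weight) * b
              for a, b in zip(p_a, p_b)]
    soft = kl_divergence(target, softmax(student_logits, temperature))
    hard = -math.log(softmax(student_logits)[label] + 1e-12)
    # The T^2 factor keeps the soft term's gradient scale comparable
    # across temperatures.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft
```

An uncertainty-aware variant would presumably replace the fixed `teacher_weight` with a per-sample weight, but how HumP-KD does this is not stated in the summary.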

Inference Economics FlameVision HumP-KD Swin-Tiny MobileViT-S ViT-Base

Related guides (1)

Inference Economics (Topic guide)

Inference Economics: The Cost of Running AI in Production


Related events (8)

5 · Hugging Face Blog · 1mo ago

Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny

Hugging Face has open-sourced knowledge distillation code and model weights for two compressed variants of Stable Diffusion: SD-Small and SD-Tiny. These distilled models are smaller and faster than the original Stable Diffusion, targeting inference efficiency. The release includes both the trained weights and the distillation training code, enabling the community to reproduce or extend the work.

Open Weights Progress Inference Economics SD-Tiny knowledge distillation SD-Small +3 more

5 · arXiv cs.AI · 1mo ago

Distilling Tabular Foundation Models for Structured Health Data

This paper investigates knowledge distillation from tabular foundation models (TFMs) to lightweight student models for healthcare applications. The authors address context leakage in in-context TFMs via stratified out-of-fold teacher labeling, evaluating across 19 healthcare datasets, 6 TFM teachers, and 4 student families. Distilled students retain at least 90% of teacher AUC while running 26× faster on CPU, with preserved calibration and fairness properties. Multi-teacher ensembles do not consistently outperform the best single teacher.
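
The out-of-fold labeling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `stratified_folds`, `out_of_fold_teacher_labels`, and the toy `fit_prior_teacher` are hypothetical names, and the "teacher" is a stand-in that only predicts the positive rate of its context rows. The point is that each row receives a soft label from a teacher whose context never contained that row.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each index to one of k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % k
    return fold_of

def out_of_fold_teacher_labels(features, labels, k, fit_teacher):
    """Label every row with a teacher that never saw that row as context."""
    fold_of = stratified_folds(labels, k)
    soft_labels = [None] * len(labels)
    for fold in range(k):
        context = [i for i, f in enumerate(fold_of) if f != fold]
        held_out = [i for i, f in enumerate(fold_of) if f == fold]
        teacher = fit_teacher([features[i] for i in context],
                              [labels[i] for i in context])
        for i in held_out:
            soft_labels[i] = teacher(features[i])
    return soft_labels

def fit_prior_teacher(context_x, context_y):
    """Hypothetical stand-in teacher: predicts its context's positive rate."""
    rate = sum(context_y) / len(context_y)
    return lambda x: rate

features = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
labels = [0, 0, 0, 1, 1, 1]
soft = out_of_fold_teacher_labels(features, labels, k=3,
                                  fit_teacher=fit_prior_teacher)
```

A student would then be trained on `soft` instead of on labels produced by a teacher that had each row in its own context.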

Evaluation and Benchmarking Inference Economics knowledge distillation Stratified Out-of-Fold Teacher Labeling AUC +2 more

6 · arXiv cs.LG · 26d ago

Strong Teacher Not Needed? On Distillation in LLM Pretraining

This paper challenges the conventional assumption that knowledge distillation requires a stronger teacher to produce better students. Through systematic variation of architecture sizes and training token budgets, the authors find that even small, undertrained teachers can improve larger student models when language modeling and distillation losses are properly mixed. Counterintuitively, stronger teachers can saturate or reverse distillation gains, and distillation benefits generalization more than in-domain fitting.

Training Infrastructure Frontier Model Releases knowledge distillation Language Modeling Loss Weak-to-Strong Distillation +2 more

5 · arXiv cs.LG · 8d ago

Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates

A new arXiv paper analyzes on-policy distillation (OPD), a post-training method that combines on-policy student trajectories with dense teacher supervision, across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and spread across layers, with most of the change in FFN blocks, and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, updates are numerically full-rank but spectrally concentrated, and they fall disproportionately on near-zero weight coordinates. This suggests OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.
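
To make "coordinate-sparse" and "training only the sparse subnetwork" concrete, here is a small sketch on flat weight lists. The helper names, the tolerance, and the plain SGD step are assumptions for illustration; the paper's actual measurement and training procedure are not given in this summary.

```python
def changed_mask(before, after, tol=1e-8):
    """True where a coordinate moved by more than tol during training."""
    return [abs(a - b) > tol for b, a in zip(before, after)]

def sparsity(mask):
    """Fraction of coordinates that were left effectively untouched."""
    return 1.0 - sum(mask) / len(mask)

def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Apply a gradient step only on coordinates selected by the mask."""
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]

before = [0.0, 1.0, 2.0, 3.0]
after = [0.0, 1.5, 2.0, 3.0]
mask = changed_mask(before, after)
```

Here `mask` marks the single coordinate that moved, and `masked_sgd_step` restricts later updates to that coordinate, which is the sense in which a discovered subnetwork can be trained on its own.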

Evaluation and Benchmarking Alignment and RLHF on-policy distillation AdamW Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

4 · OpenAI Blog · 1mo ago

Semi-supervised knowledge transfer for deep learning from private training data

OpenAI published research on semi-supervised knowledge transfer techniques for training deep learning models on private data, an early contribution to privacy-preserving machine learning. The work addresses how to leverage private training data without exposing sensitive information, using knowledge distillation-style approaches. This is a 2016 archival post surfaced from OpenAI's blog.

AI Safety Research knowledge distillation Differential Privacy PATE (Private Aggregation of Teachers for Ensembles) +1 more

6 · arXiv cs.CL · 1mo ago

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.

Evaluation and Benchmarking Agent and Tool Ecosystem Multimodal Large Language Models Thinking-with-Images on-policy self-distillation +4 more

4 · arXiv cs.LG · 47h ago

UNIEGO: Hierarchical multi-teacher distillation for unified egocentric video representation

Researchers introduce UNIEGO, an egocentric video encoder trained via a hierarchical multi-teacher distillation framework using nine teachers spanning ego-exo viewpoints, RGB/depth/skeleton modalities, and four foundation models. A key contribution is the interposition of Proxy models that translate heterogeneous teacher knowledge into a homogeneous space, followed by Selective Proxy Distillation (SPD) which adaptively selects reliable supervision signals per training sample. UNIEGO achieves state-of-the-art results on action recognition, video retrieval, and action segmentation across three ego-exo benchmarks. The work addresses a practical deployment constraint: the unified model runs from egocentric video alone despite being trained with multi-modal, multi-viewpoint supervision.

Evaluation and Benchmarking Multimodal Progress Selective Proxy Distillation UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning UNIEGO

5 · arXiv cs.AI · 1mo ago

CARV: Compute-Aware Variance Reduction for Diffusion Teacher Gradient Estimation

CARV is a hierarchical Monte Carlo estimation framework that reduces gradient variance when using frozen pretrained diffusion models as teachers in downstream pipelines such as text-to-3D distillation and data attribution. The approach amortizes expensive upstream computation (rendering, simulation, encoding) over cheap diffusion-noise resamples, augmented by timestep importance sampling and stratified-inverse-CDF construction. In text-to-3D experiments, CARV delivers 2–3× effective compute multipliers; in single-step distillation, it cuts gradient variance by an order of magnitude but does not improve FID, revealing that MC variance is not the bottleneck in that regime.
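
The amortization idea in the summary above (pay for the expensive upstream step once, then reuse it across many cheap noise draws) can be sketched as a two-level Monte Carlo average. This is a simplified illustration, not CARV itself: it leaves out timestep importance sampling and the stratified inverse-CDF construction, and `expensive_upstream`, `cheap_integrand`, `n_outer`, and `k_inner` are hypothetical names.

```python
import random

def nested_mc_estimate(expensive_upstream, cheap_integrand,
                       n_outer, k_inner, rng):
    """Average cheap_integrand over k_inner noise draws per upstream sample."""
    total = 0.0
    for _ in range(n_outer):
        # Costly step (e.g. rendering or encoding), computed once per outer draw.
        state = expensive_upstream(rng)
        # Cheap step: reuse the same state across k_inner Gaussian noise draws.
        inner = sum(cheap_integrand(state, rng.gauss(0.0, 1.0))
                    for _ in range(k_inner))
        total += inner / k_inner
    return total / n_outer

rng = random.Random(0)
estimate = nested_mc_estimate(
    expensive_upstream=lambda r: r.uniform(0.0, 1.0),
    cheap_integrand=lambda state, noise: state + noise,
    n_outer=8, k_inner=16, rng=rng,
)
```

With `n_outer` upstream calls and `n_outer * k_inner` cheap evaluations, raising `k_inner` reduces the noise-driven part of the variance without adding upstream cost. It does not reduce variance that comes from the upstream samples themselves.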

Inference Economics Multimodal Progress Model Distillation CARV importance sampling +4 more