5 · arXiv cs.AI (Artificial Intelligence) · 1mo ago

Distilling Tabular Foundation Models for Structured Health Data

This paper investigates knowledge distillation from tabular foundation models (TFMs) to lightweight student models for healthcare applications. The authors address context leakage in in-context TFMs via stratified out-of-fold teacher labeling, evaluating across 19 healthcare datasets, 6 TFM teachers, and 4 student families. Distilled students retain at least 90% of teacher AUC while running 26× faster on CPU, with preserved calibration and fairness properties. Multi-teacher ensembles do not consistently outperform the best single teacher.
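A minimal sketch of stratified out-of-fold teacher labeling, written from the summary above rather than from the paper's code. The k-nearest-neighbour `knn_teacher` is a hypothetical stand-in for an in-context TFM (it also conditions on its context rows at prediction time), and the function names, fold count, and synthetic data are all illustrative assumptions. The point it shows: each row's soft label comes from a teacher whose context excludes that row, so the row's own label cannot leak into it.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign row indices to folds so each fold keeps roughly the class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin across the folds.
        for position, i in enumerate(indices):
            folds[position % n_splits].append(i)
    return folds


def out_of_fold_teacher_labels(teacher_predict, X, y, n_splits=5, seed=0):
    """Label every row with a teacher whose context excludes that row's fold."""
    soft = [None] * len(y)
    for fold in stratified_folds(y, n_splits, seed):
        held_out = set(fold)
        context = [i for i in range(len(y)) if i not in held_out]
        ctx_X = [X[i] for i in context]
        ctx_y = [y[i] for i in context]
        for i in fold:
            soft[i] = teacher_predict(ctx_X, ctx_y, X[i])
    return soft


def knn_teacher(ctx_X, ctx_y, query, k=5):
    """Hypothetical stand-in teacher: share of positives among the k nearest context rows."""
    nearest = sorted(
        range(len(ctx_X)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(ctx_X[j], query)),
    )[:k]
    return sum(ctx_y[j] for j in nearest) / k


# Synthetic binary data (illustrative only, not from the paper).
rng = random.Random(1)
y = [i % 2 for i in range(100)]
X = [[rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)] for label in y]

# Soft targets a lightweight student could then be trained on.
soft_labels = out_of_fold_teacher_labels(knn_teacher, X, y)
```

If the teacher instead labeled rows that were already in its own context, a nearest-neighbour teacher would find each row at distance zero and largely echo its ground-truth label, which is the kind of leakage the out-of-fold scheme is meant to avoid.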

Evaluation and Benchmarking · Inference Economics · Enterprise Deployment Patterns · knowledge distillation · Stratified Out-of-Fold Teacher Labeling · AUC · Tabular Foundation Models (TFMs)

Related guides (4)

knowledge distillation · Concept

Knowledge Distillation: Compressing Model Intelligence into Smaller, Faster Successors

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Inference Economics · Topic guide

Inference Economics: The Cost Structure of Running AI Models in Production

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.AI · 1mo ago

Ensembling Tabular Foundation Models: A Diversity Ceiling and a Calibration Trap

This paper benchmarks six ensemble strategies across six tabular foundation models (TFMs) on 153 OpenML classification tasks, finding that ensembling provides minimal gains over the best single TFM. The best ensemble strategy (two-level cascade stacking) achieves only +0.18% accuracy improvement at 253× the compute cost. A key finding is that logistic-regression meta-learner stacking improves accuracy while severely degrading calibration (log-loss), because sharpening class boundaries destroys probability estimates. The authors recommend greedy ensemble selection as the practical default.
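For readers unfamiliar with greedy ensemble selection, the sketch below shows the classic forward-selection-with-replacement procedure: at each round, add whichever model most lowers the validation loss of the averaged prediction. It is an illustration under stated assumptions, not the paper's implementation. Scoring on log-loss, the function names, the round count, and the toy probabilities are all choices made here for the example.

```python
import math


def log_loss(y_true, probs, eps=1e-12):
    """Mean binary log-loss, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def greedy_ensemble_selection(model_probs, y_val, n_rounds=10):
    """Forward selection with replacement; returns a weight per model name."""
    names = sorted(model_probs)
    running_sum = [0.0] * len(y_val)
    counts = {name: 0 for name in names}
    for step in range(1, n_rounds + 1):
        best_name, best_loss = None, float("inf")
        for name in names:
            # Averaged prediction if this model were added at this round.
            candidate = [(s + p) / step for s, p in zip(running_sum, model_probs[name])]
            loss = log_loss(y_val, candidate)
            if loss < best_loss:
                best_name, best_loss = name, loss
        running_sum = [s + p for s, p in zip(running_sum, model_probs[best_name])]
        counts[best_name] += 1
    # A model's weight is the share of rounds in which it was picked.
    return {name: counts[name] / n_rounds for name in names}


# Toy validation set with hypothetical per-model probabilities of the positive class.
y_val = [1, 0, 1, 0]
model_probs = {
    "a": [0.9, 0.1, 0.8, 0.2],
    "b": [0.6, 0.4, 0.4, 0.6],
}
weights = greedy_ensemble_selection(model_probs, y_val)
```

Because the selection criterion here is a proper scoring rule rather than accuracy, the ensemble weights are chosen with calibration in the objective, which is one plausible reason such a procedure avoids the log-loss degradation described above.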

Evaluation and Benchmarking · Enterprise Deployment Patterns · Q-statistic · Greedy Ensemble Selection · Friedman-Nemenyi Test · +3 more

6 · arXiv · cs.LG · 26d ago

Strong Teacher Not Needed? On Distillation in LLM Pretraining

This paper challenges the conventional assumption that knowledge distillation requires a stronger teacher to produce better students. Through systematic variation of architecture sizes and training token budgets, the authors find that even small, undertrained teachers can improve larger student models when language modeling and distillation losses are properly mixed. Counterintuitively, stronger teachers can saturate or reverse distillation gains, and distillation benefits generalization more than in-domain fitting.
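One common way to mix a language modeling loss with a distillation loss is a convex combination of next-token cross-entropy and the KL divergence from the teacher's distribution to the student's. The sketch below shows that form for a single token position. The mixing weight `alpha`, the forward-KL direction, and the function names are assumptions made for illustration; the summary does not state the exact objective the authors use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_loss(student_logits, teacher_logits, target_index, alpha=0.5):
    """(1 - alpha) * LM cross-entropy + alpha * KL(teacher || student) for one token."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    # Language modeling term: negative log-likelihood of the observed next token.
    lm_loss = -math.log(p_student[target_index])
    # Distillation term: how far the student's distribution is from the teacher's.
    kd_loss = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return (1.0 - alpha) * lm_loss + alpha * kd_loss


# Hypothetical logits over a three-token vocabulary.
example = mixed_loss([1.0, 2.0, 0.5], [0.5, 2.5, 0.0], target_index=1, alpha=0.3)
```

Setting `alpha=0` recovers plain language modeling, and `alpha=1` trains purely on the teacher's distribution, so the weight controls how much a weak or undertrained teacher is allowed to shape the student.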

Training Infrastructure · Frontier Model Releases · knowledge distillation · Language Modeling Loss · Weak-to-Strong Distillation · +2 more

7 · arXiv · cs.AI · 29d ago

Foundation Model for Wearable Health Data Pretrained on 1 Trillion Minutes from 5 Million Participants

Researchers propose a large-scale foundation model for wearable health data, pretrained on over one trillion minutes of unlabeled sensor signals from five million participants. The model demonstrates systematic performance improvements across 35 health prediction tasks spanning cardiovascular, metabolic, sleep, and mental health domains, with joint scaling of model capacity and data volume. A 'classroom' of LLM agents autonomously searches downstream predictive head configurations, and the resulting embeddings are integrated into a Personal Health Agent validated by 1,860 clinician ratings. The work establishes label-efficient few-shot learning and generative capabilities for daily health metric estimation.

Frontier Model Releases · Evaluation and Benchmarking · LLM Agent Classroom · Personal Health Agent · few-shot learning · +4 more

3 · arXiv · cs.LG · 5d ago

HumP-KD: Uncertainty-aware multi-stage knowledge distillation for efficient fire classification

Researchers propose HumP-KD, a knowledge distillation framework that compresses two heterogeneous transformer teachers (Swin-Tiny and ViT-Base) into a lightweight MobileViT-S student for real-time fire classification. The student model achieves 0.9876 mean F1 on a 31K-image dataset with only 4.94M parameters, a 5.7× reduction relative to Swin-Tiny, and runs at 37.72 FPS on CPU. The framework combines hierarchical feature alignment, spatial attention masking, and progressive multi-stage distillation to maintain accuracy under degraded visual conditions.

Inference Economics · FlameVision · HumP-KD · Swin-Tiny · +2 more

5 · Hugging Face Blog · 1mo ago

Can Foundation Models Label Data Like Humans?

This Hugging Face blog post examines whether foundation models can serve as substitutes for human annotators in RLHF data labeling pipelines. It investigates the reliability and quality of model-generated preference labels compared to human-generated ones, with implications for scalable oversight and alignment research. The analysis is framed around the Open LLM Leaderboard and RLHF methodology.

Evaluation and Benchmarking · Alignment and RLHF · Reinforcement Learning from Human Feedback · Open LLM Leaderboard · Hugging Face · +1 more

4 · arXiv · cs.AI · 5d ago

Benchmark of deep learning architectures for multi-horizon behavioural forecasting in mobile health

A new arXiv preprint benchmarks six deep learning architectures, two zero-shot foundation models, and statistical baselines on multi-horizon behavioural forecasting from wearable and smartphone data across 800+ participants. Key findings include: no single architecture dominates (PatchTST leads among trained models), TimesFM matches or exceeds trained models zero-shot especially in low-data regimes, and participant-level fine-tuning reduces per-feature RMSE by 16–60%. The study is the first to jointly evaluate modern deep learning, foundation models, and personalisation for this domain.

Evaluation and Benchmarking · A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health · TimesFM · TCN · +1 more

5 · Hugging Face Blog · 1mo ago

Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny

Hugging Face has open-sourced knowledge distillation code and model weights for two compressed variants of Stable Diffusion: SD-Small and SD-Tiny. These distilled models are smaller and faster than the original Stable Diffusion, targeting inference efficiency. The release includes both the trained weights and the distillation training code, enabling the community to reproduce or extend the work.

Open Weights Progress · Inference Economics · SD-Tiny · knowledge distillation · SD-Small · +3 more

4 · arXiv · cs.CL · 18d ago

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics · Enterprise Deployment Patterns · MIMIC-III · Llama 3.1 70B · quantization · +4 more