Foundation Model for Wearable Health Data Pretrained on 1 Trillion Minutes from 5 Million Participants
Researchers propose a large-scale foundation model for wearable health data, pretrained on over one trillion minutes of unlabeled sensor signals from five million participants. Performance improves systematically across 35 health prediction tasks spanning cardiovascular, metabolic, sleep, and mental health domains as model capacity and data volume are scaled jointly. A 'classroom' of LLM agents autonomously searches configurations for the downstream predictive heads, and the resulting embeddings are integrated into a Personal Health Agent validated by 1,860 clinician ratings. The work also demonstrates label-efficient few-shot learning and generative capabilities for estimating daily health metrics.
Related guides (4)

Enterprise Deployment Patterns: From LLM Demo to Production Reality (Topic guide)
Related events (8)
Benchmark of deep learning architectures for multi-horizon behavioural forecasting in mobile health
A new arXiv preprint benchmarks six deep learning architectures, two zero-shot foundation models, and statistical baselines on multi-horizon behavioural forecasting from wearable and smartphone data across 800+ participants. Key findings include: no single architecture dominates (PatchTST leads among trained models), TimesFM matches or exceeds trained models zero-shot especially in low-data regimes, and participant-level fine-tuning reduces per-feature RMSE by 16–60%. The study is the first to jointly evaluate modern deep learning, foundation models, and personalisation for this domain.
Distilling Tabular Foundation Models for Structured Health Data
This paper investigates knowledge distillation from tabular foundation models (TFMs) to lightweight student models for healthcare applications. The authors address context leakage in in-context TFMs via stratified out-of-fold teacher labeling, evaluating across 19 healthcare datasets, 6 TFM teachers, and 4 student families. Distilled students retain at least 90% of teacher AUC while running 26× faster on CPU, with preserved calibration and fairness properties. Multi-teacher ensembles do not consistently outperform the best single teacher.
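The out-of-fold teacher labeling described above can be illustrated with a minimal sketch. It is not the paper's implementation: `toy_teacher` is a hypothetical nearest-neighbour stand-in for an in-context TFM, the data are made up for illustration, and the fold count is an assumption. The point it shows is that each row's soft label comes from a teacher whose context never contained that row, which is how context leakage is avoided.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign each row to a fold so every class is spread evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


def toy_teacher(context_x, context_y, query_x, k=3):
    """Hypothetical in-context teacher: mean label of the k nearest context rows."""
    nearest = sorted(
        (abs(cx - query_x), cy) for cx, cy in zip(context_x, context_y)
    )[:k]
    return sum(cy for _, cy in nearest) / len(nearest)


def out_of_fold_soft_labels(xs, ys, n_folds=5, seed=0):
    """Label each row with a teacher whose context excludes that row's fold."""
    fold_of = stratified_folds(ys, n_folds, seed)
    soft = [None] * len(xs)
    for fold in range(n_folds):
        context = [i for i in range(len(xs)) if fold_of[i] != fold]
        ctx_x = [xs[i] for i in context]
        ctx_y = [ys[i] for i in context]
        for i in range(len(xs)):
            if fold_of[i] == fold:
                soft[i] = toy_teacher(ctx_x, ctx_y, xs[i])
    return soft, fold_of


# Toy one-feature dataset, invented purely for illustration.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
soft_labels, fold_of = out_of_fold_soft_labels(xs, ys, n_folds=5)
print(soft_labels)
```

A student model would then be trained on `xs` with `soft_labels` as targets; that step is omitted here.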
Fine-tuning LLMs to passively estimate depression severity from AI mental health conversations
Researchers fine-tune a Qwen3.5-27B model with a regression head to predict PHQ-9 depression severity scores directly from AI mental health app conversation transcripts, eliminating the need for explicit self-report completion. The training set of 6,283 users combines 3,111 ground-truth labels with pseudolabels generated by Claude Opus and iterative intermediate models. On a held-out test of 842 users, the best model achieves MAE=2.6, Pearson r=0.80, and AUC=0.91 at the clinical PHQ-9≥10 threshold, with AUC>0.87 across all severity thresholds. The work demonstrates a passive, continuous symptom-monitoring approach that could reduce response bias in mental health platforms.
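The three reported metrics (MAE, Pearson r, and AUC at the PHQ-9≥10 threshold) can be computed as in the following sketch. The scores below are invented for illustration and are not the study's data; the function names are hypothetical, and the AUC is computed by the standard pairwise-ranking definition rather than any specific library the authors may have used.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted PHQ-9 scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def auc_at_threshold(y_true, y_pred, threshold=10):
    """AUC for the binary task 'true score >= threshold', using the predicted
    score as the ranking signal. Ties between a positive and a negative count 0.5."""
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up true and predicted PHQ-9 scores, for illustration only.
true_scores = [2, 5, 9, 10, 14, 20]
pred_scores = [3.0, 4.0, 11.0, 9.0, 15.0, 18.0]

mae = mean_absolute_error(true_scores, pred_scores)
r = pearson_r(true_scores, pred_scores)
auc = auc_at_threshold(true_scores, pred_scores, threshold=10)
print(mae, r, auc)
```

Repeating `auc_at_threshold` with other cut-offs gives the per-severity-threshold AUCs of the kind the summary mentions.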
Can Foundation Models Label Data Like Humans?
This Hugging Face blog post examines whether foundation models can serve as substitutes for human annotators in RLHF data labeling pipelines. It investigates the reliability and quality of model-generated preference labels compared to human-generated ones, with implications for scalable oversight and alignment research. The analysis is framed around the Open LLM Leaderboard and RLHF methodology.
The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare
Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.
DeepMind Launches 27B Parameter Gemma-Based Foundation Model for Single-Cell Analysis
DeepMind has released a new 27-billion-parameter foundation model built on the Gemma open-model family, specifically designed for single-cell biological analysis. The model contributed to the discovery of a new potential cancer therapy pathway. This represents a significant application of a large language model architecture to computational biology and genomics research.
OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training
Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus with reasoning traces derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark evaluating LVLMs on perception, medical knowledge, and rationale axes. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and achieves performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.
Clinically grounded privacy evaluation framework reveals high memorization risk in medical LMs
Researchers introduce a tiered adversarial framework for evaluating privacy leakage in medical language models, moving beyond simple training-text recovery to realistic clinical threat models. Applied to an LM pretrained on 378k clinical notes, the framework finds that routine encounter metadata (name, DOB, provider, visit date) elicits high verbatim memorization and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). The study also finds that exact-match memorization overstates disclosure risk because 36% of memorized tokens reflect templated documentation. The work provides a practical contextual privacy evaluation methodology for medical LMs trained on longitudinal patient data.


