7 · arXiv cs.AI (Artificial Intelligence) · 29d ago

Foundation Model for Wearable Health Data Pretrained on 1 Trillion Minutes from 5 Million Participants

Researchers propose a large-scale foundation model for wearable health data, pretrained on over one trillion minutes of unlabeled sensor signals from five million participants. Performance improves systematically across 35 health prediction tasks spanning cardiovascular, metabolic, sleep, and mental health domains as model capacity and data volume are scaled jointly. A 'classroom' of LLM agents autonomously searches over configurations for the downstream predictive heads, and the resulting embeddings are integrated into a Personal Health Agent validated by 1,860 clinician ratings. The work also demonstrates label-efficient few-shot learning and generative capabilities for estimating daily health metrics.

Frontier Model Releases · Evaluation and Benchmarking · Enterprise Deployment Patterns · Agent and Tool Ecosystem · LLM Agent Classroom · Personal Health Agent · few-shot learning · Wearable Health Foundation Model · Self-Supervised Pretraining

Related guides (4)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action


Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · arXiv · cs.AI · 5d ago

Benchmark of deep learning architectures for multi-horizon behavioural forecasting in mobile health

A new arXiv preprint benchmarks six deep learning architectures, two zero-shot foundation models, and statistical baselines on multi-horizon behavioural forecasting from wearable and smartphone data across 800+ participants. Key findings include: no single architecture dominates (PatchTST leads among trained models), TimesFM matches or exceeds trained models zero-shot especially in low-data regimes, and participant-level fine-tuning reduces per-feature RMSE by 16–60%. The study is the first to jointly evaluate modern deep learning, foundation models, and personalisation for this domain.

Evaluation and Benchmarking · A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health · TimesFM · TCN

5 · arXiv · cs.AI · 1mo ago

Distilling Tabular Foundation Models for Structured Health Data

This paper investigates knowledge distillation from tabular foundation models (TFMs) to lightweight student models for healthcare applications. The authors address context leakage in in-context TFMs via stratified out-of-fold teacher labeling, evaluating across 19 healthcare datasets, 6 TFM teachers, and 4 student families. Distilled students retain at least 90% of teacher AUC while running 26× faster on CPU, with preserved calibration and fairness properties. Multi-teacher ensembles do not consistently outperform the best single teacher.

Evaluation and Benchmarking · Inference Economics · knowledge distillation · Stratified Out-of-Fold Teacher Labeling · AUC
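
The summary above mentions stratified out-of-fold teacher labeling as the remedy for context leakage. A minimal sketch of that idea follows, assuming a binary task and using only the Python standard library. The names `stratified_folds`, `out_of_fold_labels`, and `toy_teacher` are hypothetical, and `toy_teacher` is a 1-nearest-neighbour stand-in for an in-context tabular foundation model, not the paper's teacher. The toy data is invented for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split row indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


def toy_teacher(ctx_X, ctx_y, query):
    """Hypothetical in-context teacher: copy the label of the nearest context row."""
    nearest = min(
        range(len(ctx_X)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(ctx_X[j], query)),
    )
    return float(ctx_y[nearest])


def out_of_fold_labels(X, y, teacher, k=5, seed=0):
    """Label each row with a teacher whose context excludes that row's fold."""
    soft = [None] * len(X)
    for fold in stratified_folds(y, k, seed):
        held_out = set(fold)
        ctx = [i for i in range(len(X)) if i not in held_out]
        ctx_X = [X[i] for i in ctx]
        ctx_y = [y[i] for i in ctx]
        for i in fold:
            soft[i] = teacher(ctx_X, ctx_y, X[i])
    return soft


# Toy data (invented): rows 4 and 9 carry labels that disagree with their neighbours.
X = [[0.0], [0.1], [0.2], [0.3], [0.15], [1.0], [1.1], [1.2], [1.3], [1.15]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Leaky labeling: each row sits in the teacher's own context, so its label leaks back.
leaky = [toy_teacher(X, y, row) for row in X]
# Out-of-fold labeling: the teacher never sees the row it is asked to label.
oof = out_of_fold_labels(X, y, toy_teacher, k=5)
```

On this toy data the leaky labels reproduce `y` exactly, including the two noisy rows, whereas the out-of-fold labels for those rows follow their neighbours. A student trained on `oof` would therefore learn from the teacher's generalisation rather than from memorised context.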

5 · arXiv · cs.CL · 3d ago

Fine-tuning LLMs to passively estimate depression severity from AI mental health conversations

Researchers fine-tune a Qwen3.5-27B model with a regression head to predict PHQ-9 depression severity scores directly from AI mental health app conversation transcripts, eliminating the need for explicit self-report completion. The training set of 6,283 users combines 3,111 ground-truth labels with pseudolabels generated by Claude Opus and iterative intermediate models. On a held-out test of 842 users, the best model achieves MAE=2.6, Pearson r=0.80, and AUC=0.91 at the clinical PHQ-9≥10 threshold, with AUC>0.87 across all severity thresholds. The work demonstrates a passive, continuous symptom-monitoring approach that could reduce response bias in mental health platforms.

Enterprise Deployment Patterns · Claude Opus 4.6 · Patient Health Questionnaire-9 · Qwen3.6-27B
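
The summary above reports three kinds of metric: MAE, Pearson r, and AUC at the PHQ-9≥10 threshold. As a rough illustration of how such numbers are typically computed from regression outputs, here is a standard-library sketch. The function names and the six toy score pairs are assumptions made for this example and are not taken from the paper.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)


def auc_at_threshold(y_true, y_pred, threshold=10):
    """AUC for separating true scores >= threshold from the rest.

    Uses the pairwise definition: the fraction of (positive, negative)
    pairs in which the positive case has the higher predicted score,
    counting ties as half.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy PHQ-9 scores (invented for illustration only).
true_scores = [2, 5, 9, 12, 18, 21]
pred_scores = [3.0, 4.0, 11.0, 10.0, 16.0, 22.0]

toy_mae = mae(true_scores, pred_scores)
toy_r = pearson_r(true_scores, pred_scores)
toy_auc = auc_at_threshold(true_scores, pred_scores, threshold=10)
```

Repeating `auc_at_threshold` with other cut-offs gives the kind of per-severity AUC sweep that the summary refers to.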

5 · Hugging Face Blog · 1mo ago

Can Foundation Models Label Data Like Humans?

This Hugging Face blog post examines whether foundation models can serve as substitutes for human annotators in RLHF data labeling pipelines. It investigates the reliability and quality of model-generated preference labels compared to human-generated ones, with implications for scalable oversight and alignment research. The analysis is framed around the Open LLM Leaderboard and RLHF methodology.

Evaluation and Benchmarking · Alignment and RLHF · Reinforcement Learning from Human Feedback · Open LLM Leaderboard · Hugging Face

5 · Hugging Face Blog · 1mo ago

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · PubMedQA · Open Medical-LLM Leaderboard · MedMCQA

7 · Google DeepMind Blog · 1mo ago

DeepMind Launches 27B Parameter Gemma-Based Foundation Model for Single-Cell Analysis

DeepMind has released a new 27 billion parameter foundation model built on the Gemma open-model family, specifically designed for single-cell biological analysis. The model contributed to the discovery of a new potential cancer therapy pathway. This represents a significant application of large language model architecture to computational biology and genomics research.

Frontier Model Releases · Open Weights Progress · DeepMind · Gemma · Google

6 · arXiv · cs.CL · 9d ago

OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training

Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus with reasoning traces derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark evaluating LVLMs on perception, medical knowledge, and rationale axes. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and achieves performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.

Evaluation and Benchmarking · Alignment and RLHF · OpenMedReason · OpenMedReason-Bench

6 · arXiv · cs.CL · 11d ago

Clinically grounded privacy evaluation framework reveals high memorization risk in medical LMs

Researchers introduce a tiered adversarial framework for evaluating privacy leakage in medical language models, moving beyond simple training-text recovery to realistic clinical threat models. Applied to an LM pretrained on 378k clinical notes, the framework finds that routine encounter metadata (name, DOB, provider, visit date) elicits high verbatim memorization and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). The study also finds that exact-match memorization overstates disclosure risk because 36% of memorized tokens reflect templated documentation. The work provides a practical contextual privacy evaluation methodology for medical LMs trained on longitudinal patient data.

Evaluation and Benchmarking · AI Safety Research · Clinically Grounded Privacy Evaluation of Medical LMs