6 · arXiv cs.CL (Computation and Language) · 5d ago

Computational audit finds ClinicalBERT amplifies demographic bias beyond training data distributions

Researchers present a systematic audit of representational bias in ClinicalBERT, a BERT-based model pretrained on MIMIC-III clinical discharge summaries. The audit uses two probing methods, Log Probability Bias Analysis and Masked Language Model (MLM) probing, across 98 clinical sentence templates and eight intersectional race-gender combinations. Of 32 statistically significant findings, 65.6% (21 of 32) contradict the distributions observed in the training corpus, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing. The key finding is that bias in ClinicalBERT operates predominantly through model-internal amplification rather than simple inheritance from training data. This challenges the assumption that auditing a training corpus is sufficient to characterize model bias, and it has direct implications for clinical AI safety and deployment.
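
To make the "contradicts the corpus" criterion concrete, here is a minimal sketch; it is not the authors' code. It assumes a log-probability bias score of the form log(p_target / p_prior), one common formulation in the masked-language-model bias literature, and it treats a finding as contradicting the corpus when the model's gap between two groups has the opposite sign from the corpus gap. The function names and every probability below are hypothetical illustration values, not results from the paper.

```python
import math

def log_prob_bias(p_target: float, p_prior: float) -> float:
    """Assumed score: log ratio of the probability the model gives a
    demographic term in a full template (p_target) to its probability
    when the attribute words are masked as well (p_prior)."""
    return math.log(p_target / p_prior)

def contradicts_corpus(model_gap: float, corpus_gap: float) -> bool:
    """True when the model's association points the opposite way
    from the association measured in the training corpus."""
    return model_gap * corpus_gap < 0

# Hypothetical model probabilities for two demographic groups, A and B.
score_a = log_prob_bias(p_target=0.030, p_prior=0.020)
score_b = log_prob_bias(p_target=0.010, p_prior=0.020)
model_gap = score_a - score_b  # positive: the model favors group A

# Hypothetical corpus rates: the attribute co-occurs more often with group B.
corpus_gap = math.log(0.40 / 0.60)  # negative: the corpus favors group B

print(contradicts_corpus(model_gap, corpus_gap))
```

In this toy case the model favors group A while the corpus favors group B, so the finding would count toward the "contradicts corpus distributions" share reported in the summary.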

Evaluation and Benchmarking AI Safety Research A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions MIMIC-III ClinicalBERT Log Probability Bias Analysis

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · The Batch · 1mo ago

Abeba Birhane on Bias in Web-Scraped Training Datasets

Researcher Abeba Birhane examines how large-scale web-scraped datasets used to train trillion-parameter NLP and vision models propagate bias and antisocial content. The commentary highlights that performance gains in deep neural networks come alongside inherited societal biases from web training data. Two posts from The Batch summarize her work on cleaning up web datasets and the specific mechanisms by which NLP models absorb web-sourced biases.

Evaluation and Benchmarking AI Safety Research DeepLearning.AI Abeba Birhane The Batch

6 · arXiv · cs.AI · 1mo ago

Auditing Value Pluralism in Clinical Ethics of Large Language Models

Researchers present a framework for auditing ethical value pluralism in medical AI, comprising a benchmark of clinician-verified dilemmas and an attribution method that recovers value priorities from model decisions. While frontier LLMs span physician-level value heterogeneity in aggregate and discuss competing values in reasoning, individual model decisions are near-deterministic and fail to reproduce the distributional pluralism of physician panels. Some models systematically underweight patient autonomy. The authors warn that deploying a single LLM at scale risks replacing clinical pluralism with a 'deployment monoculture.'

Evaluation and Benchmarking AI Safety Research Clinical Ethics Benchmark Value Pluralism Audit Framework Overton Pluralism

5 · arXiv · cs.CL · 11d ago

BODHI: Contrastive embedding training for causal discovery in Large Behavioural Models

Researchers identify a critical failure mode in biomedical language model embeddings: off-the-shelf encoders (BioBERT, PubMedBERT, BioM-ELECTRA) assign high cosine similarity (0.76–0.92) to causally unrelated cross-domain pairs and achieve 0% accuracy on cross-domain discrimination. The paper introduces BODHI, a contrastive training approach that uses hard negatives mined from a biomedical knowledge graph; it improves within-vs-across-domain separation from 1.05x to 2.30x and raises the discrimination gap by 0.392. The work targets Large Behavioural Models (LBMs), foundation models that reason over personal life graphs, where false embedding proximity directly produces false causal edges. Additional contributions include an OpenVINO inference optimization that achieves a 133x latency reduction (1367ms to 10ms) on Intel AMX hardware, plus a counterintuitive finding that FP16 outperforms INT8 on this silicon.
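
As a rough illustration of what a within-vs-across-domain separation figure measures, the sketch below assumes it is the ratio of mean within-domain cosine similarity to mean across-domain cosine similarity. The summary does not give the paper's exact definition, so that ratio, the function names, and the toy 2-D vectors are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def separation_ratio(embeddings, domains):
    """Mean within-domain similarity divided by mean across-domain
    similarity (assumed definition). A value near 1.0 means the encoder
    barely distinguishes domains; larger values mean cleaner separation."""
    within, across = [], []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine(embeddings[i], embeddings[j])
            (within if domains[i] == domains[j] else across).append(sim)
    return (sum(within) / len(within)) / (sum(across) / len(across))

# Hypothetical toy embeddings: two items per domain.
domains = ["a", "a", "b", "b"]
collapsed = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
separated = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]

print(separation_ratio(collapsed, domains))
print(separation_ratio(separated, domains))
```

The collapsed case, where every pair looks equally similar, gives a ratio of 1.0. That is the failure mode the summary describes for off-the-shelf encoders at 1.05x.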

Evaluation and Benchmarking Inference Economics BIOSSES BioBERT PubMedBERT

7 · arXiv · cs.CL · 25d ago

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA) SWE-Bench Verified

5 · arXiv · cs.CL · 12d ago

Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks

Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.

Evaluation and Benchmarking AI Safety Research BioLlama3 BioBERT MedMCQA

4 · Hugging Face Blog · 1mo ago

Evaluating Language Model Bias with 🤗 Evaluate

This Hugging Face blog post introduces tooling and methodology for evaluating bias in language models using the Evaluate library. It covers bias measurement approaches and how practitioners can apply them to assess fairness properties of LLMs. The post is oriented toward applied practitioners working with open-source models.

Evaluation and Benchmarking AI Safety Research Hugging Face Evaluate Hugging Face

4 · Hugging Face Blog · 1mo ago

Ethics and Society Newsletter #4: Bias in Text-to-Image Models

Hugging Face's Ethics and Society team publishes their fourth newsletter focusing on bias in text-to-image generative models. The piece examines how these models encode and reproduce societal biases in visual outputs, likely covering evaluation methods, documented failure modes, and mitigation approaches. As a Tier 2 commentary piece from a major ML platform, it contributes to ongoing discourse around fairness and safety in multimodal AI systems.

Evaluation and Benchmarking AI Safety Research Hugging Face Ethics and Society Team text-to-image models Hugging Face

6 · arXiv · cs.CL · 11d ago

Clinically grounded privacy evaluation framework reveals high memorization risk in medical LMs

Researchers introduce a tiered adversarial framework for evaluating privacy leakage in medical language models, moving beyond simple training-text recovery to realistic clinical threat models. Applied to an LM pretrained on 378k clinical notes, the framework finds that routine encounter metadata (name, DOB, provider, visit date) elicits high verbatim memorization and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). The study also finds that exact-match memorization overstates disclosure risk because 36% of memorized tokens reflect templated documentation. The work provides a practical contextual privacy evaluation methodology for medical LMs trained on longitudinal patient data.
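
For readers unfamiliar with the AUROC figures quoted above, the following sketch computes AUROC from its pairwise definition: the probability that a randomly chosen positive case receives a higher attack score than a randomly chosen negative case, with ties counted as half. The scores and labels are hypothetical, and nothing here reproduces the paper's attack or data.

```python
def auroc(scores, labels):
    """AUROC by pairwise comparison: the fraction of (positive, negative)
    pairs in which the positive has the higher score; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical attack scores; label 1 means the patient record
# actually carries the sensitive diagnosis.
scores = [0.9, 0.2, 0.8, 0.1]
labels = [1, 1, 0, 0]
print(auroc(scores, labels))  # 0.75: three of four pairs ranked correctly
```

A value of 0.5 corresponds to guessing, so the reported 0.91 and 0.81 indicate that the attack ranks true cases above non-cases far more often than chance.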

Evaluation and Benchmarking AI Safety Research Clinically Grounded Privacy Evaluation of Medical LMs