3 · arXiv cs.CL (Computation and Language) · 15h ago

Domain-specific transformer embeddings for detecting dosing errors in clinical trial protocols

Researchers from CaresAI evaluated biomedical transformer models (ClinicalBERT, PubMedBERT, BioBERT, MedCPT) for detecting dosing errors in clinical trial protocols, combining text embeddings with structured metadata and classical ML classifiers. BioBERT achieved the best single-encoder performance at ROC-AUC 0.794, while gradient boosting and SVM ensembles reached 0.821–0.853. The study finds that domain alignment of the encoder matters more than stacking multiple embeddings, and demonstrates a practical NLP pipeline for clinical trial safety monitoring.

Evaluation and Benchmarking BioBERT CaresAI ClinicalBERT CT-DEB26 PubMedBERT MedCPT
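
The pipeline summarized above joins a text embedding with structured metadata and scores the result by ROC-AUC. A minimal sketch of those two steps follows, using only the Python standard library. The function names, the toy labels, and the toy scores are hypothetical and are not taken from the paper; a real pipeline would obtain the embedding from an encoder such as BioBERT and the scores from a trained classifier.

```python
from typing import List, Sequence


def combine_features(embedding: Sequence[float], metadata: Sequence[float]) -> List[float]:
    """Concatenate a text embedding with structured metadata features."""
    return list(embedding) + list(metadata)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = protocol contains a dosing error, 0 = it does not.
features = combine_features([0.12, -0.40, 0.88], [1.0, 0.0])  # embedding + metadata
labels = [1, 0, 1, 0, 1]
scores = [0.90, 0.40, 0.35, 0.20, 0.80]  # stand-in classifier outputs
auc = roc_auc(labels, scores)
print(f"{len(features)} features, ROC-AUC = {auc:.3f}")
```

The pairwise definition is quadratic in the number of examples, which is fine for illustration; library implementations typically use a sorted, rank-based computation instead.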

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 21d ago

BODHI: Contrastive embedding training for causal discovery in Large Behavioural Models

Researchers identify a critical failure mode in biomedical language model embeddings: off-the-shelf encoders (BioBERT, PubMedBERT, BioM-ELECTRA) assign high cosine similarity (0.76–0.92) to causally unrelated cross-domain pairs, achieving 0% accuracy on cross-domain discrimination. The paper introduces BODHI, a contrastive training approach using hard negatives mined from a biomedical knowledge graph, which improves within-vs-across-domain separation from 1.05x to 2.30x and raises the discrimination gap by 0.392. The work targets Large Behavioural Models (LBMs), foundation models that reason over personal life graphs, where false embedding proximity directly produces false causal edges. Additional contributions include an OpenVINO inference optimization that achieves a 133x latency reduction (1367 ms to 10 ms) on Intel AMX hardware, plus a counterintuitive finding that FP16 outperforms INT8 on this silicon.

Evaluation and Benchmarking Inference Economics BIOSSES BioBERT PubMedBERT +4 more
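
To make the within-vs-across-domain separation figure in this summary concrete, the sketch below computes cosine similarity over all embedding pairs and takes the ratio of mean within-domain similarity to mean across-domain similarity. That ratio is an assumed definition chosen for illustration; the summary does not state how the paper computes its separation metric. The domain names and 2-d vectors are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separation_ratio(domains: Dict[str, List[Sequence[float]]]) -> float:
    """Mean within-domain similarity divided by mean across-domain similarity.

    ASSUMPTION: this ratio is one plausible reading of "separation";
    it is not confirmed as the paper's metric.
    """
    within: List[float] = []
    across: List[float] = []
    items = [(name, vec) for name, vecs in domains.items() for vec in vecs]
    for (d1, v1), (d2, v2) in combinations(items, 2):
        (within if d1 == d2 else across).append(cosine(v1, v2))
    return (sum(within) / len(within)) / (sum(across) / len(across))


# Hypothetical toy embeddings for two unrelated domains.
toy = {
    "cardiology": [[1.0, 0.0], [1.0, 0.0]],
    "sleep": [[0.6, 0.8], [0.6, 0.8]],
}
ratio = separation_ratio(toy)
print(f"separation = {ratio:.2f}x")
```

A ratio near 1.0 means the encoder treats cross-domain pairs as roughly as similar as same-domain pairs, which is the failure mode the summary describes. The ratio is only meaningful when the mean across-domain similarity is positive.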

6 · arXiv · cs.CL · 15d ago

Computational audit finds ClinicalBERT amplifies demographic bias beyond training data distributions

Researchers present a systematic audit of representational bias in ClinicalBERT, a BERT-based model pretrained on MIMIC-III clinical discharge summaries, using two probing methodologies: Log Probability Bias Analysis and Masked Language Model probing across 98 clinical sentence templates and eight intersectional race-gender combinations. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing. The key finding is that bias in ClinicalBERT operates predominantly through model-internal amplification rather than simple inheritance from training data, which has direct implications for clinical AI safety and deployment. This challenges the assumption that auditing training corpora is sufficient to characterize model bias.

Evaluation and Benchmarking AI Safety Research A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions MIMIC-III ClinicalBERT +1 more

4 · arXiv · cs.CL · 29d ago

IndicBERT-HPA: Reliability-Oriented Multilingual Orthopedic Decision Support with Selective Verification Deferral

This paper presents a framework for classifying free-text orthopedic clinical notes in English, Hindi, and Punjabi, introducing IndicBERT-HPA, a domain-adaptive encoder augmented with language-aware orthopedic adapter heads. The system is evaluated against multilingual transformers, a DistilBERT baseline, and zero-shot LLMs, with zero-shot LLMs found substantially less effective than task-adapted encoders for closed-set clinical classification. IndicBERT-HPA achieves Macro-F1 of 0.8792 and AUPRC of 0.902 under natural clinical prevalence. A deterministic selective-verification layer combining confidence gating, evidence-consistency checking, and language-risk screening improves accuracy from 71.5% to 84.4% at 72.3% coverage on a 5,000-record held-out set.

Evaluation and Benchmarking Enterprise Deployment Patterns confidence gating language-aware adapter heads IndicBERT-HPA +3 more
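
The accuracy-versus-coverage trade-off reported in this summary can be illustrated with confidence gating alone: predictions below a confidence threshold are deferred, and accuracy is measured only on the records that remain. The sketch below is a simplified illustration under that assumption. It does not model the evidence-consistency or language-risk checks, and the function name, threshold, and toy records are hypothetical.

```python
from typing import Sequence, Tuple


def selective_accuracy(
    confidences: Sequence[float],
    predictions: Sequence[str],
    labels: Sequence[str],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, accuracy on the records that pass the confidence gate).

    Records with confidence below `threshold` are deferred for verification
    and excluded from the accuracy figure.
    """
    kept = [
        (pred, gold)
        for conf, pred, gold in zip(confidences, predictions, labels)
        if conf >= threshold
    ]
    coverage = len(kept) / len(labels)
    if not kept:
        return coverage, float("nan")  # nothing answered, accuracy undefined
    accuracy = sum(pred == gold for pred, gold in kept) / len(kept)
    return coverage, accuracy


# Hypothetical toy records.
confidences = [0.95, 0.55, 0.80, 0.40]
predictions = ["fracture", "sprain", "fracture", "sprain"]
labels = ["fracture", "fracture", "fracture", "sprain"]

full = selective_accuracy(confidences, predictions, labels, threshold=0.0)
gated = selective_accuracy(confidences, predictions, labels, threshold=0.7)
print(f"no gate: coverage={full[0]:.2f}, accuracy={full[1]:.2f}")
print(f"gated:   coverage={gated[0]:.2f}, accuracy={gated[1]:.2f}")
```

Raising the threshold trades coverage for accuracy, so the two numbers are only informative when reported together, as the summary does.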

4 · arXiv · cs.CL · 1mo ago

Automated ICD Classification of Psychiatric Diagnoses Using NLP and LLMs

This study evaluates NLP and ML approaches for automating the mapping of free-text psychiatric descriptions to ICD diagnostic codes, using a dataset of 145,513 Spanish clinical records. Methods range from classical BoW/TF-IDF representations to transformer-based embeddings including e5_large, BioLORD, and Llama-3-8B. Fine-tuned e5_large achieved the best performance with a micro-F1 of 0.866, outperforming classical methods by capturing semantic nuance and medical terminology. The work highlights challenges of long-tail label distributions and ambiguity specific to psychiatric clinical language.

Enterprise Deployment Patterns Agent and Tool Ecosystem International Classification of Diseases (ICD) e5_large Bag of Words (BoW) +3 more

4 · arXiv · cs.CL · 39h ago

Multi-stage explainability framework translates transformer speech models into clinical cognitive impairment narratives

A new arXiv preprint proposes a framework for making transformer-based detection of cognitive impairment from speech clinically interpretable by combining SHAP token attribution, linguistic feature analysis, and a four-stage LLM reasoning pipeline using LLaMA-3.1-70B-Instruct. The system is built on the SpeechCARE-Adaptive Gating Network multimodal model (F1=72.11% on NIA PREPARE) and maps outputs to four cognitive-linguistic dimensions. Physician evaluation on 70 samples showed strong alignment with clinical profiles and a System Usability Scale score of 82/100, suggesting potential for integration into clinical workflows.

Evaluation and Benchmarking AI Safety Research NIA PREPARE Llama 3.3 70B Instruct SpeechCARE-Adaptive Gating Network +3 more

4 · arXiv · cs.CL · 28d ago

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics Enterprise Deployment Patterns MIMIC-III Llama 3.1 70B quantization +4 more

4 · arXiv · cs.CL · 7d ago

Energy-based transformers as unified predictors of reading difficulty in computational psycholinguistics

A new arXiv preprint introduces energy-based transformer measures as predictors of human reading difficulty, evaluated across three reading-time corpora (Natural Stories, UCL eye-tracking, UCL self-paced reading). The energy measure outperforms surprisal alone and appears to subsume both surprisal and attention entropy effects, suggesting it could serve as a single unified predictor. The work connects transformer language models to Hopfield networks and dense associative memory literature, marking the first application of energy-based transformer measures in computational psycholinguistics.

Evaluation and Benchmarking Natural Stories Energy-Based Transformers as Predictors of Reading Difficulty Hopfield Networks

4 · arXiv · cs.CL · 14d ago

Transformer embeddings shown to intrinsically encode Russell's circumplex model of emotion geometry

A new arXiv paper investigates whether Transformer-based text and speech encoders (RoBERTa, wav2vec 2.0) recover the geometric structure of Russell's circumplex model of affect — a valence-arousal topology from psychology. Experiments on naturalistic datasets (MSP-Podcast) and LLM-generated stimuli show that multimodal fusion achieves perfect topological alignment with Russell's primary emotion ordering, and zero-shot generic text embeddings place fine-grained emotion terms near their human-mapped coordinates. The authors argue this structure is intrinsically encoded in the representations rather than being an artifact of labeling, bridging psychological theory and representation learning.

Evaluation and Benchmarking Multimodal Progress Data-Driven Decoding of Russell's Circumplex Model of Affect RoBERTa MSP-Podcast +1 more