5 · arXiv · cs.AI (Artificial Intelligence) · 11d ago

Audit of 39 deepfake speech datasets reveals fairness and generalization gaps

A dataset-level audit of 39 deepfake speech datasets examines accessibility, documentation, demographic coverage, scale, and source corpora. The study finds that fairness assessment is largely infeasible due to missing demographic metadata, and that substantial overlap in underlying speech corpora across datasets undermines cross-dataset evaluation and inflates generalization claims. The findings challenge the credibility of robustness and fairness claims made for deepfake speech detectors.

Evaluation and Benchmarking · AI Safety Research · Ethical and Technical Limits of Deepfake Speech Datasets
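To make the source-corpus overlap concern concrete, here is a minimal Python sketch of how one might flag dataset pairs that share underlying speech corpora before treating one as an "unseen" test set for the other. The dataset and corpus names, the `SOURCES` mapping, and the choice of Jaccard similarity are all illustrative assumptions; they are not taken from the audit, which may measure overlap differently.

```python
from itertools import combinations

# Hypothetical mapping from dataset name to the source speech corpora it draws on.
# These names are placeholders, not findings from the audit.
SOURCES = {
    "dataset_a": {"corpus_x", "corpus_y"},
    "dataset_b": {"corpus_x"},
    "dataset_c": {"corpus_z"},
}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_pairs(sources, threshold=0.0):
    """Return (name1, name2, score) for dataset pairs whose overlap exceeds threshold."""
    pairs = []
    for n1, n2 in combinations(sorted(sources), 2):
        score = jaccard(sources[n1], sources[n2])
        if score > threshold:
            pairs.append((n1, n2, score))
    return pairs


if __name__ == "__main__":
    for n1, n2, score in overlapping_pairs(SOURCES):
        print(f"{n1} and {n2} share source corpora (Jaccard = {score:.2f})")
```

A pair flagged this way is a weak candidate for cross-dataset evaluation, since a detector trained on one dataset may already have seen the speakers or recording conditions of the other.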

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.AI · 11d ago

Explainability pipeline reveals divergent cues used by deepfake speech detectors

Researchers propose an audio-native explainability pipeline that applies Integrated Gradients to time-aligned self-supervised representations to localize decision evidence in deepfake speech detectors. Applied to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on the ASVspoof 5 benchmark, the method reveals that, despite similar performance, the detectors rely on fundamentally different cues: environmental noise, phoneme artifacts, and word boundaries, respectively. Causal masking experiments validate the findings by confirming that performance degrades when each detector's primary cues are removed. The work advances the interpretability of audio deepfake detection, which is relevant to AI safety and media authenticity.

Evaluation and Benchmarking · AI Safety Research · CA-MHFA · Integrated Gradients · SLS

6 · arXiv · cs.CL · 5d ago

Computational audit finds ClinicalBERT amplifies demographic bias beyond training data distributions

Researchers present a systematic audit of representational bias in ClinicalBERT, a BERT-based model pretrained on MIMIC-III clinical discharge summaries, using two probing methodologies: Log Probability Bias Analysis and Masked Language Model probing across 98 clinical sentence templates and eight intersectional race-gender combinations. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing. The key finding is that bias in ClinicalBERT operates predominantly through model-internal amplification rather than simple inheritance from training data, which has direct implications for clinical AI safety and deployment. This challenges the assumption that auditing training corpora is sufficient to characterize model bias.

Evaluation and Benchmarking · AI Safety Research · A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions · MIMIC-III · ClinicalBERT

6 · Google DeepMind Blog · 1mo ago

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models

DeepMind has released the FACTS Benchmark Suite, a systematic evaluation framework for measuring the factuality of large language models. The benchmark is designed to assess how accurately LLMs produce factually grounded outputs. This represents a structured contribution to the growing field of LLM evaluation, specifically targeting hallucination and factual reliability. The announcement comes from a Tier 1 lab, lending it credibility as a reference benchmark in the field.

Evaluation and Benchmarking · AI Safety Research · FACTS Benchmark Suite · Google DeepMind

4 · The Batch · 1mo ago

Abeba Birhane on Bias in Web-Scraped Training Datasets

Researcher Abeba Birhane examines how large-scale web-scraped datasets used to train trillion-parameter NLP and vision models propagate bias and antisocial content. The commentary highlights that performance gains in deep neural networks come alongside inherited societal biases from web training data. Two posts from The Batch summarize her work on cleaning up web datasets and the specific mechanisms by which NLP models absorb web-sourced biases.

Evaluation and Benchmarking · AI Safety Research · DeepLearning.AI · Abeba Birhane · The Batch

5 · AI Snake Oil · 1mo ago

We Looked at 78 Election Deepfakes. Political Misinformation is not an AI Problem.

An analysis of 78 election-related deepfakes argues that political misinformation is fundamentally not an AI problem, challenging the prevailing narrative that AI-generated content is the primary driver of electoral disinformation. The piece contends that technology is neither the root cause nor the solution to political misinformation. Published on the AI Snake Oil / Normal Tech platform, this represents a data-informed commentary pushing back on AI-centric framings of election integrity concerns.

AI Safety Research · Regulatory Developments · election deepfakes · Normal Tech · AI Snake Oil

5 · arXiv · cs.AI · 11d ago

RAT: Reference-Augmented Training improves deepfake audio detection without reference at inference

Researchers introduce Reference-Augmented Training (RAT), a training strategy for automatic speaker verification (ASV) anti-spoofing that conditions a model on speaker-reference recordings during training. They find that the model learns to ignore the reference at inference. Counterintuitively, the training regime still induces invariances that improve deepfake detection, even when the reference is replaced with a zero vector at test time. RAT achieves a state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, outperforming large ensemble systems.

Evaluation and Benchmarking · AI Safety Research · RAT: Reference-Augmented Training for ASV Anti-Spoofing · Reference-Augmented Training · ASVspoof 5
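For readers unfamiliar with the equal error rate (EER) quoted above: it is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The sketch below approximates it with a simple threshold sweep in Python. It is not the ASVspoof evaluation code or the paper's method; the function name, the "higher score means more likely bona fide" convention, and the toy scores are assumptions for illustration only.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    Assumed convention: a higher score means "more likely bona fide",
    and a trial is accepted when its score is >= the threshold.
    """
    best_gap, eer = None, None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores, not real detector outputs.
    bonafide = [0.9, 0.8, 0.7, 0.3]
    spoof = [0.1, 0.2, 0.4, 0.6]
    print(f"EER = {equal_error_rate(bonafide, spoof):.2%}")
```

With a finite set of scores the two error rates rarely cross exactly, so this sketch reports the average of the two at the closest threshold; evaluation toolkits typically interpolate instead.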

5 · OpenAI Blog · 1mo ago

Evaluating Fairness in ChatGPT

OpenAI published an analysis of how ChatGPT responds differently to users based on their names, using AI research assistants to conduct the evaluation while protecting user privacy. The study examines potential demographic or identity-based disparities in model outputs. This represents OpenAI's ongoing internal fairness and bias evaluation work on its flagship product.

Evaluation and Benchmarking · AI Safety Research · ChatGPT · OpenAI

4 · arXiv · cs.CL · 11d ago

Study reveals how self-supervised speech models encode speaker group attributes across fine-tuning stages

Researchers investigate what self-supervised speech recognition models (S3Ms) learn about speaker group categories, including gender, age, dialect, ethnicity, and native-speaker status, across four model states: pretrained, fine-tuned for speaker identification (SID), fine-tuned for automatic speech recognition (ASR), and fairness-enhanced. They find that SID fine-tuning amplifies phonetically variant speaker group information, while ASR fine-tuning discards it but retains semantically variant information. Fairness-enhancing ASR algorithms primarily affect the encoding of phonetically variant speaker groups and have limited impact on semantically variant categories. The findings offer guidance for designing fairer ASR systems.

Evaluation and Benchmarking · AI Safety Research · Speaker Group Encoding in Self-supervised Speech Recognition Models