5 · arXiv cs.CL (Computation and Language) · 24d ago

When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection

This paper investigates when demographic features improve hate speech detection models that account for annotator perspectives, and finds that the gains are not universal but depend on specific data and modeling conditions. Demographic information helps most when training disagreement is low, test disagreement is high, training data is sufficient, and demographic overlap between train and test sets is strong. The authors introduce a gated demographic residual model that selectively applies demographic adjustments to text-only predictions, and show that it is effective on high-disagreement and low-confidence examples from the Measuring Hate Speech (MHS) and POPQUORN datasets. The work cautions against assuming demographic features are universally beneficial in subjective NLP tasks.
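
To make the idea of a gated residual concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary only says that demographic adjustments are applied selectively on top of text-only predictions, so the additive-logit form, the sigmoid gate, and the names `text_logit`, `demo_residual`, and `gate_logit` are all assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_probability(text_logit: float,
                               demo_residual: float,
                               gate_logit: float) -> float:
    """Combine a text-only logit with a gated demographic adjustment.

    Hypothetical inputs (assumed, not taken from the paper):
      text_logit    -- score from a text-only classifier
      demo_residual -- correction proposed from annotator demographics
      gate_logit    -- score deciding how much of the correction to apply
    """
    gate = sigmoid(gate_logit)  # in (0, 1): fraction of the residual applied
    return sigmoid(text_logit + gate * demo_residual)


# Gate nearly closed: the prediction stays at the text-only value.
closed = gated_residual_probability(0.0, 2.0, -50.0)
# Gate nearly open: the full demographic residual shifts the prediction.
opened = gated_residual_probability(0.0, 2.0, 50.0)
print(round(closed, 3), round(opened, 3))  # prints: 0.5 0.881
```

Under these assumptions, a closed gate leaves the text-only prediction untouched, which is one way a model could restrict demographic adjustments to the examples where they help.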

Evaluation and Benchmarking · Alignment and RLHF · MHS · POPQUORN · gated demographic residual model · hate speech detection · annotator disagreement

Related guides (2)

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 15d ago

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking · AI Safety Research · From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation · Meta · Llama-3.1-8B

5 · arXiv · cs.CL · 19d ago

Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

This paper investigates human disagreement in token-level rationale annotations for hate speech detection, a dimension less studied than label disagreement. The authors unify diverse models, training strategies, loss functions, and evaluation metrics under a single protocol, systematically comparing hard and soft label/rationale representation spaces. Results show that both hard and soft metrics favor softer representations, suggesting that soft supervision better captures human reasoning variation in subjective NLP tasks. The work calls for rethinking evaluation frameworks for classification and explainability in subjective NLP.

Evaluation and Benchmarking · Alignment and RLHF · Token-level Rationales · Faithfulness Evaluation · Plausibility Evaluation · +2 more

4 · arXiv · cs.CL · 24d ago

Interaction SSD: Modeling Annotator Identity Effects on Hate Speech Semantic Gradients

This paper introduces Interaction SSD, an extension of Supervised Semantic Differential that tests how semantic meaning varies across moderating variables such as annotator group identity. Applied to the UC Berkeley Measuring Hate Speech corpus, the method detects that annotator racial identity significantly moderates hate-speech judgments, with a shared gradient distinguishing dehumanizing hostility from counter-speech and an interaction gradient revealing group-linked differences in predictive semantic cues. The approach makes moderated meaning-outcome relationships statistically testable and interpretable through standard SSD tooling.

Evaluation and Benchmarking · AI Safety Research · Supervised Semantic Differential · Interaction SSD · UC Berkeley Measuring Hate Speech Corpus · +1 more

4 · arXiv · cs.CL · 11d ago

Study reveals how self-supervised speech models encode speaker group attributes across fine-tuning stages

Researchers investigate what self-supervised speech recognition models (S3Ms) learn about speaker group categories, including gender, age, dialect, ethnicity, and native-speaker status, across four states: pretrained, fine-tuned for speaker identification (SID), fine-tuned for automatic speech recognition (ASR), and fairness-enhanced. They find that SID fine-tuning amplifies phonetically variant speaker group information, while ASR fine-tuning discards it but retains semantically variant information. Fairness-enhancing ASR algorithms primarily affect phonetically variant speaker group encoding and have limited impact on semantically variant categories. The findings offer guidance for designing fairer ASR systems.

Evaluation and Benchmarking · AI Safety Research · Speaker Group Encoding in Self-supervised Speech Recognition Models

4 · arXiv · cs.CL · 29d ago

Systematic Study of Schwartz Value Detection in Political Texts: Context, Scale, and Moral Knowledge

This paper investigates when additional context, larger models, or retrieved moral knowledge improve detection of Schwartz human values in political text using the ValueEval benchmark format. Key findings show that full-document context helps supervised DeBERTa encoders (+3.8–4.8 macro-F1) but not zero-shot LLMs, while RAG with a curated moral knowledge base consistently benefits all model families under early fusion. Scaling model size does not guarantee gains, and simple early fusion outperforms more complex RAG variants. The study recommends jointly evaluating context, knowledge, and model family rather than assuming larger inputs or models universally improve value-sensitive NLP.

Evaluation and Benchmarking · Agent and Tool Ecosystem · TouchéValueML · DeBERTa-v3 · Retrieval-Augmented Generation · +2 more

7 · arXiv · cs.LG · 1mo ago

AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases

This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.

Evaluation and Benchmarking · AI Safety Research · Grok · X (Twitter) · EU AI Act · +5 more

4 · The Batch · 1mo ago

Abeba Birhane on Bias in Web-Scraped Training Datasets

Researcher Abeba Birhane examines how large-scale web-scraped datasets used to train trillion-parameter NLP and vision models propagate bias and antisocial content. The commentary highlights that performance gains in deep neural networks come alongside inherited societal biases from web training data. Two posts from The Batch summarize her work on cleaning up web datasets and the specific mechanisms by which NLP models absorb web-sourced biases.

Evaluation and Benchmarking · AI Safety Research · DeepLearning.AI · Abeba Birhane · The Batch

5 · arXiv · cs.CL · 1mo ago

Text Analytics Evaluation Framework: Benchmarking LLMs on Social Media NLP Tasks

Researchers introduce a 470-question evaluation framework to assess LLM performance on aggregated social media text, applied to Twitter datasets across sentiment analysis, hate speech detection, and emotion recognition. Results show performance degrades substantially as input scale exceeds 500 instances, particularly for open-weights models on numerical tasks. Multi-label and target-dependent scenarios also show notable performance drops, and task complexity progressively erodes accuracy from basic semantic identification to comparison and counting operations. The findings point to architectural bottlenecks in current LLMs for rigorous quantitative analysis over large text collections.

Long Context Evolution · Evaluation and Benchmarking · Emotion Recognition · Text Analytics Evaluation Framework · X (Twitter) · +3 more