4 · arXiv cs.CL (Computation and Language) · 24d ago

Interaction SSD: Modeling Annotator Identity Effects on Hate Speech Semantic Gradients

This paper introduces Interaction SSD, an extension of Supervised Semantic Differential that tests how semantic meaning varies across moderating variables such as annotator group identity. Applied to the UC Berkeley Measuring Hate Speech corpus, the method detects that annotator racial identity significantly moderates hate-speech judgments, with a shared gradient distinguishing dehumanizing hostility from counter-speech and an interaction gradient revealing group-linked differences in predictive semantic cues. The approach makes moderated meaning-outcome relationships statistically testable and interpretable through standard SSD tooling.

Evaluation and Benchmarking · AI Safety Research · Supervised Semantic Differential · Interaction SSD · UC Berkeley Measuring Hate Speech Corpus · University of California, Berkeley

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 15d ago

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking · AI Safety Research · From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation · Meta Llama-3.1-8B

5 · arXiv · cs.CL · 24d ago

When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection

This paper investigates when demographic features improve hate speech detection models that account for annotator perspectives, finding that gains are not universal but depend on specific data and modeling conditions. The authors identify that demographic information helps most in regimes with low training disagreement, high test disagreement, sufficient training data, and strong demographic overlap between train and test sets. They introduce a gated demographic residual model that selectively applies demographic adjustments to text-only predictions, demonstrating effectiveness on high-disagreement and low-confidence examples using the MHS and POPQUORN datasets. The work cautions against assuming demographic features are universally beneficial in subjective NLP tasks.

Evaluation and Benchmarking · Alignment and RLHF · MHS · POPQUORN · gated demographic residual model · +2 more

5 · arXiv · cs.CL · 19d ago

Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

This paper investigates human disagreement in token-level rationale annotations for hate speech detection, a dimension less studied than label disagreement. The authors unify diverse models, training strategies, loss functions, and evaluation metrics under a single protocol, systematically comparing hard and soft label/rationale representation spaces. Results show that both hard and soft metrics favor softer representations, suggesting that soft supervision better captures human reasoning variation in subjective NLP tasks. The work calls for rethinking evaluation frameworks for classification and explainability in subjective NLP.

Evaluation and Benchmarking · Alignment and RLHF · Token-level Rationales · Faithfulness Evaluation · Plausibility Evaluation · +2 more

5 · arXiv · cs.CL · 9d ago

Situated Interaction Auditing: A user-centered framework for LLM bias research

Researchers propose Situated Interaction Auditing (SIA), a new framework for studying LLM bias from the perspective of the user rather than third-party demographic representation. The core insight is that bias can manifest in how a model treats its interlocutor — varying response quality, content, and tone based on implicit sociodemographic signals, writing style, or stated identity — rather than only in how it describes external groups. The paper demonstrates SIA through a case study intersecting gender and socioeconomic status signals across multiple task domains and outlines a research agenda for the approach.

Evaluation and Benchmarking · AI Safety Research · Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research · Situated Interaction Auditing

4 · arXiv · cs.CL · 11d ago

Study reveals how self-supervised speech models encode speaker group attributes across fine-tuning stages

Researchers investigate what self-supervised speech recognition models (S3Ms) learn about speaker group categories, including gender, age, dialect, ethnicity, and native-speaker status, across four states: pretrained, fine-tuned for speaker identification (SID), fine-tuned for automatic speech recognition (ASR), and fairness-enhanced. They find that SID fine-tuning amplifies phonetically variant speaker group information, while ASR fine-tuning discards it but retains semantically variant information. Fairness-enhancing ASR algorithms primarily affect phonetically variant speaker group encoding and have limited impact on semantically variant categories. The findings offer guidance for designing fairer ASR systems.

Evaluation and Benchmarking · AI Safety Research · Speaker Group Encoding in Self-supervised Speech Recognition Models

4 · arXiv · cs.CL · 25d ago

WhoSaidIt: Human-LLM Collaborative Annotation for Multilingual Speaker-Attribute Classification

This paper proposes a human-LLM collaborative re-annotation framework for stabilizing noisy multilingual speaker-attribute labels under resource constraints. LLMs surface recurring annotation rationales through iterative expert interaction, combined with disagreement-focused sampling for targeted re-annotation. The resulting WhoSaidIt dataset covers nine speaker-attribute labels across multiple languages. Benchmarking of recent LLMs reveals substantial cross-lingual annotation divergence and highlights both capabilities and limitations of LLMs in this classification task.

Evaluation and Benchmarking · Agent and Tool Ecosystem · human-LLM collaborative annotation · speaker-attribute classification · WhoSaidIt · +1 more

7 · arXiv · cs.LG · 1mo ago

AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases

This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.

Evaluation and Benchmarking · AI Safety Research · Grok · X (Twitter) · EU AI Act · +5 more

4 · arXiv · cs.CL · 10d ago

Calibrated LLM annotation and encoder transfer for measuring human values in social media text

A new arXiv preprint investigates how different LLMs, prompts, and instruction languages operationalize Schwartz's theory of basic human values when annotating non-English social media posts. The authors evaluate annotation quality beyond standard F1 metrics, examining structural alignment, error structure, and confidence-ambiguity relations, finding that iterative prompt calibration reduces misattributions. They also demonstrate that LLM annotations can be transferred to a smaller encoder model via soft-label training, preserving theory-grounded value interpretations and uncertainty information.

Evaluation and Benchmarking · Alignment and RLHF · Schwartz's Theory of Basic Human Values · Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer