Interaction SSD: Modeling Annotator Identity Effects on Hate Speech Semantic Gradients
This paper introduces Interaction SSD, an extension of Supervised Semantic Differential that tests how semantic meaning varies across moderating variables such as annotator group identity. Applied to the UC Berkeley Measuring Hate Speech corpus, the method detects that annotator racial identity significantly moderates hate-speech judgments, with a shared gradient distinguishing dehumanizing hostility from counter-speech and an interaction gradient revealing group-linked differences in predictive semantic cues. The approach makes moderated meaning-outcome relationships statistically testable and interpretable through standard SSD tooling.
Related guides (2)
Related events (8)
LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation
A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.
When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection
This paper investigates when demographic features improve hate speech detection models that account for annotator perspectives, finding that gains are not universal but depend on specific data and modeling conditions. The authors identify that demographic information helps most in regimes with low training disagreement, high test disagreement, sufficient training data, and strong demographic overlap between train and test sets. They introduce a gated demographic residual model that selectively applies demographic adjustments to text-only predictions, demonstrating effectiveness on high-disagreement and low-confidence examples using the MHS and POPQUORN datasets. The work cautions against assuming demographic features are universally beneficial in subjective NLP tasks.
Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection
This paper investigates human disagreement in token-level rationale annotations for hate speech detection, a dimension less studied than label disagreement. The authors unify diverse models, training strategies, loss functions, and evaluation metrics under a single protocol, systematically comparing hard and soft label/rationale representation spaces. Results show that both hard and soft metrics favor softer representations, suggesting that soft supervision better captures human reasoning variation in subjective NLP tasks. The work calls for rethinking evaluation frameworks for classification and explainability in subjective NLP.
Situated Interaction Auditing: A user-centered framework for LLM bias research
Researchers propose Situated Interaction Auditing (SIA), a new framework for studying LLM bias from the perspective of the user rather than third-party demographic representation. The core insight is that bias can manifest in how a model treats its interlocutor — varying response quality, content, and tone based on implicit sociodemographic signals, writing style, or stated identity — rather than only in how it describes external groups. The paper demonstrates SIA through a case study intersecting gender and socioeconomic status signals across multiple task domains and outlines a research agenda for the approach.
Study reveals how self-supervised speech models encode speaker group attributes across fine-tuning stages
Researchers investigate what self-supervised speech recognition models (S3Ms) learn about speaker group categories including gender, age, dialect, ethnicity, and native-speaker status across pretrained, SID-finetuned, ASR-finetuned, and fairness-enhanced states. They find that SID fine-tuning amplifies phonetically variant speaker group information while ASR fine-tuning discards it but retains semantically variant information. Fairness-enhancing ASR algorithms primarily affect phonetically variant speaker group encoding but have limited impact on semantically variant categories. The findings offer guidance for designing fairer ASR systems.
WhoSaidIt: Human-LLM Collaborative Annotation for Multilingual Speaker-Attribute Classification
This paper proposes a human-LLM collaborative re-annotation framework for stabilizing noisy multilingual speaker-attribute labels under resource constraints. LLMs surface recurring annotation rationales through iterative expert interaction, combined with disagreement-focused sampling for targeted re-annotation. The resulting WhoSaidIt dataset covers nine speaker-attribute labels across multiple languages. Benchmarking of recent LLMs reveals substantial cross-lingual annotation divergence and highlights both capabilities and limitations of LLMs in this classification task.
AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases
This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.
Calibrated LLM annotation and encoder transfer for measuring human values in social media text
A new arXiv preprint investigates how different LLMs, prompts, and instruction languages operationalize Schwartz's theory of basic human values when annotating non-English social media posts. The authors evaluate annotation quality beyond standard F1 metrics, examining structural alignment, error structure, and confidence-ambiguity relations, finding that iterative prompt calibration reduces misattributions. They also demonstrate that LLM annotations can be transferred to a smaller encoder model via soft-label training, preserving theory-grounded value interpretations and uncertainty information.

