5 · arXiv cs.CL (Computation and Language) · 15d ago

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking · AI Safety Research · From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation · Meta Llama-3.1-8B

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Meta AI: The Open-Weights Giant Eyeing Superintelligence

Related events (8)

5 · arXiv · cs.CL · 24d ago

When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection

This paper investigates when demographic features improve hate speech detection models that account for annotator perspectives, finding that gains are not universal but depend on specific data and modeling conditions. The authors identify that demographic information helps most in regimes with low training disagreement, high test disagreement, sufficient training data, and strong demographic overlap between train and test sets. They introduce a gated demographic residual model that selectively applies demographic adjustments to text-only predictions, demonstrating effectiveness on high-disagreement and low-confidence examples using the MHS and POPQUORN datasets. The work cautions against assuming demographic features are universally beneficial in subjective NLP tasks.
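
The gating idea in this summary can be illustrated with a minimal sketch. It is not the authors' model: the entropy-based gate, the `sharpness` and `threshold` parameters, and all function names are assumptions made for illustration. The sketch only shows the general shape of a demographic residual that is applied strongly when the text-only prediction is uncertain and almost not at all when it is confident.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; equals 1.0 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def gated_prediction(
    text_logit: float,
    demo_residual: float,
    sharpness: float = 10.0,  # hypothetical: how abruptly the gate opens
    threshold: float = 0.8,   # hypothetical: entropy level where the gate is half open
) -> tuple[float, float]:
    """Return (final probability, gate value).

    The gate is a soft switch in (0, 1) driven by the uncertainty of the
    text-only prediction. The demographic residual is added to the text-only
    logit in proportion to the gate.
    """
    p_text = sigmoid(text_logit)
    gate = sigmoid(sharpness * (binary_entropy(p_text) - threshold))
    return sigmoid(text_logit + gate * demo_residual), gate


if __name__ == "__main__":
    # Confident text-only prediction: the residual is almost ignored.
    print(gated_prediction(text_logit=4.0, demo_residual=1.0))
    # Uncertain text-only prediction: the residual shifts the output.
    print(gated_prediction(text_logit=0.0, demo_residual=1.0))
```

In a trained system the gate and the residual would be learned rather than hand-set. Fixed values are used here only to make the selective behaviour visible.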

Evaluation and Benchmarking · Alignment and RLHF · MHS · POPQUORN · gated demographic residual model

7 · arXiv · cs.CL · 46h ago

LLM psychological profiles are largely measurement artifacts, not model properties

A new arXiv preprint administers a battery of personality and risk-preference instruments to 56 instruction-tuned LLMs alongside large human reference samples, finding that 81-90% of between-model variation is explained by directional response bias rather than the traits the instruments target. The authors introduce the concept of 'response orthogonality' to explain why some instruments appear more reliable than others, and show that apparent psychological profiles can be manufactured through item selection. The findings challenge the validity of using human-designed psychometric tools to characterize LLMs, with direct implications for safety assessment and the use of LLMs as proxies for human participants in research.

Evaluation and Benchmarking · AI Safety Research · Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

5 · arXiv · cs.CL · 1mo ago

Text Analytics Evaluation Framework: Benchmarking LLMs on Social Media NLP Tasks

Researchers introduce a 470-question evaluation framework to assess LLM performance on aggregated social media text, applied to Twitter datasets across sentiment analysis, hate speech detection, and emotion recognition. Results show performance degrades substantially as input scale exceeds 500 instances, particularly for open-weights models on numerical tasks. Multi-label and target-dependent scenarios also show notable performance drops, and task complexity progressively erodes accuracy from basic semantic identification to comparison and counting operations. The findings point to architectural bottlenecks in current LLMs for rigorous quantitative analysis over large text collections.

Long Context Evolution · Evaluation and Benchmarking · Emotion Recognition · Text Analytics Evaluation Framework · X (Twitter)

5 · arXiv · cs.CL · 15d ago

Counterfactual context revision framework for auditing LLM-based stance simulation in online discussions

Researchers introduce a counterfactual context revision framework to audit how LLMs simulate individual users' stances in online discussions. By applying controlled text-only and multimodal (meme-based) revisions to conversational contexts, they measure how readily simulated stances shift in response to semantically independent changes. Results show effective and robust stance transitions across both revision types and polarization-preference mechanisms, raising concerns about whether LLM simulations reflect genuine user-specific beliefs or are highly context-sensitive artifacts. The work contributes an evaluation framework and highlights risks of using LLMs to model online opinion dynamics.

Evaluation and Benchmarking · AI Safety Research · Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions

4 · arXiv · cs.CL · 25d ago

WhoSaidIt: Human-LLM Collaborative Annotation for Multilingual Speaker-Attribute Classification

This paper proposes a human-LLM collaborative re-annotation framework for stabilizing noisy multilingual speaker-attribute labels under resource constraints. LLMs surface recurring annotation rationales through iterative expert interaction, combined with disagreement-focused sampling for targeted re-annotation. The resulting WhoSaidIt dataset covers nine speaker-attribute labels across multiple languages. Benchmarking of recent LLMs reveals substantial cross-lingual annotation divergence and highlights both capabilities and limitations of LLMs in this classification task.

Evaluation and Benchmarking · Agent and Tool Ecosystem · human-LLM collaborative annotation · speaker-attribute classification · WhoSaidIt

7 · arXiv · cs.LG · 1mo ago

AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases

This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.

Evaluation and Benchmarking · AI Safety Research · Grok · X (Twitter) · EU AI Act

4 · arXiv · cs.CL · 10d ago

Calibrated LLM annotation and encoder transfer for measuring human values in social media text

A new arXiv preprint investigates how different LLMs, prompts, and instruction languages operationalize Schwartz's theory of basic human values when annotating non-English social media posts. The authors evaluate annotation quality beyond standard F1 metrics, examining structural alignment, error structure, and confidence-ambiguity relations, finding that iterative prompt calibration reduces misattributions. They also demonstrate that LLM annotations can be transferred to a smaller encoder model via soft-label training, preserving theory-grounded value interpretations and uncertainty information.

Evaluation and Benchmarking · Alignment and RLHF · Schwartz's Theory of Basic Human Values · Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer

6 · arXiv · cs.AI · 18d ago

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

This paper identifies and analyzes 'Perceptual Judgment Bias' in multimodal LLM judges, where models anchor on response text rather than visual evidence when the two conflict. The authors introduce a Perceptually Perturbed Judgment Dataset using counterfactual responses to isolate perceptual errors, and a training framework combining GRPO-based reward modeling with batch-ranking objectives. Experiments on MLLM-as-a-Judge benchmarks show improved perceptual fidelity, ranking coherence, and alignment with human evaluation.

Evaluation and Benchmarking · Alignment and RLHF · Perceptually Perturbed Judgment Dataset · Multimodal Large Language Models · GRPO