6 · arXiv cs.CL (Computation and Language) · 37h ago

AI persuasive framing boosts cooperation in collective dilemmas but antisocial effects are larger and more persistent

A preprint reports a 1,283-participant experiment in which AI assistants nudged behavior in iterated Collective Risk Games. Personalized prosocial framing (matched to Social Value Orientation profiles) increased cooperation and group success, but the effects faded within a few rounds. Critically, when the same AI system was reconfigured to promote selfish behavior, the negative effects were larger and substantially more persistent, an asymmetry that underscores the dual-use risks of AI-driven behavioral influence.

AI Safety Research · Alignment and RLHF · AI Persuasive Framing in Collective Dilemmas · Collective Risk Game · Social Value Orientation

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Related events (8)

5 · Anthropic News · 28d ago

Anthropic publishes structured harm assessment framework covering physical, psychological, economic, and societal impacts

Anthropic has released a policy document describing its evolving framework for assessing and mitigating AI harms across five dimensions: physical, psychological, economic, societal, and individual autonomy impacts. The framework complements the company's existing Responsible Scaling Policy and informs decisions on usage policies, red-teaming, detection, and enforcement. Concrete examples include safeguards for computer use capabilities (fraud, phishing) and a reported 45% reduction in unnecessary refusals in Claude 3.7 Sonnet through improved handling of ambiguous prompts. Anthropic frames this as a work in progress and invites collaboration from the broader AI ecosystem.

AI Safety Research · Alignment and RLHF · Responsible Scaling Policy · Claude 3.7 Sonnet · Anthropic

4 · arXiv cs.AI · 19d ago

Gamified writing experiment studies when humans adopt AI suggestions vs. maintain creative autonomy

An arXiv preprint introduces 'Nonslop,' a gamified writing experiment with 74 participants designed to study authentic human preferences in AI-assisted creative writing. The system deliberately inverts the helpful-assistant pattern by disincentivizing acceptance of AI suggestions, using a simulated dystopian framing to reveal genuine user behavior rather than default compliance. The study analyzes when users choose creative autonomy versus accepting AI assistance across different task types and response characteristics. The findings bear on questions of individual voice, authenticity, and the tension between efficiency and human expression in LLM-augmented writing.

Evaluation and Benchmarking · Nonslop

5 · One Useful Thing · 1mo ago

Personality and Persuasion: Learning from Sycophants

This commentary from One Useful Thing examines the relationship between AI personality design and sycophantic behavior in large language models. The piece explores how model personality traits influence persuasion dynamics and user susceptibility to AI-generated agreement. It draws lessons from sycophancy research to understand broader risks in how AI systems are tuned to be agreeable.

AI Safety Research · Alignment and RLHF · Ethan Mollick · One Useful Thing · sycophancy

5 · arXiv cs.AI · 1mo ago

Human Decision-Making with Persuasive and Narrative LLM Explanations

A large-scale behavioral experiment evaluated how LLM-generated narrative explanations of varying persuasiveness affect human decision-making accuracy in classification tasks. Results showed that persuasiveness level did not meaningfully improve decision accuracy over a simple AI prediction alone, consistent with prior explainable AI research using feature importance methods. Narratives increased reliance on the AI regardless of whether its prediction was correct or incorrect, and more persuasive narratives may have slowed response times and reduced participants' ability to discriminate correct from incorrect AI predictions. The study concludes that narrative explanations involve tradeoffs and warrant further investigation into when and how they should be deployed.

Evaluation and Benchmarking · AI Safety Research · Narrative Explanations · large language models · Explainable AI (XAI)

4 · One Useful Thing · 1mo ago

Against "Brain Damage": AI's Effect on Human Thinking

This commentary from One Useful Thing examines whether AI use helps or harms human cognitive capabilities. The piece engages with the ongoing debate about whether reliance on AI tools degrades or augments human thinking. It likely addresses concerns about cognitive offloading and the conditions under which AI assistance is beneficial versus detrimental.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Ethan Mollick · One Useful Thing

7 · arXiv cs.LG · 1mo ago

AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases

This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.

Evaluation and Benchmarking · AI Safety Research · Grok · X (Twitter) · EU AI Act

4 · OpenAI Blog · 1mo ago

AI Safety Needs Social Scientists

OpenAI published a paper arguing that long-term AI safety research requires social scientists to address uncertainties in human psychology, rationality, emotion, and biases that affect alignment algorithms. The paper contends that aligning advanced AI with human values cannot be solved by machine learning alone. OpenAI announced plans to hire social scientists full-time to work on these problems.

AI Safety Research · Alignment and RLHF · social science · AI alignment · OpenAI

5 · Anthropic News · 18d ago

Anthropic Public Record: First wave survey of 52,000 Americans on AI attitudes

Anthropic released results from its first Anthropic Public Record survey, a nationally representative poll of nearly 52,000 Americans conducted in November–December 2025. Key findings: 64% fear AI-induced job loss (top fear in every state), 56% fear cognitive dependency, over 70% support government regulation of AI, and only 15% trust AI companies to self-govern. The survey found broad bipartisan consensus on AI concerns and accountability, with Americans prioritizing legal liability for AI companies and safety over growth. Anthropic plans to repeat the survey regularly and expand internationally.

AI Safety Research · Regulatory Developments · YouGov · Anthropic Interviewer · Anthropic Public Record