5 · arXiv cs.AI (Artificial Intelligence) · 2d ago

Self-correction preserves chatbot credibility better than external correction, study finds

A between-subjects experiment (N=120) compared three error-correction strategies for social chatbots: webpage retraction, self-correction, and correction by an expert chatbot. All three strategies corrected errors equally well, but only self-correction left the chatbot's trustworthiness and perceived expertise intact. Social connection with the chatbot (measured via social attraction and self-disclosure) amplified belief change, but only when the chatbot corrected itself; when the correction was outsourced to a webpage or an expert chatbot, this effect disappeared entirely. The findings bear directly on how deployed conversational AI systems should handle hallucinations and factual errors.

AI Safety Research · Enterprise Deployment Patterns · Correct Yourself, Keep My Trust: How Self-Correction and Social Connection Shape Credibility in Social Chatbots

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Related events (8)

4 · Hugging Face Blog · 1mo ago

How good are LLMs at fixing their mistakes? A chatbot arena experiment with Keras and TPUs

A Hugging Face blog post describes a chatbot arena experiment evaluating LLMs' ability to self-correct errors, using Keras and TPUs as the infrastructure backbone. The experiment appears to use a head-to-head arena format to assess self-correction capabilities across models. This touches on both evaluation methodology and a core capability question about whether LLMs can reliably identify and fix their own mistakes.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Chatbot Arena · Keras · TPU · +1 more

6 · arXiv · cs.CL · 29d ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · BBC News · GPT-4o mini · +11 more

6 · OpenAI Blog · 1mo ago

OpenAI Improves ChatGPT Mental Health Responses with Expert Collaboration

OpenAI worked with over 170 mental health experts to enhance ChatGPT's handling of sensitive conversations involving distress. The update improves the model's ability to recognize emotional distress, respond with empathy, and direct users to real-world support resources. OpenAI reports a reduction in unsafe responses of up to 80% as a result of these changes.

AI Safety Research · Enterprise Deployment Patterns · ChatGPT · Mental Health Expert Panel (170+) · OpenAI

6 · OpenAI Blog · 1mo ago

How Confessions Can Keep Language Models Honest

OpenAI researchers are developing a training method called 'confessions' that teaches language models to explicitly admit when they have made mistakes or behaved undesirably. The approach aims to improve honesty, transparency, and user trust in model outputs. This represents an alignment-oriented intervention targeting self-reporting of model failures.

AI Safety Research · Alignment and RLHF · Confessions (training method) · OpenAI

5 · Interconnects · 1mo ago

Lossy self-improvement

This commentary from Interconnects argues that AI self-improvement is a real phenomenon but that inherent lossiness in the process prevents it from leading to fast takeoff scenarios. The piece appears to engage with the debate over recursive self-improvement and its implications for AI risk timelines. It offers a nuanced middle-ground position: acknowledging self-improvement capability while contesting the discontinuous-growth narrative common in AI safety discourse.

Frontier Model Releases · AI Safety Research · Interconnects · Recursive Self-Improvement · fast takeoff

5 · arXiv · cs.CL · 47h ago

Study finds no detectable self-preference bias when LLMs revise their own instruction-following drafts

A new arXiv preprint tests whether LLMs resist valid corrections to their own writing by using IFEval's deterministic verifier to establish ground-truth correctness, bypassing model-as-judge subjectivity. Across four mid-tier model families and 85 author-versus-fresh comparisons, no statistically significant self-preference bias was detected (gap -5.1 pp, 95% CI [-12.9, +2.7]). A qualitative finding shows that when authors do reject verified-good fixes, 97% of stated reasons are substantive flaw-catching rather than preference. The result challenges the assumption that documented self-preference in judging tasks extends to self-revision contexts.
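
The reported interval can be sanity-checked with a few lines of arithmetic. The sketch below assumes the interval is a symmetric normal-approximation interval, which the summary does not state; the function name `summarize_ci` and the 1.96 multiplier are illustrative choices, not the preprint's method.

```python
import math


def summarize_ci(estimate, lower, upper, z_crit=1.96):
    """Back out a standard error and two-sided p-value from a reported CI.

    Assumes a symmetric normal-approximation interval. This is an
    assumption: the summary does not say how the interval was computed.
    """
    half_width = (upper - lower) / 2.0
    midpoint = (upper + lower) / 2.0
    std_err = half_width / z_crit
    z = estimate / std_err
    # Two-sided p-value under the standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "midpoint": midpoint,
        "std_err": std_err,
        "z": z,
        "p_value": p_value,
        "contains_zero": lower <= 0.0 <= upper,
    }


# Figures quoted in the summary: gap of -5.1 pp, 95% CI [-12.9, +2.7].
result = summarize_ci(estimate=-5.1, lower=-12.9, upper=2.7)
print(result)
```

Under that assumption the interval is centered on the reported gap and includes zero, and the implied two-sided p-value is roughly 0.2, which is consistent with the "no statistically significant bias" reading.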

Evaluation and Benchmarking · Alignment and RLHF · Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship · IFEval

5 · arXiv · cs.CL · 15d ago

Counterfactual context revision framework for auditing LLM-based stance simulation in online discussions

Researchers introduce a counterfactual context revision framework to audit how LLMs simulate individual users' stances in online discussions. By applying controlled text-only and multimodal (meme-based) revisions to conversational contexts, they measure how readily simulated stances shift in response to semantically independent changes. Results show effective and robust stance transitions across both revision types and polarization-preference mechanisms, raising concerns about whether LLM simulations reflect genuine user-specific beliefs or are highly context-sensitive artifacts. The work contributes an evaluation framework and highlights risks of using LLMs to model online opinion dynamics.

Evaluation and Benchmarking · AI Safety Research · Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions

5 · OpenAI Blog · 1mo ago

Building more helpful ChatGPT experiences for everyone

OpenAI is announcing a set of ChatGPT safety and helpfulness improvements including new parental controls for teen users, routing of sensitive conversations to reasoning models, and partnerships with external experts. The update reflects OpenAI's ongoing effort to balance accessibility with safeguards across different user demographics. Routing sensitive queries to reasoning models is a notable architectural/policy decision that may affect response quality and safety outcomes.

AI Safety Research · Enterprise Deployment Patterns · OpenAI Reasoning Models · ChatGPT · OpenAI