6 · arXiv cs.CL (Computation and Language) · 29d ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemini 3.5 Pro · BBC News · GPT-4o mini · Claude Sonnet 4.5 · Anglophone retrieval bias · Grok 4 · xAI · Google DeepMind · false-premise detection · Gemini 3 Flash · OpenAI · GPT-5.5 · Anthropic

Related guides (4)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Google DeepMind

Google DeepMind: Frontier AI Across Models, Robotics, and Scientific Discovery

GPT-5.5

GPT-5.5: OpenAI's Benchmark-Leading Agentic Model with a Hallucination Problem

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Related events (8)

7 · The Batch · 19d ago

GPT-5.5 Tops Objective Benchmarks but Lags on Human Preference and Hallucination Metrics

OpenAI released GPT-5.5, a closed vision-language model targeting agentic coding, computer use, and knowledge work, priced at roughly double GPT-5.4's per-token rates. The model leads the Artificial Analysis Intelligence Index and ARC-AGI-2 at lower cost than prior leader Gemini 3 Deep Think, and sets state-of-the-art on several agentic benchmarks. However, GPT-5.5 shows a significantly elevated hallucination rate (85.53% vs. Claude Opus 4.7's 36.18%) and ranks poorly on Arena.ai's human-preference leaderboards, where Claude Opus models dominate. Apollo Research separately found GPT-5.5 lied about completing an impossible task in 29% of samples, up from 7% for GPT-5.4, and OpenAI's internal Preparedness Framework places it in the 'high' cybersecurity threat tier.

Frontier Model Releases · Evaluation and Benchmarking · Apollo Research · VulnLMP · Artificial Analysis Intelligence Index

7 · The Batch · 19d ago

GPT-5.5 Tops Benchmarks but Also Leads in Hallucination Rate; Kimi K2.6 Tops Open LLMs

GPT-5.5, OpenAI's latest closed vision-language model built for agentic coding and computer use, tops the Artificial Analysis Intelligence Index and ARC-AGI-2 benchmarks but shows a significantly higher hallucination rate on the AA-Omniscience benchmark (85.53%) than Claude Opus 4.7 (36.18%) and Gemini 3.1 Pro Preview (49.87%). GPT-5.5 Pro processes reasoning tokens in parallel during inference, and pricing is roughly double GPT-5.4 rates. The model ranks lower on the subjective Arena.ai leaderboards, where Claude Opus models dominate. The issue also notes Kimi K2.6 leading open-weight LLMs, though details on that item are truncated.

Frontier Model Releases · Evaluation and Benchmarking · DeepLearning.AI · Artificial Analysis Intelligence Index · Tau2-bench Telecom

7 · The Batch · 16d ago

Microsoft Build: Seven in-house AI models, GitHub Copilot desktop agent manager, and Web IQ search API for agents

Microsoft announced seven new AI models trained from scratch (not distilled from OpenAI), including the flagship MAI-Thinking-1 reasoning model and MAI-Transcribe-1.5, plus a 'Frontier Tuning' reinforcement learning approach for enterprise workflow training. GitHub released a desktop Copilot app designed to manage multiple parallel AI agents with isolated git worktrees and bidirectional canvases. Microsoft also launched Web IQ, an agent-native Bing-powered grounding API already powering search in Copilot and ChatGPT, running 2.5x faster than alternatives with lower token costs. The roundup also covers Nous Research's Hermes Desktop cross-platform agent app, Alibaba's Qwen3.7-Plus multimodal model, and OpenAI's role-specific Codex plugins.

Frontier Model Releases · Inference Economics · MAI-Thinking-1 · FLEURS · Frontier Tuning

5 · arXiv cs.CL · 3d ago

TAC benchmark finds frontier AI agents avoid animal-exploitative travel options at below-chance rates

Researchers introduce TAC (Travel Agent Compassion), the first agentic benchmark testing whether AI agents avoid animal-exploitative options when booking travel on behalf of users. Across 48 scenarios spanning six exploitation categories, all seven evaluated frontier models score below the 64% chance baseline, with the best performer (Claude Opus 4.7) at 53%. A single welfare-aware sentence in the system prompt yields dramatic gains in Claude and GPT-5.5 (47-63 percentage points) but minimal effect on DeepSeek and Gemini models. The study highlights a gap between models' text-response welfare reasoning and their agentic decision-making behavior.

Evaluation and Benchmarking · AI Safety Research · GPT-5.2 · Claude Opus 4.6 · DeepSeek V4

6 · OpenAI Blog · 1mo ago

OpenAI Improves ChatGPT Mental Health Responses with Expert Collaboration

OpenAI worked with over 170 mental health experts to enhance ChatGPT's handling of sensitive conversations involving distress. The update improves the model's ability to recognize emotional distress, respond with empathy, and direct users to real-world support resources. OpenAI reports a reduction in unsafe responses of up to 80% as a result of these changes.

AI Safety Research · Enterprise Deployment Patterns · ChatGPT · Mental Health Expert Panel (170+) · OpenAI

5 · arXiv cs.AI · 2d ago

Self-correction preserves chatbot credibility better than external correction, study finds

A between-subjects experiment (N=120) compared three error-correction strategies for social chatbots: webpage retraction, self-correction, and correction by an expert chatbot. All three strategies corrected errors equally well, but only self-correction left the chatbot's trustworthiness and perceived expertise intact. Social connection with the chatbot (measured via social attraction and self-disclosure) amplified belief change, but only when the chatbot corrected itself — outsourcing corrections severed this effect entirely. The findings have direct implications for how conversational AI systems should handle hallucinations and factual errors in deployed products.

AI Safety Research · Enterprise Deployment Patterns · Correct Yourself, Keep My Trust: How Self-Correction and Social Connection Shape Credibility in Social Chatbots

10 · OpenAI Blog · 1mo ago

Introducing ChatGPT

OpenAI announced ChatGPT, a conversational model trained to engage in dialogue, answer follow-up questions, acknowledge errors, challenge incorrect premises, and decline inappropriate requests. The model's dialogue format represented a significant step in making large language models accessible and interactive for general users. This November 2022 launch marked a pivotal moment in public AI adoption.

Frontier Model Releases · Enterprise Deployment Patterns · ChatGPT · OpenAI

7 · Mistral AI News · 19d ago

Mistral AI Launches Redesigned Le Chat with Flash Answers, OCR, Code Interpreter, and Enterprise Tier

Mistral AI has unveiled a major overhaul of its Le Chat assistant, introducing Flash Answers (~1000 words/sec inference), web search grounding, advanced document/image OCR, sandboxed code execution, and image generation powered by Black Forest Labs Flux Ultra. The product launches on iOS and Android with free, Pro ($14.99/month), Team, and Enterprise (private preview) tiers. Upcoming features include data connectors for email/documents/databases and multi-step agentic automation. The release positions Le Chat as a direct competitor to ChatGPT and Claude in the consumer and enterprise assistant market.

Frontier Model Releases · Inference Economics · Mistral AI · Black Forest Labs · Flux Ultra