4 · arXiv cs.CL (Computation and Language) · 41h ago

Empirical taxonomy of factual errors in human-written text reveals LLM detection gaps

A new arXiv paper introduces a taxonomy of factual error types in human-written text, derived from an analysis of newspaper article corrections. It identifies categories, such as kanji misconversions and numeral classifier errors, that are absent from existing hallucination benchmarks. The authors evaluate several LLMs on Factual Error Detection (FED) tasks using both synthetic and real correction data. Even high-performance models like GPT-5.4 achieve only ~52% word-level F1 on synthetic data, underscoring that detecting human-induced factual errors remains difficult and is a separate problem from detecting LLM hallucinations. The work highlights a subproblem that factual accuracy research has neglected as the field shifted its focus toward LLM-generated hallucinations.
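For readers unfamiliar with the metric, the following is a minimal sketch of how a word-level F1 score for error detection could be computed. It assumes that both the gold annotation and the model output are sets of word positions flagged as erroneous; the summary does not give the paper's exact definition, so the function name `word_level_f1` and the example positions are hypothetical.

```python
def word_level_f1(predicted, gold):
    """F1 over sets of word positions flagged as erroneous.

    Assumption: errors are represented as word indices; the paper's
    own tokenization and matching rules may differ.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing flagged
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)  # flagged words that are real errors
    recall = true_pos / len(gold)          # real errors that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: three erroneous words, the model flags two
# positions and only one of them is correct.
gold_positions = {3, 4, 9}
predicted_positions = {3, 7}
score = word_level_f1(predicted_positions, gold_positions)
print(round(score, 3))
```

Under this definition a score near 0.5 means that a model either misses many erroneous words, flags many correct ones, or both.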

Evaluation and Benchmarking · AI Safety Research · An Empirical Analysis of Factual Errors in Human-Written Text and its Application · OpenAI · GPT-5.5

Related guides (4)

OpenAI

OpenAI: The Lab That Made AI a Household Name

AI Safety ResearchTopic guide

AI Safety Research: From Lab Principles to Geopolitical Flashpoint

GPT-5.5

GPT-5.5: OpenAI's Benchmark Leader with a Hallucination Caveat

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: The Measurement Crisis at the Frontier

Related events (8)

5 · Hacker News · 1mo ago

Disagreement among frontier LLMs on real-world fact-checks

A study examines how frontier large language models diverge in their responses to real-world fact-checking queries, surfacing systematic disagreements across models on factual claims. The work appears to benchmark multiple leading models against a set of verifiable facts, revealing inconsistencies that have implications for reliability and deployment. With 475 HN points and 333 comments, the piece has generated substantial community discussion. The findings are relevant to evaluation methodology, model calibration, and trust in AI-generated factual content.

Frontier Model Releases · Evaluation and Benchmarking · frontier LLMs · lenz.io · Hacker News

6 · Google DeepMind Blog · 1mo ago

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models

DeepMind has released the FACTS Benchmark Suite, a systematic evaluation framework for measuring the factuality of large language models. The benchmark is designed to assess how accurately LLMs produce factually grounded outputs. This represents a structured contribution to the growing field of LLM evaluation, specifically targeting hallucination and factual reliability. The announcement comes from a Tier 1 lab, lending it credibility as a reference benchmark in the field.

Evaluation and Benchmarking · AI Safety Research · FACTS Benchmark Suite · Google DeepMind

4 · Hugging Face Blog · 1mo ago

How good are LLMs at fixing their mistakes? A chatbot arena experiment with Keras and TPUs

A Hugging Face blog post describes a chatbot arena experiment evaluating LLMs' ability to self-correct errors, using Keras and TPUs as the infrastructure backbone. The experiment appears to use a head-to-head arena format to assess self-correction capabilities across models. This touches on both evaluation methodology and a core capability question about whether LLMs can reliably identify and fix their own mistakes.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Chatbot Arena · Keras · TPU

6 · arXiv · cs.AI · 18d ago

Study finds shared pattern-matching mechanisms underlie both human and LLM everyday reasoning errors

A new arXiv paper evaluates human participants and 25 LLMs on commonsense causal reasoning tasks, finding similar error patterns in both groups. The authors identify specific attention heads driving LLM responses that implement pattern-matching, and show these heads can predict human reasoning errors caused by superficially irrelevant prompt details. The findings challenge the common assumption that human reasoning relies on principled abstract world models while LLMs merely pattern-match, suggesting both may share a more unified cognitive mechanism.

Evaluation and Benchmarking · AI Safety Research · Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

6 · arXiv · cs.AI · 20d ago

Paper challenges LLM expert-level claims by measuring variance and error magnitude in code-based data analysis tasks

A new arXiv paper argues that standard LLM benchmarks overstate model capabilities by focusing on average performance on training-data-adjacent tasks while ignoring response variance and error magnitude. The authors introduce a novel benchmark requiring frontier LLMs to write code for data analysis tasks, comparing results against human expert submissions. Human experts outperformed the frontier LLM on average across multiple metrics and showed lower performance variability. The findings challenge the prevailing narrative that LLMs perform at human-expert level on knowledge economy tasks.

Frontier Model Releases · Evaluation and Benchmarking · Flaws in the LLM Automation Narrative

6 · arXiv · cs.CL · 7d ago

LLMs fail to reliably self-report adversarial prefill attacks, study finds

A new arXiv paper evaluates whether LLMs can recognize that their own prior responses were elicited by adversarial prefill attacks, testing ten open-weight models (3B–70B) across four safety benchmarks. Models claim intent on prefilled responses only 27.3% of the time on average, and introspective signal is largely mediated by refusal-related reasoning. Three LoRA fine-tuning methods (SFT, GRPO, DPO) improve the intention-probe gap but counterintuitively raise attack success rates on most models, suggesting partial and fragile mitigation. The findings raise concerns about the reliability of LLM self-reports in safety-critical contexts.

Evaluation and Benchmarking · AI Safety Research · GRPO · DPO · Can LLMs Reliably Self-Report Adversarial Prefills, and How?

4 · arXiv · cs.CL · 7d ago

FACTOR: Risk-aware adaptive verification for factual long-form LLM generation

Researchers propose FACTOR (FACTuality-Oriented Risk-aware Verification), an inference-time framework that adapts verification effort based on claim-level hallucination risk rather than applying uniform verification to all claims. The system combines uncertainty estimation, adaptive language inference verification, and candidate re-ranking to focus resources on high-risk claims. Evaluated on the FactScore benchmark, FACTOR improves factuality while simultaneously reducing verification cost, with model-agnostic performance reported across ablation studies.
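The idea of spending verification effort in proportion to risk can be illustrated with a small sketch. This is not FACTOR's implementation: the function `plan_verification`, the three effort tiers, the thresholds, and the risk scores are all hypothetical, and the real system's uncertainty estimation, inference-based verification, and re-ranking steps are not modeled here.

```python
def plan_verification(claim_risks, low=0.3, high=0.7):
    """Assign a verification tier to each claim from its risk score.

    Assumption: risk scores lie in [0, 1] and come from some upstream
    uncertainty estimator, which is not shown.
    """
    plan = {}
    for claim, risk in claim_risks.items():
        if risk >= high:
            plan[claim] = "full"    # most expensive check
        elif risk >= low:
            plan[claim] = "light"   # cheaper check
        else:
            plan[claim] = "skip"    # accept without verification
    return plan


# Hypothetical claims and risk scores, for illustration only.
risks = {
    "born in 1912": 0.85,
    "studied mathematics": 0.45,
    "was a person": 0.05,
}
plan = plan_verification(risks)
print(plan)
```

Compared with verifying every claim at the "full" tier, a plan like this spends the expensive check only where the estimated risk is high, which is the cost saving the summary describes.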

Evaluation and Benchmarking · AI Safety Research · FACTOR · FactScore

7 · arXiv · cs.CL · 1mo ago

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

This paper establishes a quantitative scaling law linking LLM factual recall to both model parameter count and topic frequency in training data, evaluated across 38 models on 8,900+ scholarly references. Recall quality follows a sigmoid function in the log-linear combination of these two variables, explaining 60% of variance across 16 dense models from four families and 74-94% within individual families. The authors propose a superposition-inspired mechanism where recall is gated by a signal-to-noise ratio: concept frequency provides signal and model capacity sets the noise floor. This provides a predictive framework for understanding and anticipating LLM confabulation patterns.
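The functional form described above, a sigmoid applied to a log-linear combination of model size and topic frequency, can be sketched as follows. The weights `w_size`, `w_freq`, and `bias` are placeholders chosen for illustration; the summary does not report the fitted coefficients, and the use of base-10 logarithms is also an assumption.

```python
import math


def predicted_recall(num_params, topic_frequency, w_size, w_freq, bias):
    """Sigmoid of a log-linear combination of model size and topic frequency.

    Assumption: all three coefficients are hypothetical placeholders,
    not values from the paper.
    """
    z = (w_size * math.log10(num_params)
         + w_freq * math.log10(topic_frequency)
         + bias)
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder coefficients, for illustration only.
coeffs = dict(w_size=1.0, w_freq=1.0, bias=-12.0)

small_rare = predicted_recall(1e9, 1e2, **coeffs)     # smaller model, rarer topic
small_common = predicted_recall(1e9, 1e4, **coeffs)   # same model, more frequent topic
large_rare = predicted_recall(7e10, 1e2, **coeffs)    # larger model, rarer topic
print(small_rare, small_common, large_rare)
```

With positive weights, predicted recall rises with either more parameters or a more frequent topic and saturates toward 1, which is the qualitative shape the summary attributes to the scaling law.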

Frontier Model Releases · Evaluation and Benchmarking · Automated Reference Verification System · Factual Recall Scaling Law · Superposition Model (neural networks)