6OpenAI Blog·1mo ago

TruthfulQA: Measuring how models mimic human falsehoods

OpenAI introduced TruthfulQA, a benchmark designed to measure whether language models generate truthful answers or mimic common human misconceptions and falsehoods. The benchmark tests models on questions where humans frequently give wrong answers due to misconceptions, conspiracy theories, or false beliefs. Results showed that larger models were not necessarily more truthful, and in some cases performed worse, highlighting a key alignment challenge.

Evaluation and Benchmarking · AI Safety Research · Alignment and RLHF · TruthfulQA · Stephanie Lin · Jacob Hilton · Owain Evans · OpenAI

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Related events (8)

5OpenAI Blog·1mo ago

Introducing SimpleQA: OpenAI's Factuality Benchmark for Language Models

OpenAI has released SimpleQA, a benchmark designed to measure language model factuality on short, fact-seeking questions. The benchmark targets a specific and well-defined capability: answering direct factual queries accurately. It is intended to provide a clean signal on model truthfulness and calibration for this class of questions.

Evaluation and Benchmarking · AI Safety Research · SimpleQA · OpenAI

6Google DeepMind Blog·1mo ago

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models

DeepMind has released the FACTS Benchmark Suite, a systematic evaluation framework for measuring the factuality of large language models. The benchmark is designed to assess how accurately LLMs produce factually grounded outputs. This represents a structured contribution to the growing field of LLM evaluation, specifically targeting hallucination and factual reliability. The announcement comes from a Tier 1 lab, lending it credibility as a reference benchmark in the field.

Evaluation and Benchmarking · AI Safety Research · FACTS Benchmark Suite · Google DeepMind

6OpenAI Blog·1mo ago

How Confessions Can Keep Language Models Honest

OpenAI researchers are developing a training method called 'confessions' that teaches language models to explicitly admit when they have made mistakes or behaved undesirably. The approach aims to improve honesty, transparency, and user trust in model outputs. This represents an alignment-oriented intervention targeting self-reporting of model failures.

AI Safety Research · Alignment and RLHF · Confessions (training method) · OpenAI

5Hacker News·23d ago

Disagreement among frontier LLMs on real-world fact-checks

A study examines how frontier large language models diverge in their responses to real-world fact-checking queries, surfacing systematic disagreements across models on factual claims. The work appears to benchmark multiple leading models against a set of verifiable facts, revealing inconsistencies that have implications for reliability and deployment. With 475 HN points and 333 comments, the piece has generated substantial community discussion. The findings are relevant to evaluation methodology, model calibration, and trust in AI-generated factual content.

Frontier Model Releases · Evaluation and Benchmarking · frontier LLMs · lenz.io · Hacker News

5OpenAI Blog·1mo ago

Teaching Models to Express Their Uncertainty in Words

OpenAI published research on training language models to verbally express their own uncertainty rather than stating answers with uniform confidence. The work explores calibration of model outputs through natural language hedging, aiming to make models more honest about what they do and do not know. This is an early contribution to the broader alignment and calibration research agenda.

Evaluation and Benchmarking · Alignment and RLHF · Verbal Uncertainty Expression · Uncertainty Calibration · OpenAI

5arXiv · cs.CL·22d ago

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

CommunityFact is a refreshable benchmark for misinformation detection containing 15,992 standalone claims across five languages and two domains, designed to address limitations of static benchmarks. The authors evaluate ten LLMs under varying inference-time conditions, including chain-of-thought reasoning and web-search augmentation, and find that web access yields the largest performance gains. A key finding is that the source-selection policies of web-enabled LLMs are systematically misaligned with the sources that human Community Notes raters converge on, a gap the authors say can be addressed through retrieval expansion or pruning. The authors also propose using Community Notes as a training signal for claim-conditioned source suggesters.

Evaluation and Benchmarking · Agent and Tool Ecosystem · large language models · Community Notes · CommunityFact

5OpenAI Blog·1mo ago

Why Language Models Hallucinate

OpenAI published research explaining the mechanisms behind language model hallucination. The work connects improved evaluation methods to enhanced AI reliability, honesty, and safety. The body is sparse on technical detail, but the framing positions this as foundational research relevant to alignment and deployment trust.

Evaluation and Benchmarking · AI Safety Research · hallucination (LLM) · OpenAI

6arXiv · cs.CL·11d ago

ParaEval framework reduces MCQA benchmark sensitivity to answer phrasing

A new arXiv preprint identifies a systematic flaw in multiple-choice QA benchmarks: log-likelihood scoring conflates surface-form familiarity with actual capability, producing false performance gaps exceeding 2 points between models trained on identical knowledge. The authors propose ParaEval, which queries models with multiple paraphrases per answer option and scores on the most favorable phrasing, reducing the false gap to below 1 point. The effect is confirmed on frontier 70B and 120B open-source models, suggesting widespread benchmark inflation in standard MCQA evaluations.

Evaluation and Benchmarking · ParaEval
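
The ParaEval summary above describes scoring each answer option on its most favorable paraphrase. A minimal sketch of that idea follows. It is not the authors' implementation: the function names, the example question, and the log-likelihood values in `TOY_SCORES` are all hypothetical, and a real setup would replace `toy_log_likelihood` with a call to an actual model. The summary does not say whether scores are length-normalized, so this sketch compares raw values.

```python
from typing import Callable, Dict, List

# Hypothetical log-likelihoods standing in for a real model's scores.
# Keys are answer phrasings; values are made up for illustration only.
TOY_SCORES: Dict[str, float] = {
    "100 degrees Celsius": -9.0,   # correct answer, unfamiliar surface form
    "100 C": -4.0,                 # correct answer, familiar surface form
    "90 degrees Celsius": -6.0,
    "90 C": -5.0,
}


def toy_log_likelihood(question: str, answer: str) -> float:
    """Stand-in scorer (assumption): looks up a fixed value per phrasing."""
    return TOY_SCORES[answer]


def single_phrasing_choice(
    question: str,
    options: Dict[str, List[str]],
    score: Callable[[str, str], float],
) -> str:
    """Baseline: score only the first phrasing of each option."""
    return max(options, key=lambda label: score(question, options[label][0]))


def paraphrase_max_choice(
    question: str,
    options: Dict[str, List[str]],
    score: Callable[[str, str], float],
) -> str:
    """Score every paraphrase and keep each option's best phrasing."""
    best = {
        label: max(score(question, phrasing) for phrasing in phrasings)
        for label, phrasings in options.items()
    }
    return max(best, key=best.get)


question = "At what temperature does water boil at sea level?"
options = {
    "A": ["100 degrees Celsius", "100 C"],
    "B": ["90 degrees Celsius", "90 C"],
}

baseline = single_phrasing_choice(question, options, toy_log_likelihood)
robust = paraphrase_max_choice(question, options, toy_log_likelihood)
print(baseline, robust)
```

With these made-up scores, the single-phrasing baseline is swayed by the unfamiliar wording of the correct option, while taking the maximum over paraphrases removes that surface-form penalty.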