6 · arXiv · cs.LG (Machine Learning) · 4d ago

MMBench2 paper: hallucination in world models is predictable and preventable via coverage signals

Researchers introduce MMBench2, a 427-hour, 210-task dataset for visual world modeling, and train a 350M-parameter world model to study hallucination in generative world models. The paper identifies three distinct hallucination modes (perceptual, action-marginalized, scene-diverging) and develops lightweight signals that predict where models will fail. A coverage-aware sampling technique and curiosity-reward-based data collection enable efficient finetuning to unseen environments with as few as 50 real trajectories. The central finding is that world model hallucination is fundamentally a data coverage problem, with the same signals serving both detection and mitigation.

Evaluation and Benchmarking Nicklas Hansen MMBench2 Hallucination in World Models is Predictable and Preventable

Related guides (1)

Evaluation and Benchmarking (topic guide)

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

6 · arXiv · cs.CL · 20d ago

PhantomBench: Large-scale benchmark reveals staggering hallucination rates on non-existent concepts

PhantomBench is a new benchmark comprising over 60,000 non-existent terms and entities derived from real concepts, designed to test whether language models can recognize the limits of their knowledge. Evaluating 21 models of various types and sizes, the authors find hallucination rates as high as 86.7% on average, with even frontier models failing to abstain when inputs presuppose the existence of fabricated concepts. The benchmark also serves as a proxy for studying model behavior on rare real concepts, and includes a pipeline for scalable generation of custom non-existent concept sets.

Evaluation and Benchmarking AI Safety Research PhantomBench

4 · arXiv · cs.CL · 29d ago

BenHalluEval: Multi-Task Hallucination Evaluation Framework for Bengali LLMs

BenHalluEval introduces the first systematic hallucination benchmark for Bengali, covering four tasks (generative QA, code-mixed QA, summarization, reasoning) with 12,000 hallucinated candidates generated via GPT-5.4 across twelve hallucination types. Seven LLMs are evaluated under a dual-track protocol separating false-positive rate on ground-truth instances from hallucination detection rate on hallucinated candidates. The proposed BenHalluScore metric reveals substantial variation (7.72%–55.42%) across models and tasks, and chain-of-thought prompting is found to shift response distributions without consistently improving hallucination discrimination. The work highlights gaps in low-resource language hallucination evaluation and critiques single-track and prompting-only evaluation approaches.

Evaluation and Benchmarking BenHalluScore chain-of-thought prompting Bengali +2 more
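
The dual-track protocol described in the BenHalluEval summary above can be illustrated with a minimal sketch. It assumes each model judgment is a boolean ("flagged as hallucinated"); the function name `dual_track_rates` and the input layout are hypothetical, and the summary does not give the BenHalluScore formula, so no combined score is computed here.

```python
def dual_track_rates(ground_truth_flags, hallucinated_flags):
    """Return (false_positive_rate, detection_rate) for one model.

    ground_truth_flags: booleans, True when the model wrongly flagged
        a ground-truth instance as hallucinated (track 1).
    hallucinated_flags: booleans, True when the model correctly flagged
        a hallucinated candidate (track 2).
    Both input formats are assumptions made for illustration.
    """
    if not ground_truth_flags or not hallucinated_flags:
        raise ValueError("both tracks need at least one instance")
    false_positive_rate = sum(ground_truth_flags) / len(ground_truth_flags)
    detection_rate = sum(hallucinated_flags) / len(hallucinated_flags)
    return false_positive_rate, detection_rate


# Toy data: 1 of 4 ground-truth items flagged, 3 of 4 hallucinated items caught.
fpr, detection = dual_track_rates(
    [True, False, False, False],
    [True, True, False, True],
)
```

Reporting the two rates separately keeps a model that flags everything from looking strong: it would reach a high detection rate only at the cost of a high false-positive rate.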

4 · arXiv · cs.CL · 19d ago

CHAIR: Supervised hallucination detection via internal logit analysis across LLM layers

A new arXiv preprint introduces CHAIR (Classifier of Hallucination As ImproveR), a supervised framework that detects hallucinations by extracting statistical features (max, min, mean, std, slope) from token logits across all layers of an LLM. Evaluated on TruthfulQA and MMLU, CHAIR shows improved detection accuracy especially in zero-shot settings. The authors argue the approach also points toward richer internal representations for designing adaptive decoding strategies that reduce hallucinations.

Evaluation and Benchmarking AI Safety Research TruthfulQA CHAIR MMLU
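
As a rough illustration of the feature extraction named in the CHAIR summary above, the sketch below reduces one token's logit values across layers to the five listed statistics. It is not the authors' implementation: the function name `layer_features`, the use of population standard deviation, and the least-squares slope against layer index are all assumptions, since the summary does not specify them.

```python
from statistics import mean, pstdev


def layer_features(logits_per_layer):
    """Summarize one token's logit across layers as max, min, mean, std, slope.

    logits_per_layer: list of floats, one value per layer (assumed layout).
    """
    n = len(logits_per_layer)
    if n < 2:
        raise ValueError("need at least two layers to fit a slope")
    x_bar = (n - 1) / 2  # mean of layer indices 0..n-1
    y_bar = mean(logits_per_layer)
    # Least-squares slope of logit value against layer index (assumption).
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(logits_per_layer))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return {
        "max": max(logits_per_layer),
        "min": min(logits_per_layer),
        "mean": y_bar,
        "std": pstdev(logits_per_layer),  # population std (assumption)
        "slope": num / den,
    }


# Toy input: a logit that grows by 1.0 per layer over four layers.
features = layer_features([0.0, 1.0, 2.0, 3.0])
```

In a supervised setup like the one the summary describes, vectors of this kind would be the inputs to a separate classifier trained on labeled hallucinated and non-hallucinated outputs.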

5 · arXiv · cs.AI · 15d ago

ClinHallu benchmark diagnoses stage-wise hallucinations in medical multimodal LLM reasoning

Researchers from Alibaba DAMO Academy introduce ClinHallu, a benchmark of 7,031 validated instances designed to identify where hallucinations originate within medical MLLM reasoning pipelines. Each instance is annotated with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration stages, with stage-replacement interventions to measure the causal impact of correcting each stage. The paper also demonstrates that trace-supervised fine-tuning reduces stage-wise hallucinations, offering both diagnostic and mitigation value for clinical AI systems.

Evaluation and Benchmarking AI Safety Research Alibaba DAMO Academy ClinHallu +1 more

6 · arXiv · cs.CL · 13d ago

LegalHalluLens: Typed hallucination auditing and calibrated multi-agent debate for legal AI

Researchers introduce LegalHalluLens, an auditing framework for hallucination in legal AI systems, evaluated across 510 contracts and 249,252 clause-level instances from the CUAD dataset. The framework defines typed hallucination profiles across four claim categories (numeric, temporal, obligation/entitlement, factual) and a Risk Direction Index (RDI) that distinguishes omission from invention errors. A calibrated multi-agent debate pipeline reduces fabricated detections by 45% using a 4B-parameter model competitive with commercial APIs. The work reveals that aggregate hallucination rates (~52%) mask a 38–40 percentage-point gap between claim types and that two systems with identical aggregate rates can have opposite risk profiles.

Evaluation and Benchmarking AI Safety Research LegalHalluLens CUAD Risk Direction Index +1 more

6 · arXiv · cs.AI · 6d ago

Grad Detect: gradient-based hallucination detection using layer-wise backward pass signals

Grad Detect is a new method for detecting LLM hallucinations by analyzing layer-wise gradient patterns from a single forward-backward pass at inference time, without relying on output-level signals alone. Evaluated across Q&A benchmarks and eleven models from four architectural families, it consistently outperforms confidence-based and sampling-based baselines. A key finding is that the final five layers concentrate over 97% of the discriminative gradient signal, enabling efficient deployment. The method also supports model abstention prediction, framing it as a unified reliability framework.

Evaluation and Benchmarking AI Safety Research Grad Detect

5 · Hugging Face Blog · 1mo ago

The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models

Hugging Face has launched an open leaderboard specifically designed to benchmark hallucination rates across large language models. The effort aims to standardize evaluation of factual accuracy and confabulation tendencies, filling a gap in existing benchmarks that focus primarily on capability rather than reliability. The leaderboard is positioned as a community-driven, transparent resource for tracking model trustworthiness.

Evaluation and Benchmarking AI Safety Research Hugging Face Hallucinations Leaderboard

5 · OpenAI Blog · 1mo ago

Why Language Models Hallucinate

OpenAI published research explaining the mechanisms behind language model hallucination. The work connects improved evaluation methods to enhanced AI reliability, honesty, and safety. The post itself is sparse on technical detail, but its framing positions this as foundational research relevant to alignment and deployment trust.

Evaluation and Benchmarking AI Safety Research hallucination (LLM) OpenAI +1 more