MMBench2 paper: hallucination in world models is predictable and preventable via coverage signals
Researchers introduce MMBench2, a 427-hour, 210-task dataset for visual world modeling, and train a 350M-parameter world model to study hallucination in generative world models. The paper identifies three distinct hallucination modes (perceptual, action-marginalized, scene-diverging) and develops lightweight signals that predict where models will fail. A coverage-aware sampling technique and curiosity-reward-based data collection enable efficient finetuning to unseen environments with as few as 50 real trajectories. The central finding is that world model hallucination is fundamentally a data coverage problem, with the same signals serving both detection and mitigation.
Related events (8)
PhantomBench: Large-scale benchmark reveals staggering hallucination rates on non-existent concepts
PhantomBench is a new benchmark comprising over 60,000 non-existent terms and entities derived from real concepts, designed to test whether language models can recognize the limits of their knowledge. Evaluating 21 models of various types and sizes, the authors find average hallucination rates as high as 86.7%, with even frontier models failing to abstain when inputs presuppose the existence of fabricated concepts. The benchmark also serves as a proxy for studying model behavior on rare real concepts, and includes a pipeline for scalable generation of custom non-existent concept sets.
BenHalluEval: Multi-Task Hallucination Evaluation Framework for Bengali LLMs
BenHalluEval introduces the first systematic hallucination benchmark for Bengali, covering four tasks (generative QA, code-mixed QA, summarization, reasoning) with 12,000 hallucinated candidates generated via GPT-5.4 across twelve hallucination types. Seven LLMs are evaluated under a dual-track protocol separating false-positive rate on ground-truth instances from hallucination detection rate on hallucinated candidates. The proposed BenHalluScore metric reveals substantial variation (7.72%–55.42%) across models and tasks, and chain-of-thought prompting is found to shift response distributions without consistently improving hallucination discrimination. The work highlights gaps in low-resource language hallucination evaluation and critiques single-track and prompting-only evaluation approaches.
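The dual-track protocol described above can be illustrated with a minimal sketch. It assumes each model judgment is reduced to a boolean ("flagged as hallucinated" or not); the function name and input layout are hypothetical, and the summary does not say how BenHalluScore combines the two tracks, so the sketch only reports the two rates separately.

```python
def dual_track_rates(flags_on_ground_truth, flags_on_hallucinated):
    """Return (false_positive_rate, detection_rate) for one model.

    Each argument is a list of booleans, True where the model labeled
    the instance as hallucinated. Hypothetical layout, not from the paper.
    """
    if not flags_on_ground_truth or not flags_on_hallucinated:
        raise ValueError("both tracks need at least one instance")
    # Track 1: correct answers wrongly flagged as hallucinations.
    false_positive_rate = sum(flags_on_ground_truth) / len(flags_on_ground_truth)
    # Track 2: hallucinated candidates correctly flagged.
    detection_rate = sum(flags_on_hallucinated) / len(flags_on_hallucinated)
    return false_positive_rate, detection_rate


fpr, hdr = dual_track_rates(
    [False, True, False, False],  # judgments on ground-truth instances
    [True, True, False, True],    # judgments on hallucinated candidates
)
```

Keeping the two tracks apart is what lets a model that flags everything be told apart from one that actually discriminates: the former scores a high detection rate but also a high false-positive rate.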
CHAIR: Supervised hallucination detection via internal logit analysis across LLM layers
A new arXiv preprint introduces CHAIR (Classifier of Hallucination As ImproveR), a supervised framework that detects hallucinations by extracting statistical features (max, min, mean, std, slope) from token logits across all layers of an LLM. Evaluated on TruthfulQA and MMLU, CHAIR shows improved detection accuracy especially in zero-shot settings. The authors argue the approach also points toward richer internal representations for designing adaptive decoding strategies that reduce hallucinations.
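As a rough illustration of the feature extraction named above, the sketch below computes max, min, mean, std, and slope from one token's logit as read at successive layers. The function name, the input layout (a plain list with one value per layer), the use of population standard deviation, and the least-squares slope over the layer index are all assumptions; the summary does not specify how CHAIR defines them.

```python
from statistics import fmean, pstdev


def layerwise_logit_features(layer_logits):
    """Summarize one token's logit trajectory across layers.

    layer_logits: list of floats, one per layer (hypothetical layout).
    """
    n = len(layer_logits)
    if n < 2:
        raise ValueError("need logits from at least two layers")
    mean_x = (n - 1) / 2  # mean of layer indices 0..n-1
    mean_y = fmean(layer_logits)
    # Least-squares slope of logit against layer index.
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(layer_logits))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return {
        "max": max(layer_logits),
        "min": min(layer_logits),
        "mean": mean_y,
        "std": pstdev(layer_logits),  # population std, an assumption
        "slope": cov / var,
    }


features = layerwise_logit_features([0.0, 1.0, 2.0, 3.0])
```

In a supervised setup of the kind described, vectors like this would be fed to a classifier trained on labeled hallucinated and non-hallucinated outputs.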
ClinHallu benchmark diagnoses stage-wise hallucinations in medical multimodal LLM reasoning
Researchers from Alibaba DAMO Academy introduce ClinHallu, a benchmark of 7,031 validated instances designed to identify where hallucinations originate within medical MLLM reasoning pipelines. Each instance is annotated with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration stages, with stage-replacement interventions to measure the causal impact of correcting each stage. The paper also demonstrates that trace-supervised fine-tuning reduces stage-wise hallucinations, offering both diagnostic and mitigation value for clinical AI systems.
LegalHalluLens: Typed hallucination auditing and calibrated multi-agent debate for legal AI
Researchers introduce LegalHalluLens, an auditing framework for hallucination in legal AI systems, evaluated across 510 contracts and 249,252 clause-level instances from the CUAD dataset. The framework introduces typed hallucination profiles across four claim categories (numeric, temporal, obligation/entitlement, factual) and a Risk Direction Index (RDI) that distinguishes omission from invention errors. A calibrated multi-agent debate pipeline reduces fabricated detections by 45% using a 4B-parameter model competitive with commercial APIs. The work reveals that aggregate hallucination rates (~52%) mask a 38-40 percentage-point gap between claim types and that two systems with identical aggregate rates can have opposite risk profiles.
Grad Detect: gradient-based hallucination detection using layer-wise backward pass signals
Grad Detect is a new method for detecting LLM hallucinations by analyzing layer-wise gradient patterns from a single forward-backward pass at inference time, rather than relying on output-level signals alone. Evaluated across Q&A benchmarks and eleven models from four architectural families, it consistently outperforms confidence-based and sampling-based baselines. A key finding is that the final five layers concentrate over 97% of the discriminative gradient signal, enabling efficient deployment. The method also supports model abstention prediction, which the authors use to frame it as a unified reliability framework.
The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models
Hugging Face has launched an open leaderboard specifically designed to benchmark hallucination rates across large language models. The effort aims to standardize evaluation of factual accuracy and confabulation tendencies, filling a gap in existing benchmarks that focus primarily on capability rather than reliability. The leaderboard is positioned as a community-driven, transparent resource for tracking model trustworthiness.
Why Language Models Hallucinate
OpenAI published research explaining the mechanisms behind language model hallucination. The work connects improved evaluation methods to enhanced AI reliability, honesty, and safety. The body is sparse on technical detail, but the framing positions this as foundational research relevant to alignment and deployment trust.
