Entity · benchmark

PhantomBench

benchmark · active · phantombench-4821bb80 · 1 event · first seen Jun 10, 2026

Aliases: PhantomBench

More like this (12)

SorryBench PseudoBench FeatBench RepoBench CharacterBench SelectBench TriggerBench SupraBench MemBench AdvBench OmniaBench MissionBench

Recent events (1)

arXiv · cs.CL · Jun 10, 2026

PhantomBench: Large-scale benchmark reveals staggering hallucination rates on non-existent concepts

PhantomBench is a new benchmark of more than 60,000 non-existent terms and entities derived from real concepts, designed to test whether language models recognize the limits of their knowledge. Across 21 models of various types and sizes, the authors report average hallucination rates as high as 86.7%, with even frontier models failing to abstain when inputs presuppose that fabricated concepts exist. The benchmark also serves as a proxy for studying model behavior on rare real concepts and includes a pipeline for scalable generation of custom non-existent concept sets.
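To make the headline metric concrete, here is a minimal Python sketch of how a hallucination rate over fabricated-concept prompts might be computed. It is an illustration only: the summary does not describe the authors' scoring method, so the `Response` type, the `ABSTAIN_MARKERS` phrases, and the rule "any non-abstaining answer counts as a hallucination" are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical abstention phrases; the source does not say how abstention is detected.
ABSTAIN_MARKERS = (
    "i don't know",
    "not familiar with",
    "does not exist",
    "no record of",
)


@dataclass
class Response:
    term: str    # fabricated term the prompt presupposes
    answer: str  # raw model output for that prompt


def abstained(answer: str) -> bool:
    """Return True if the answer contains any assumed abstention phrase."""
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def hallucination_rate(responses: list[Response]) -> float:
    """Fraction of responses that do not abstain (assumed scoring rule)."""
    if not responses:
        raise ValueError("no responses to score")
    hallucinated = sum(1 for r in responses if not abstained(r.answer))
    return hallucinated / len(responses)
```

Under these assumptions, a model that answers confidently on three of four fabricated terms and declines on one would score 0.75. A real evaluation would likely need a more robust abstention judge than substring matching.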

Evaluation and Benchmarking · AI Safety Research · PhantomBench