PhantomBench

benchmark · active · provisional · phantombench-4821bb80 · 1 event · first seen 7d ago

Aliases: PhantomBench

Recent events (1)

6 · arXiv · cs.CL · 7d ago

PhantomBench: Large-scale benchmark reveals staggering hallucination rates on non-existent concepts

PhantomBench is a new benchmark comprising over 60,000 non-existent terms and entities derived from real concepts, designed to test whether language models can recognize the limits of their knowledge. Evaluating 21 models of various types and sizes, the authors report average hallucination rates as high as 86.7%, with even frontier models failing to abstain when inputs presuppose the existence of fabricated concepts. The benchmark also serves as a proxy for studying model behavior on rare real concepts, and includes a pipeline for scalable generation of custom non-existent concept sets.
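
To make the headline metric concrete, here is a minimal Python sketch of how a hallucination rate over fabricated terms could be computed: the share of responses that do not abstain. This is an illustration only, not the authors' scoring method. The `Response` type, the `ABSTENTION_CUES` list, and the keyword-matching rule are all assumptions introduced here; the summary above does not say how abstention is judged.

```python
from dataclasses import dataclass

# Hypothetical abstention phrases. The benchmark's real judging procedure is
# not described in the summary, so this keyword list is only an assumption.
ABSTENTION_CUES = (
    "i'm not familiar",
    "i am not familiar",
    "i don't know",
    "i do not know",
    "does not appear to exist",
    "doesn't appear to exist",
    "could not find",
)


@dataclass
class Response:
    term: str    # fabricated term the prompt asked about
    answer: str  # the model's reply to that prompt


def abstained(answer: str) -> bool:
    """Return True if the reply contains one of the assumed abstention cues."""
    text = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(cue in text for cue in ABSTENTION_CUES)


def hallucination_rate(responses: list[Response]) -> float:
    """Fraction of replies that answer as if the fabricated term were real."""
    if not responses:
        raise ValueError("no responses to score")
    hallucinated = sum(1 for r in responses if not abstained(r.answer))
    return hallucinated / len(responses)


if __name__ == "__main__":
    # Terms and replies below are invented for illustration only.
    sample = [
        Response("term-a", "I'm not familiar with that term."),
        Response("term-b", "It is a protein first described in 1987."),
    ]
    print(f"hallucination rate: {hallucination_rate(sample):.1%}")
```

Keyword matching is a deliberately crude stand-in: a real evaluation would more likely rely on a judge model or human annotation, since a reply can hedge without using any fixed phrase.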