5Hugging Face Blog·1mo ago

The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models

Hugging Face has launched an open leaderboard specifically designed to benchmark hallucination rates across large language models. The effort aims to standardize evaluation of factual accuracy and confabulation tendencies, filling a gap in existing benchmarks that focus primarily on capability rather than reliability. The leaderboard is positioned as a community-driven, transparent resource for tracking model trustworthiness.

Related events (8)

3Hugging Face Blog·1mo ago

Guide to Setting Up a Hugging Face Leaderboard: Vectara Hallucination Leaderboard as Example

This Hugging Face blog post provides an end-to-end tutorial on creating custom leaderboards on the Hugging Face platform, using Vectara's hallucination leaderboard as a concrete example. It covers the technical setup process for hosting evaluation leaderboards, which are increasingly important infrastructure for tracking model capabilities. The post bridges tooling and evaluation concerns by showing how third-party organizations can publish standardized benchmarks on HF.

4arXiv · cs.CL·20d ago

BenHalluEval: Multi-Task Hallucination Evaluation Framework for Bengali LLMs

BenHalluEval introduces the first systematic hallucination benchmark for Bengali, covering four tasks (generative QA, code-mixed QA, summarization, reasoning) with 12,000 hallucinated candidates generated via GPT-5.4 across twelve hallucination types. Seven LLMs are evaluated under a dual-track protocol separating false-positive rate on ground-truth instances from hallucination detection rate on hallucinated candidates. The proposed BenHalluScore metric reveals substantial variation (7.72%–55.42%) across models and tasks, and chain-of-thought prompting is found to shift response distributions without consistently improving hallucination discrimination. The work highlights gaps in low-resource language hallucination evaluation and critiques single-track and prompting-only evaluation approaches.
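
The dual-track idea described above can be illustrated with a minimal sketch. It assumes each model judgment is reduced to a boolean "flagged as hallucinated"; the function name and input shape are hypothetical, and the paper's BenHalluScore formula is not given in this summary, so it is not reproduced here.

```python
def dual_track_rates(flags_on_ground_truth, flags_on_hallucinated):
    """Return (false_positive_rate, detection_rate) for one model.

    flags_on_ground_truth: booleans, True if the model flagged a
        ground-truth (correct) instance as hallucinated.
    flags_on_hallucinated: booleans, True if the model flagged a
        hallucinated candidate as hallucinated.
    Both inputs are illustrative assumptions, not the paper's data format.
    """
    if not flags_on_ground_truth or not flags_on_hallucinated:
        raise ValueError("both tracks need at least one instance")
    # Track 1: how often correct answers are wrongly flagged.
    false_positive_rate = sum(flags_on_ground_truth) / len(flags_on_ground_truth)
    # Track 2: how often hallucinated candidates are correctly flagged.
    detection_rate = sum(flags_on_hallucinated) / len(flags_on_hallucinated)
    return false_positive_rate, detection_rate


fpr, hdr = dual_track_rates(
    [False, True, False, False],  # one of four correct answers wrongly flagged
    [True, True, False, True],    # three of four hallucinations caught
)
```

Reporting the two rates separately shows why a single-track score can mislead: a model that flags everything would reach a perfect detection rate while also having the worst possible false-positive rate.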

5OpenAI Blog·1mo ago

Why Language Models Hallucinate

OpenAI published research explaining the mechanisms behind language model hallucination. The work connects improved evaluation methods to enhanced AI reliability, honesty, and safety. The post itself is sparse on technical detail, but its framing positions the work as foundational research relevant to alignment and deployment trust.

4arXiv · cs.CL·10d ago

CHAIR: Supervised hallucination detection via internal logit analysis across LLM layers

A new arXiv preprint introduces CHAIR (Classifier of Hallucination As ImproveR), a supervised framework that detects hallucinations by extracting statistical features (max, min, mean, std, slope) from token logits across all layers of an LLM. Evaluated on TruthfulQA and MMLU, CHAIR shows improved detection accuracy especially in zero-shot settings. The authors argue the approach also points toward richer internal representations for designing adaptive decoding strategies that reduce hallucinations.
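
As a rough illustration of the feature extraction described above, the sketch below summarizes one token's logit across layers with the five listed statistics. It is not the authors' implementation: the function name, the choice of one logit value per layer, the use of population standard deviation, and the least-squares slope over layer index are all assumptions made for this example.

```python
from statistics import mean, pstdev


def layer_logit_features(logits_by_layer):
    """Summarize one token's logit trajectory across layers.

    logits_by_layer: list of floats, one value per layer (assumed input shape).
    Returns a dict with max, min, mean, std, and slope.
    """
    n = len(logits_by_layer)
    if n < 2:
        raise ValueError("need logits from at least two layers")
    x_bar = (n - 1) / 2  # mean of layer indices 0..n-1
    y_bar = mean(logits_by_layer)
    # Ordinary least-squares slope of logit value against layer index.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(logits_by_layer))
    denominator = sum((x - x_bar) ** 2 for x in range(n))
    return {
        "max": max(logits_by_layer),
        "min": min(logits_by_layer),
        "mean": y_bar,
        "std": pstdev(logits_by_layer),  # population std, an assumption
        "slope": numerator / denominator,
    }


features = layer_logit_features([0.0, 1.0, 2.0, 3.0])
```

Feature vectors of this kind, one per token, could then be fed to any supervised classifier; which classifier CHAIR uses is not stated in this summary.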

5arXiv · cs.AI·6d ago

ClinHallu benchmark diagnoses stage-wise hallucinations in medical multimodal LLM reasoning

Researchers from Alibaba DAMO Academy introduce ClinHallu, a benchmark of 7,031 validated instances designed to identify where hallucinations originate within medical MLLM reasoning pipelines. Each instance is annotated with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration stages, with stage-replacement interventions to measure the causal impact of correcting each stage. The paper also demonstrates that trace-supervised fine-tuning reduces stage-wise hallucinations, offering both diagnostic and mitigation value for clinical AI systems.

6arXiv · cs.CL·11d ago

PhantomBench: Large-scale benchmark reveals staggering hallucination rates on non-existent concepts

PhantomBench is a new benchmark comprising over 60,000 non-existent terms and entities derived from real concepts, designed to test whether language models can recognize the limits of their knowledge. Evaluating 21 models of various types and sizes, the authors find average hallucination rates as high as 86.7%, with even frontier models failing to abstain when inputs presuppose the existence of fabricated concepts. The benchmark also serves as a proxy for studying model behavior on rare real concepts, and includes a pipeline for scalable generation of custom non-existent concept sets.

4Hugging Face Blog·1mo ago

Introducing the Open Leaderboard for Hebrew LLMs

Hugging Face has launched an open leaderboard dedicated to evaluating large language models on Hebrew language tasks. The leaderboard aims to benchmark multilingual and Hebrew-specific models across standardized tasks to track progress in Hebrew NLP. This fills a gap in non-English language evaluation infrastructure.

5Hugging Face Blog·1mo ago

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.