Entity · benchmark

BenHalluEval

benchmark · active · benhallueval-e306dd07 · 1 event · first seen Jun 1, 2026

Aliases: BenHalluEval

Co-occurring entities

BenHalluScore · chain-of-thought prompting · Bengali · GPT-5.5

More like this (12)

BenHalluScore · ClinHallu · LegalHalluLens · ParaEval · ValueEval · L-Eval · HumanEval · HypoEval · SummEval · DeepEval · CharacterEval · HalluTruthQA

Recent events (1)

4 · arXiv · cs.CL · Jun 1, 2026

BenHalluEval: Multi-Task Hallucination Evaluation Framework for Bengali LLMs

BenHalluEval introduces the first systematic hallucination benchmark for Bengali, covering four tasks (generative QA, code-mixed QA, summarization, reasoning) with 12,000 hallucinated candidates generated via GPT-5.4 across twelve hallucination types. Seven LLMs are evaluated under a dual-track protocol separating false-positive rate on ground-truth instances from hallucination detection rate on hallucinated candidates. The proposed BenHalluScore metric reveals substantial variation (7.72%–55.42%) across models and tasks, and chain-of-thought prompting is found to shift response distributions without consistently improving hallucination discrimination. The work highlights gaps in low-resource language hallucination evaluation and critiques single-track and prompting-only evaluation approaches.
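The dual-track idea in the summary can be illustrated with a minimal sketch. It assumes each model response is reduced to a binary "hallucinated / not hallucinated" verdict; the names `Judgment` and `dual_track_rates` are hypothetical and not taken from the paper. The sketch computes only the two per-track rates, because the summary does not say how BenHalluScore combines them.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Judgment:
    """One evaluated item (hypothetical structure, for illustration only)."""
    is_hallucinated: bool  # gold label: True for a hallucinated candidate
    flagged: bool          # model verdict: True if the model says "hallucinated"


def dual_track_rates(judgments: Iterable[Judgment]) -> Tuple[float, float]:
    """Return (false_positive_rate, hallucination_detection_rate).

    Track 1: share of ground-truth items wrongly flagged as hallucinated.
    Track 2: share of hallucinated candidates correctly flagged.
    """
    items = list(judgments)
    ground_truth = [j for j in items if not j.is_hallucinated]
    hallucinated = [j for j in items if j.is_hallucinated]
    if not ground_truth or not hallucinated:
        raise ValueError("both tracks need at least one item")

    fpr = sum(j.flagged for j in ground_truth) / len(ground_truth)
    hdr = sum(j.flagged for j in hallucinated) / len(hallucinated)
    return fpr, hdr


# Toy data: 4 ground-truth items (1 wrongly flagged) and
# 4 hallucinated candidates (3 correctly flagged).
sample = (
    [Judgment(False, True)] + [Judgment(False, False)] * 3
    + [Judgment(True, True)] * 3 + [Judgment(True, False)]
)
fpr, hdr = dual_track_rates(sample)

# A model that flags everything detects every hallucination
# but also rejects every correct answer.
flag_all = [Judgment(False, True)] * 4 + [Judgment(True, True)] * 4
flag_all_fpr, flag_all_hdr = dual_track_rates(flag_all)
```

On the toy data the false-positive rate is 0.25 and the detection rate is 0.75. The flag-everything model scores 1.0 on both tracks, which is the kind of case a detection-only, single-track evaluation would misread as perfect.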

Evaluation and Benchmarking · BenHalluScore · chain-of-thought prompting · Bengali