Almanac
benchmark

MATH-MCQA

benchmark · active · provisional · math-mcqa-c58b8f0b · 1 event · first seen 37h ago

Aliases: MATH-MCQA

Recent events (1)

6 · arXiv · cs.CL · 37h ago

Uncertainty-Based Decontamination (UBD) framework for removing benchmark contamination from LLMs

Researchers propose Uncertainty-Based Decontamination (UBD), a method that uses deep ensembles of a contaminated model to estimate per-sample memorization and correct for benchmark data contamination without requiring access to an uncontaminated reference model. The approach introduces a sample-level evaluation framework using distributional distance metrics alongside aggregate accuracy to better characterize decontamination quality. Experiments on MMLU-Pro and MATH-MCQA show UBD produces output distributions closer to uncontaminated baselines than paraphrasing or choice-permutation methods. The work addresses a significant validity concern in LLM evaluation, where contamination inflates reported benchmark performance.
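To illustrate why a sample-level distributional metric can add information beyond aggregate accuracy, here is a minimal sketch. It assumes each model yields a probability distribution over the answer choices for every question, and it uses Jensen-Shannon divergence as the distance; the summary does not say which distance metrics the paper uses, so that choice, the function names, and the toy numbers are all illustrative assumptions rather than the authors' implementation or results.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def accuracy(dists, labels):
    """Aggregate accuracy: fraction of samples whose argmax choice is correct."""
    preds = [max(range(len(d)), key=d.__getitem__) for d in dists]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def mean_js(dists_a, dists_b):
    """Sample-level distance: mean per-question JS divergence."""
    pairs = list(zip(dists_a, dists_b))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)

# Made-up toy data: two 4-choice questions (hypothetical, not from the paper).
labels = [0, 1]
baseline = [[0.40, 0.30, 0.20, 0.10], [0.10, 0.60, 0.20, 0.10]]  # uncontaminated reference
method_a = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]  # still overconfident
method_b = [[0.45, 0.25, 0.20, 0.10], [0.10, 0.55, 0.25, 0.10]]  # close to the reference

if __name__ == "__main__":
    for name, dists in [("method_a", method_a), ("method_b", method_b)]:
        print(name, accuracy(dists, labels), round(mean_js(dists, baseline), 4))
```

In this toy case both hypothetical methods have the same accuracy, but only the distance to the baseline reveals that one of them still produces the sharply peaked outputs one might associate with memorization.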