Almanac
benchmark

UniEval

benchmarkactiveprovisionalunieval-4c325564·1 events·first seen 2d ago

Aliases: UniEval

Co-occurring entities

More like this (12)

Recent events (1)

5arXiv · cs.CL·2d ago·source ↗

BINEVAL: Binary question decomposition for interpretable LLM evaluation and prompt optimization

Researchers introduce BINEVAL, a framework that decomposes LLM evaluation criteria into atomic binary yes/no questions, aggregating answers into multi-dimensional interpretable scores. The approach matches or outperforms baselines including UniEval and G-Eval on SummEval, Topical-Chat, and QAGS benchmarks, with particular strength on factual consistency. Beyond evaluation, the binary question feedback is shown to support iterative prompt optimization in both self-update and cross-model settings on IFBench. The framework is training-free and task-agnostic, addressing opacity and ceiling-effect problems common in holistic LLM judges.