UniEval

benchmark · active · provisional · unieval-4c325564 · 1 event · first seen 2d ago

Aliases: UniEval

Co-occurring entities

IFBench · G-Eval · SummEval · BINEVAL · QAGS

More like this (12)

ParaEval · ValueEval · SummEval · Every Eval Ever · CharacterEval · Uni-1 · HumanEval · T-Eval · Unily · olmo-eval · TweetEval · G-Eval

Recent events (1)

5 · arXiv · cs.CL · 2d ago

BINEVAL: Binary question decomposition for interpretable LLM evaluation and prompt optimization

Researchers introduce BINEVAL, a framework that decomposes LLM evaluation criteria into atomic binary yes/no questions, aggregating answers into multi-dimensional interpretable scores. The approach matches or outperforms baselines including UniEval and G-Eval on SummEval, Topical-Chat, and QAGS benchmarks, with particular strength on factual consistency. Beyond evaluation, the binary question feedback is shown to support iterative prompt optimization in both self-update and cross-model settings on IFBench. The framework is training-free and task-agnostic, addressing opacity and ceiling-effect problems common in holistic LLM judges.
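
The aggregation step described above can be illustrated with a minimal sketch. The summary does not state how BINEVAL combines answers, so the rule used here (the fraction of "yes" answers per dimension) is an assumption, and the function name, dimension names, and example answers are hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def aggregate_binary_answers(answers):
    """Turn (dimension, answer) pairs into one score per dimension.

    Assumption: a dimension's score is the fraction of its binary
    questions answered "yes" (True). The paper may aggregate differently.
    """
    yes_counts = defaultdict(int)
    totals = defaultdict(int)
    for dimension, answer in answers:
        totals[dimension] += 1
        yes_counts[dimension] += int(answer)
    return {d: yes_counts[d] / totals[d] for d in totals}


# Hypothetical judge answers for one generated summary.
example_answers = [
    ("consistency", True),
    ("consistency", True),
    ("consistency", False),
    ("consistency", True),
    ("fluency", True),
    ("fluency", False),
]

scores = aggregate_binary_answers(example_answers)
print(scores)
```

Because each score traces back to individual yes/no answers, a low value on one dimension can be inspected question by question, which is the interpretability property the summary attributes to the approach.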

Evaluation and Benchmarking · Alignment and RLHF · IFBench · G-Eval · SummEval