benchmark

FEVER

benchmark · active · provisional · fever-e2c8edad · 1 event · first seen 21h ago

Aliases: FEVER

Co-occurring entities

Warranted Supports Proportion · SIFT · 5PILS · SciFact

More like this (12)

FAME STORMS FID IFEval PEFT FACTOR BERT-F1 SPEAR FEniCS LEAF-X FAST Influcoder

Recent events (1)

5 · arXiv · cs.CL · 21h ago

SIFT and WSP: Claim-conditioned re-scoring to close the warrant gap in LLM fact-checking

A new arXiv preprint identifies a 'warrant gap' in LLM-based fact-checking systems: models frequently output Supports verdicts whose cited evidence does not actually entail the claim. The authors introduce SIFT, a claim-conditioned re-scoring method for extracted evidence spans, and WSP (Warranted Supports Proportion), an automatic NLI-based metric that checks whether cited warrants entail the claim. Evaluated on FEVER, SciFact, 5PILS, and DP across four open-source backbones, SIFT recovers up to 27.6 accuracy points lost by naive decomposition, while WSP calibrates against human gold evidence at AUC 0.92 and precision 0.98.
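The summary describes WSP only at a high level, so the following is a minimal sketch of how such a metric might be computed, not the authors' implementation. It assumes WSP is the fraction of Supports verdicts whose cited evidence entails the claim. The names `Prediction`, `warranted_supports_proportion`, and `toy_entails` are hypothetical, and the word-overlap check merely stands in for a real NLI model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Prediction:
    """One fact-checking output (hypothetical record layout)."""
    claim: str
    verdict: str              # e.g. "Supports", "Refutes", "NotEnoughInfo"
    evidence: Sequence[str]   # evidence spans the model cited as its warrant


def warranted_supports_proportion(
    predictions: Sequence[Prediction],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of Supports verdicts whose cited evidence entails the claim.

    `entails(premise, hypothesis)` is a pluggable NLI judgment. This
    definition is an assumption based on the summary's description.
    """
    supports = [p for p in predictions if p.verdict == "Supports"]
    if not supports:
        # No Supports verdicts: the proportion is undefined.
        return float("nan")
    warranted = sum(
        1 for p in supports if entails(" ".join(p.evidence), p.claim)
    )
    return warranted / len(supports)


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an NLI model: true when every hypothesis word
    appears in the premise. A real system would call an NLI classifier."""
    premise_words = set(premise.lower().split())
    return all(word in premise_words for word in hypothesis.lower().split())


if __name__ == "__main__":
    preds = [
        Prediction("paris is in france", "Supports",
                   ["paris is the capital city in france"]),
        Prediction("berlin is in spain", "Supports",
                   ["berlin is in germany"]),
        Prediction("rome is in norway", "Refutes",
                   ["rome is in italy"]),
    ]
    print(warranted_supports_proportion(preds, toy_entails))
```

Passing the entailment check in as a function keeps the metric independent of any particular NLI backbone, so the toy check can be swapped for a model-based one without changing the proportion logic.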

Evaluation and Benchmarking · AI Safety Research · Warranted Supports Proportion · SIFT · FEVER · +2 more