FEVER

benchmark · active · provisional · fever-e2c8edad · 1 event · first seen 21h ago

Aliases: FEVER

Recent events (1)

5 · arXiv · cs.CL · 21h ago

SIFT and WSP: Claim-conditioned re-scoring to close the warrant gap in LLM fact-checking

A new arXiv preprint identifies a 'warrant gap' in LLM-based fact-checking systems: models frequently output Supports verdicts whose cited evidence does not actually entail the claim. The authors introduce SIFT, a claim-conditioned re-scoring method for extracted evidence spans, and WSP (Warranted Supports Proportion), an automatic NLI-based metric that checks whether cited warrants entail the claim. Evaluated on FEVER, SciFact, 5PILS, and DP across four open-source backbones, SIFT recovers up to 27.6 accuracy points lost by naive decomposition, while WSP calibrates against human gold evidence at AUC 0.92 and precision 0.98.
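To make the WSP idea concrete, here is a minimal sketch under one stated assumption: that WSP is the fraction of Supports verdicts whose cited evidence entails the claim. The summary above does not give the preprint's exact formula, so this reading may differ from the authors' definition. The names `Prediction`, `warranted_supports_proportion`, and `toy_entails` are hypothetical, and `toy_entails` is a crude token-overlap placeholder standing in for a real NLI model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Prediction:
    claim: str
    verdict: str              # e.g. "Supports", "Refutes", "NotEnoughInfo"
    evidence: Sequence[str]   # spans the model cited as its warrant


def warranted_supports_proportion(
    predictions: Sequence[Prediction],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of Supports verdicts whose cited evidence entails the claim.

    `entails(premise, hypothesis)` stands in for an NLI model's decision.
    Returning 0.0 when there are no Supports verdicts is an assumed convention.
    """
    supports = [p for p in predictions if p.verdict == "Supports"]
    if not supports:
        return 0.0
    warranted = sum(
        1 for p in supports if entails(" ".join(p.evidence), p.claim)
    )
    return warranted / len(supports)


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical placeholder: true if every hypothesis token is in the premise."""
    premise_tokens = set(premise.lower().split())
    return all(tok in premise_tokens for tok in hypothesis.lower().split())


# Illustrative, made-up predictions (not data from the preprint).
preds = [
    Prediction("paris is in france", "Supports",
               ["paris is the capital city in france"]),
    Prediction("berlin is in spain", "Supports",
               ["berlin is a large city"]),
    Prediction("rome is in peru", "Refutes",
               ["rome is in italy"]),
]

wsp = warranted_supports_proportion(preds, toy_entails)
print(wsp)
```

In this toy run only the first of the two Supports verdicts is backed by its cited evidence; the second illustrates the warrant gap, a Supports verdict whose evidence does not entail the claim. Swapping `toy_entails` for a real NLI classifier leaves the rest of the sketch unchanged.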