5 · arXiv cs.CL (Computation and Language) · 21h ago

SIFT and WSP: Claim-conditioned re-scoring to close the warrant gap in LLM fact-checking

A new arXiv preprint identifies a 'warrant gap' in LLM-based fact-checking systems: models frequently output Supports verdicts whose cited evidence does not actually entail the claim. The authors introduce SIFT, a claim-conditioned re-scoring method for extracted evidence spans, and WSP (Warranted Supports Proportion), an automatic NLI-based metric that checks whether cited warrants entail the claim. Evaluated on FEVER, SciFact, 5PILS, and DP across four open-source backbones, SIFT recovers up to 27.6 accuracy points lost by naive decomposition, while WSP calibrates against human gold evidence at AUC 0.92 and precision 0.98.
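As a rough illustration of the WSP idea described above, the sketch below computes the share of Supports verdicts whose cited evidence is judged to entail the claim. It is a minimal sketch under stated assumptions, not the preprint's implementation: the `Verdict` record, the `toy_entails` token-containment check (a stand-in for a real NLI model), the example data, and the choice to return 0.0 when there are no Supports verdicts are all hypothetical.

```python
from typing import Callable, List, NamedTuple, Sequence


class Verdict(NamedTuple):
    """Hypothetical record for one fact-checking output."""
    claim: str
    label: str            # e.g. "Supports" or "Refutes"
    evidence: List[str]   # spans the model cited as its warrant


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for an NLI model (assumption): true only if every
    hypothesis token also appears in the premise."""
    premise_tokens = set(premise.lower().split())
    return all(tok in premise_tokens for tok in hypothesis.lower().split())


def warranted_supports_proportion(
    verdicts: Sequence[Verdict],
    entails: Callable[[str, str], bool] = toy_entails,
) -> float:
    """Fraction of Supports verdicts whose cited evidence entails the claim."""
    supports = [v for v in verdicts if v.label == "Supports"]
    if not supports:
        return 0.0  # assumption: report 0.0 when nothing is labeled Supports
    warranted = sum(
        1 for v in supports if entails(" ".join(v.evidence), v.claim)
    )
    return warranted / len(supports)


# Made-up examples: one warranted Supports, one unwarranted, one ignored.
verdicts = [
    Verdict("paris is in france", "Supports", ["paris is a city in france"]),
    Verdict("paris is the capital of germany", "Supports",
            ["paris is a city in france"]),
    Verdict("the moon is cheese", "Refutes", ["the moon is rock"]),
]

print(warranted_supports_proportion(verdicts))  # 0.5
```

In practice the `entails` argument would wrap an NLI classifier; keeping it pluggable means the proportion logic can be checked separately from the model.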

Evaluation and Benchmarking · AI Safety Research · Warranted Supports Proportion · SIFT · FEVER · 5PILS · SciFact

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · Hacker News · 27d ago

Disagreement among frontier LLMs on real-world fact-checks

A study examines how frontier large language models diverge in their responses to real-world fact-checking queries, surfacing systematic disagreements across models on factual claims. The work appears to benchmark multiple leading models against a set of verifiable facts, revealing inconsistencies that have implications for reliability and deployment. With 475 HN points and 333 comments, the piece has generated substantial community discussion. The findings are relevant to evaluation methodology, model calibration, and trust in AI-generated factual content.

Frontier Model Releases · Evaluation and Benchmarking · frontier LLMs · lenz.io · Hacker News

5 · arXiv · cs.CL · 26d ago

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

CommunityFact is a refreshable benchmark for misinformation detection containing 15,992 standalone claims across five languages and two domains, designed to address limitations of static benchmarks. The authors evaluate ten LLMs under varying inference-time conditions including chain-of-thought reasoning and web-search augmentation, finding that web access yields the largest performance gains. A key finding is that web-enabled LLMs' source-selection policies are systematically misaligned with sources that human Community Notes raters converge on, a gap addressable through retrieval expansion or pruning. The benchmark also proposes using Community Notes as a training signal for claim-conditioned source suggesters.

Evaluation and Benchmarking · Agent and Tool Ecosystem · large language models · Community Notes · CommunityFact

4 · arXiv · cs.CL · 45h ago

FACTOR: Risk-aware adaptive verification for factual long-form LLM generation

Researchers propose FACTOR (FACTuality-Oriented Risk-aware Verification), an inference-time framework that adapts verification effort based on claim-level hallucination risk rather than applying uniform verification to all claims. The system combines uncertainty estimation, adaptive language inference verification, and candidate re-ranking to focus resources on high-risk claims. Evaluated on the FactScore benchmark, FACTOR improves factuality while simultaneously reducing verification cost, with model-agnostic performance reported across ablation studies.

Evaluation and Benchmarking · AI Safety Research · FACTOR · FactScore

6 · arXiv · cs.LG · 26d ago

SoundnessBench: Benchmarking LLMs as Evaluators of ML Research Proposal Viability

SoundnessBench is a new benchmark of 1,099 machine-learning research proposals derived from ICLR submissions, labeled with reviewer soundness scores, designed to test whether LLMs can reliably distinguish methodologically sound research ideas from unsound ones. Evaluated across 12 frontier LLMs, the benchmark reveals a pervasive optimism bias: models systematically rate low-soundness proposals as sound under standard prompting, with aggressive prompting shifting errors from false positives to false negatives rather than eliminating them. Controls for data contamination, surface features, and human audit quality suggest the bias is not attributable to a single confounder. The authors conclude that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor, a critical bottleneck for autonomous AI research agents.

Evaluation and Benchmarking · AI Safety Research · ICLR · optimism bias · SoundnessBench

6 · arXiv · cs.CL · 7d ago

ProvenanceGuard: Source-aware factuality verification for MCP-based LLM agents

Researchers introduce ProvenanceGuard, a verifier that checks factual claims in MCP-grounded LLM agent answers against their specific source provenance rather than pooled evidence. The system decomposes answers into atomic claims, routes each to its attributed source via MCP trace metadata, and applies NLI plus token-alignment checks to detect 'cross-source conflation' — where a claim is supported somewhere but attributed to the wrong source. Evaluated on 281 medical-domain MCP-agent traces, it achieves block F1 of 0.802 and source accuracy of 0.858 on held-out data, and detects all injected attribution swaps in 50 controlled clinical probes. The work establishes source attribution as an independent factuality axis distinct from standard grounding checks.
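The distinction between an unsupported claim and a cross-source conflation can be sketched as follows. This is an illustrative sketch, not ProvenanceGuard's code: the `classify_claim` helper, the source identifiers, the made-up source texts, and the `toy_entails` token-containment check (standing in for the NLI and token-alignment checks mentioned above) are all hypothetical.

```python
from typing import Callable, Dict


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for an NLI check (assumption): true only if every
    hypothesis token also appears in the premise."""
    premise_tokens = set(premise.lower().split())
    return all(tok in premise_tokens for tok in hypothesis.lower().split())


def classify_claim(
    claim: str,
    attributed_source: str,
    sources: Dict[str, str],
    entails: Callable[[str, str], bool] = toy_entails,
) -> str:
    """Check a claim against its attributed source first, then the rest."""
    if entails(sources.get(attributed_source, ""), claim):
        return "supported"
    # Supported somewhere else: the content is grounded, the attribution is not.
    for source_id, text in sources.items():
        if source_id != attributed_source and entails(text, claim):
            return "cross-source conflation"
    return "unsupported"


# Made-up source texts keyed by hypothetical source identifiers.
sources = {
    "trial_a": "drug x lowers blood pressure",
    "trial_b": "drug y causes mild nausea",
}

print(classify_claim("drug y causes nausea", "trial_a", sources))
print(classify_claim("drug y causes nausea", "trial_b", sources))
print(classify_claim("drug z cures headaches", "trial_a", sources))
```

A pooled-evidence check would accept the first claim because some source supports it; checking the attributed source first is what exposes the attribution error.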

Evaluation and Benchmarking · AI Safety Research · ProvenanceGuard · ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents · Model Context Protocol

6 · arXiv · cs.CL · 39h ago

LLMs fail to reliably self-report adversarial prefill attacks, study finds

A new arXiv paper evaluates whether LLMs can recognize that their own prior responses were elicited by adversarial prefill attacks, testing ten open-weight models (3B–70B) across four safety benchmarks. Models claim intent on prefilled responses only 27.3% of the time on average, and introspective signal is largely mediated by refusal-related reasoning. Three LoRA fine-tuning methods (SFT, GRPO, DPO) improve the intention-probe gap but counterintuitively raise attack success rates on most models, suggesting partial and fragile mitigation. The findings raise concerns about the reliability of LLM self-reports in safety-critical contexts.

Evaluation and Benchmarking · AI Safety Research · GRPO · DPO · Can LLMs Reliably Self-Report Adversarial Prefills, and How?

6 · Google DeepMind Blog · 1mo ago

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models

DeepMind has released the FACTS Benchmark Suite, a systematic evaluation framework for measuring the factuality of large language models. The benchmark is designed to assess how accurately LLMs produce factually grounded outputs. This represents a structured contribution to the growing field of LLM evaluation, specifically targeting hallucination and factual reliability. The announcement comes from a Tier 1 lab, lending it credibility as a reference benchmark in the field.

Evaluation and Benchmarking · AI Safety Research · FACTS Benchmark Suite · Google DeepMind

7 · arXiv · cs.CL · 14d ago

Trustworthiness audit finds alignment regressions in reasoning models converted from instruction-tuned LLMs

A systematic study audits whether converting instruction-tuned LLMs into reasoning models via SFT, RL-based post-training, or distillation preserves alignment behaviors such as safe refusal, bias avoidance, and privacy protection. Across six trustworthiness dimensions, the authors find consistent alignment regressions—including increased toxicity, amplified stereotyping, miscalibrated refusal, and privacy leakage—even as reasoning benchmark scores improve. The regressions are quantified via KL divergence from the instruction-tuned baseline, suggesting behavioral drift is a systematic byproduct of reasoning post-training. The paper argues trustworthiness metrics should be reported alongside reasoning capability gains.

Evaluation and Benchmarking · AI Safety Research · Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models