ProofNet#

benchmark · active · provisional · proofnet--db0f85dc · 1 event · first seen 16h ago

Aliases: ProofNet#

Recent events (1)

4 · arXiv · cs.CL · 16h ago

Signal-Coverage Matrix proposes finer-grained evaluation of LLM autoformalization errors

A new arXiv preprint introduces the signal-coverage matrix, a 2×2 framework that crosses Lean elaborator pass/fail with semantic-equivalence judgments, decomposing autoformalization errors into four distinct cells instead of collapsing them into a single type-correctness rate (TC%). The authors evaluate four methods (Vanilla, Lean-Retry, Sample-Filter, and Stratified Autoformalization) on ProofNet# and MiniF2F-test using DeepSeek V4-Pro. They find that headline TC% gains mask flat recovery of semantic-only errors, and that symbolic and LLM judges diverge by 26–37 percentage points on elaborator-feedback outputs. The work argues that TC% improvements should be credited according to which error cell moved, not by the aggregate scalar alone.
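The 2×2 decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the preprint's implementation: the cell labels, the `Outcome` type, the function names, and the toy data are all hypothetical, and the numbers are invented purely to show how an aggregate TC% can rise while the count of semantically equivalent outputs stays flat.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Outcome(NamedTuple):
    """One autoformalization attempt (hypothetical record layout)."""
    elaborates: bool  # did the Lean elaborator accept the statement?
    equivalent: bool  # did a judge deem it equivalent to the informal source?


# Hypothetical cell labels; the summary does not give the preprint's own names.
CELLS = {
    (True, True): "pass_equivalent",
    (True, False): "pass_not_equivalent",   # semantic-only errors
    (False, True): "fail_equivalent",
    (False, False): "fail_not_equivalent",
}


def signal_coverage_matrix(outcomes: Iterable[Outcome]) -> Dict[str, int]:
    """Count attempts in each of the four (elaborator, semantics) cells."""
    counts = Counter(CELLS[(o.elaborates, o.equivalent)] for o in outcomes)
    return {name: counts.get(name, 0) for name in CELLS.values()}


def tc_percent(outcomes: Iterable[Outcome]) -> float:
    """Aggregate type-correctness rate: share of attempts that elaborate."""
    items = list(outcomes)
    if not items:
        return 0.0
    return 100.0 * sum(o.elaborates for o in items) / len(items)


# Toy data, invented for illustration only (not results from the preprint).
baseline = (
    [Outcome(True, True)] * 4
    + [Outcome(True, False)] * 2
    + [Outcome(False, False)] * 4
)
with_retry = (
    [Outcome(True, True)] * 4
    + [Outcome(True, False)] * 5
    + [Outcome(False, False)] * 1
)

for name, run in [("baseline", baseline), ("with_retry", with_retry)]:
    print(name, tc_percent(run), signal_coverage_matrix(run))
```

In this toy comparison the aggregate TC% goes up, but all of the movement is from the fail/not-equivalent cell into the pass/not-equivalent cell, which is the kind of shift a single scalar would hide.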