benchmark
ProofNet#
benchmark · active · provisional
proofnet--db0f85dc · 1 event · first seen 16h ago
Aliases: ProofNet#
Recent events (1)
Signal-Coverage Matrix proposes finer-grained evaluation of LLM autoformalization errors
A new arXiv preprint introduces the signal-coverage matrix, a 2×2 framework that crosses Lean elaborator pass/fail with semantic-equivalence judgments, decomposing autoformalization errors into four distinct cells rather than a single type-correctness scalar (TC%). The authors evaluate four methods (Vanilla, Lean-Retry, Sample-Filter, and Stratified Autoformalization) on ProofNet# and MiniF2F-test using DeepSeek V4-Pro. They find that headline TC% gains mask flat recovery of semantic-only errors, and that symbolic and LLM judges diverge by 26–37 percentage points on elaborator-feedback outputs. The work argues that TC% improvements should be credited by which error cell moved, not by the aggregate scalar alone.
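To make the 2×2 idea concrete, here is a minimal Python sketch of how such a matrix could be tallied. It is not the preprint's implementation: the names `Outcome`, `signal_coverage_matrix`, and `type_correct_rate` are hypothetical, the cell keys are plain `(elaborates, equivalent)` pairs rather than the authors' labels, and the counts are made-up illustration data, not reported results.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Outcome(NamedTuple):
    """One autoformalization attempt (hypothetical record layout)."""
    elaborates: bool  # did the Lean elaborator accept the statement?
    equivalent: bool  # did a judge deem it semantically equivalent?


Cell = Tuple[bool, bool]


def signal_coverage_matrix(outcomes: Iterable[Outcome]) -> Dict[Cell, int]:
    """Count attempts in each of the four (elaborates, equivalent) cells."""
    counts = Counter((o.elaborates, o.equivalent) for o in outcomes)
    # Always return all four cells, including empty ones.
    return {(e, s): counts.get((e, s), 0)
            for e in (True, False) for s in (True, False)}


def type_correct_rate(matrix: Dict[Cell, int]) -> float:
    """Aggregate TC scalar: share of attempts that pass the elaborator."""
    total = sum(matrix.values())
    passed = matrix[(True, True)] + matrix[(True, False)]
    return passed / total if total else 0.0


# Illustrative, invented data: two methods over the same 10 problems.
baseline = ([Outcome(True, True)] * 4 + [Outcome(True, False)] * 2
            + [Outcome(False, True)] * 1 + [Outcome(False, False)] * 3)
retry = ([Outcome(True, True)] * 6 + [Outcome(True, False)] * 2
         + [Outcome(False, False)] * 2)

for name, run in (("baseline", baseline), ("retry", retry)):
    m = signal_coverage_matrix(run)
    print(name, "TC rate:", type_correct_rate(m),
          "semantic-only errors:", m[(True, False)])
```

In this toy data the TC rate rises between the two runs while the `(True, False)` cell (passes the elaborator but is not semantically equivalent) stays the same size, which is the kind of movement the aggregate scalar cannot show.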