benchmark

ProofNet#

benchmark · active · provisional · proofnet--db0f85dc · 1 event · first seen 16h ago

Aliases: ProofNet#

Co-occurring entities

DeepSeek V4 · MiniF2F · Lean · The Signal-Coverage Matrix: Stratifying Type and Semantic Errors in Statement Autoformalization

More like this (12)

ProofNet-Test · PRNet · HypeNet · BitNet · BitNet b1.58 · ControlNet · BrushNet · GFNet · BIRDNet · OpenAI Partner Network · Project Numina · ProducerAI

Recent events (1)

4 · arXiv · cs.CL · 16h ago

Signal-Coverage Matrix proposes finer-grained evaluation of LLM autoformalization errors

A new arXiv preprint introduces the signal-coverage matrix, a 2×2 framework that crosses the Lean elaborator's pass/fail result with a semantic-equivalence judgment, decomposing autoformalization errors into four distinct cells instead of a single type-correctness scalar. The authors evaluate four methods (Vanilla, Lean-Retry, Sample-Filter, and Stratified Autoformalization) on ProofNet# and MiniF2F-test using DeepSeek V4-Pro. They find that headline type-correctness (TC%) gains mask flat recovery of semantic-only errors, and that symbolic and LLM judges diverge by 26–37 percentage points on outputs produced with elaborator feedback. The work argues that a TC% improvement should be credited according to which error cell moved, not by the aggregate scalar alone.
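As a rough illustration of the 2×2 idea described above, the sketch below tallies outcomes into four cells keyed by (elaborator pass, semantic equivalence) and derives TC% from them. It is not the preprint's code: the `Outcome` type, the function names, and the two sets of counts are hypothetical and chosen only to show how an aggregate gain can leave the semantic-only cell unchanged.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, NamedTuple


class Outcome(NamedTuple):
    """One autoformalized statement (hypothetical record layout)."""
    elaborates: bool   # did the Lean elaborator accept the statement?
    equivalent: bool   # did a judge deem it semantically equivalent?


Matrix = dict[tuple[bool, bool], int]


def signal_coverage_matrix(outcomes: Iterable[Outcome]) -> Matrix:
    """Count outcomes in each of the four (elaborates, equivalent) cells."""
    counts = Counter((o.elaborates, o.equivalent) for o in outcomes)
    return {(e, s): counts[(e, s)] for e in (True, False) for s in (True, False)}


def tc_percent(matrix: Matrix) -> float:
    """Share of statements that pass the elaborator, regardless of semantics."""
    total = sum(matrix.values())
    passed = matrix[(True, True)] + matrix[(True, False)]
    return 100.0 * passed / total if total else 0.0


def make(tt: int, tf: int, ft: int, ff: int) -> list[Outcome]:
    """Build a list of outcomes from per-cell counts (illustration only)."""
    return (
        [Outcome(True, True)] * tt
        + [Outcome(True, False)] * tf
        + [Outcome(False, True)] * ft
        + [Outcome(False, False)] * ff
    )


# Made-up counts, not results from the preprint.
baseline = signal_coverage_matrix(make(tt=5, tf=3, ft=1, ff=1))
retry = signal_coverage_matrix(make(tt=6, tf=3, ft=0, ff=1))

print(tc_percent(baseline), tc_percent(retry))        # aggregate scalar moves
print(baseline[(True, False)], retry[(True, False)])  # semantic-only cell
```

In these invented numbers TC% rises while the cell for statements that elaborate but are not equivalent stays the same, which is the kind of movement a per-cell report would expose and a single scalar would hide.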

Evaluation and Benchmarking · DeepSeek V4 · MiniF2F · Lean · +2 more