benchmark
CLEVER
benchmark · active · provisional
clever-bd8c9adb · 1 event · first seen 22d ago
Aliases: CLEVER
Recent events (1)
Agentic Proving for Program Verification: Claude Code Achieves 98.1% on CLEVER Benchmark
Researchers evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation, achieving 98.1% end-to-end success on program generation and verification over self-consistent entries. The system generates valid specifications for 98.8% of problems and certifies implementations against ground-truth specifications for 87.5% of problems. The results reveal a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, which motivates calls for more rigorous evaluation methodologies. The findings support compiler-in-the-loop agentic paradigms as the current state of the art for foundational program verification.
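For readers unfamiliar with the term, the following is a minimal sketch of what a compiler-in-the-loop cycle could look like: a proposer drafts a candidate, a checker accepts it or returns diagnostics, and the diagnostics feed the next attempt. It is not the framework evaluated in the event above. The names `compiler_in_the_loop`, `toy_propose`, and `toy_check` are hypothetical, and the checker is a toy stand-in for a real Lean 4 compiler call.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces, assumed for illustration only.
# A checker returns (accepted, diagnostics); a proposer maps
# (task, previous diagnostics) to a new candidate text.
Checker = Callable[[str], Tuple[bool, str]]
Proposer = Callable[[str, str], str]


def compiler_in_the_loop(
    task: str,
    propose: Proposer,
    check: Checker,
    max_rounds: int = 5,
) -> Optional[str]:
    """Return the first candidate the checker accepts, or None if none is accepted."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = propose(task, feedback)
        accepted, feedback = check(candidate)
        if accepted:
            return candidate
    return None


def toy_check(candidate: str) -> Tuple[bool, str]:
    # Toy stand-in for a compiler: reject any candidate that still contains 'sorry'.
    if "sorry" in candidate:
        return False, "error: proof contains 'sorry'"
    return True, ""


def toy_propose(task: str, feedback: str) -> str:
    # Toy stand-in for a model: the first draft is a placeholder,
    # and any diagnostics trigger a revised draft.
    if not feedback:
        return f"{task} := by sorry"
    return f"{task} := by omega"


result = compiler_in_the_loop(
    "theorem t (x : Nat) : x + x = 2 * x", toy_propose, toy_check
)
print(result)
```

In this toy run the first draft is rejected and the second is accepted. With `max_rounds=1` the function returns `None`, which is how an unsolved problem would be counted under this sketch.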