Entity · benchmark

CLEVER

benchmark · active · clever-bd8c9adb · 1 event · first seen May 25, 2026

Aliases: CLEVER

Co-occurring entities

isomorphism-based scoring · agentic proving · Claude Code · Lean 4 · Anthropic

More like this (12)

CRED · COGENT · BRIGHT · WISE · CRAM · CADE · CLIP · SIMPLER · QUIET · CHEAP · ECL · C4STYLI

Recent events (1)

7 · arXiv · cs.AI · May 25, 2026

Agentic Proving for Program Verification: Claude Code Achieves 98.1% on CLEVER Benchmark

Researchers evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation, achieving 98.1% end-to-end success on program generation and verification over self-consistent entries. The system generates valid specifications for 98.8% of problems and certifies implementations against ground-truth specifications for 87.5% of problems. The results reveal a growing mismatch between existing program verification benchmark difficulty and modern agentic prover capabilities, motivating calls for more rigorous evaluation methodologies. The findings support compiler-in-the-loop agentic paradigms as the current state-of-the-art for foundational program verification.
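To make the two reported tasks concrete, here is a minimal Lean 4 sketch of what "generating a valid specification" and "certifying an implementation against a ground-truth specification" could look like. It is a toy illustration, not an entry from CLEVER: the problem (doubling a number) and the names specTruth, specGen, double, spec_equiv, and double_correct are all hypothetical. It also assumes that isomorphism-based scoring means proving a generated specification equivalent to the ground-truth one, which the summary above does not spell out.

```lean
-- Toy illustration only; this problem is hypothetical and not drawn from CLEVER.

-- Ground-truth specification: the result `r` is the double of the input `n`.
def specTruth (n r : Nat) : Prop := r = n + n

-- A "generated" specification that states the same property differently.
def specGen (n r : Nat) : Prop := r = 2 * n

-- Specification task (assumed form): the two specifications agree
-- on every input/output pair.
theorem spec_equiv (n r : Nat) : specGen n r ↔ specTruth n r := by
  unfold specGen specTruth
  constructor <;> intro h <;> omega

-- Candidate implementation.
def double (n : Nat) : Nat := 2 * n

-- Implementation task: certify the implementation against the
-- ground-truth specification.
theorem double_correct (n : Nat) : specTruth n (double n) := by
  unfold specTruth double
  omega
```

In a compiler-in-the-loop setup, an agent would submit proofs like these to the Lean checker and revise them from the error messages until they are accepted. The sketch shows only the final accepted state, not that loop.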

Evaluation and Benchmarking · AI Safety Research · CLEVER · isomorphism-based scoring · agentic proving