5arXiv cs.AI (Artificial Intelligence)·2d ago

AxDafny: Agentic verified code generation framework achieves 92.7% on DafnyBench

Researchers introduce AxDafny, a verifier-guided agentic repair framework for generating formally verified Dafny code, including implementations, invariants, assertions, and termination arguments. The system achieves 92.7% verification success on DafnyBench, outperforming the strongest prior proof-hint baseline by 6.5 percentage points. The authors also release LCB-Pro-Dafny, a new benchmark of 250 competition-style problems translated into Dafny with formal specifications. The paper additionally finds that verification success and runtime test performance capture distinct dimensions of code quality.

Evaluation and Benchmarking Agent and Tool Ecosystem AxDafny LiveCodeBench-Pro-Dafny DafnyBench GPT-5.5

Related guides (3)

GPT-5.5

GPT-5.5: OpenAI's Benchmark Leader with a Hallucination Caveat

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Related events (8)

7arXiv · cs.AI·1mo ago·source ↗

Agentic Proving for Program Verification: Claude Code Achieves 98.1% on CLEVER Benchmark

Researchers evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation, achieving 98.1% end-to-end success on program generation and verification over self-consistent entries. The system generates valid specifications for 98.8% of problems and certifies implementations against ground-truth specifications for 87.5% of problems. The results reveal a growing mismatch between existing program verification benchmark difficulty and modern agentic prover capabilities, motivating calls for more rigorous evaluation methodologies. The findings support compiler-in-the-loop agentic paradigms as the current state-of-the-art for foundational program verification.

Evaluation and Benchmarking AI Safety Research CLEVER isomorphism-based scoring agentic proving +4 more

8arXiv · cs.AI·28d ago·source ↗

Goedel-Architect achieves state-of-the-art formal theorem proving with blueprint-based agentic framework

Goedel-Architect is an agentic framework for formal theorem proving in Lean 4 that uses blueprint generation — a dependency graph of definitions and lemmas — rather than recursive decomposition, enabling parallel lemma closure and global refinement. Built on DeepSeek-V4-Flash (284B-A13B), it achieves 99.2% pass@1 on MiniF2F-test and 75.6% on PutnamBench, scaling to 100% on MiniF2F, 88.8% on PutnamBench, and 4/6 on IMO 2025 when seeded with natural-language proofs. The authors claim state-of-the-art performance for an open-source pipeline at up to 500x lower cost than comparable systems.

Frontier Model Releases Evaluation and Benchmarking MiniF2F DeepSeek-V4-Flash Goedel-Architect +3 more

4arXiv · cs.CL·10d ago·source ↗

FACTOR: Risk-aware adaptive verification for factual long-form LLM generation

Researchers propose FACTOR (FACTuality-Oriented Risk-aware Verification), an inference-time framework that adapts verification effort based on claim-level hallucination risk rather than applying uniform verification to all claims. The system combines uncertainty estimation, adaptive language inference verification, and candidate re-ranking to focus resources on high-risk claims. Evaluated on the FactScore benchmark, FACTOR improves factuality while simultaneously reducing verification cost, with model-agnostic performance reported across ablation studies.

Evaluation and Benchmarking AI Safety Research FACTOR FactScore

5arXiv · cs.AI·14d ago·source ↗

Distributionally robust optimization framework for probabilistic runtime verification of AI agents

A new arXiv preprint introduces a sound and efficient framework for verifying probabilistic security policies for AI agents operating in complex digital environments, addressing limitations of prior Datalog-based approaches that assumed deterministic policies or predicate independence. The method uses distributionally robust optimization to compute sound upper bounds on policy violation probability without requiring independence assumptions between predicates. Evaluated on benchmarks for terminal and tool-calling agents, the approach outperforms prior art on the security-utility trade-off.

AI Safety Research Agent and Tool Ecosystem Datalog Efficient and Sound Probabilistic Verification for AI Agents distributionally robust optimization

5arXiv · cs.AI·24d ago·source ↗

FASE: Fast Adaptive Semantic Entropy for uncertainty quantification in multi-agent code generation

Researchers introduce Fast Adaptive Semantic Entropy (FASE), a metric for approximating functional correctness in LLM-generated code using minimum spanning trees of structural and semantic dissimilarity graphs, replacing costly LLM-driven equivalence checks. Evaluated on HumanEval and BigCodeBench with Qwen3-Embedding-8B, FASE achieves a 25% improvement in Spearman correlation and 19% increase in ROCAUC over prior semantic entropy methods. Critically, it requires only ~0.3% of the runtime cost of traditional semantic entropy approaches, making it practical for real-world multi-agent workflows.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen3 Embedding Fast Adaptive Semantic Entropy BigCodeBench +1 more

5arXiv · cs.CL·9d ago·source ↗

AdversaBench: Automated LLM red-teaming pipeline with multi-judge confirmation and cross-model transferability

AdversaBench is a new end-to-end red-teaming pipeline that mutates seed prompts using five structured operators and confirms failures via a three-judge panel with a meta-judge tiebreaker. Experiments on 45 seeds across reasoning, instruction-following, and tool-use categories produced confirmed failures for every seed. Key findings include sharp variation in operator effectiveness by category, misleading binary failure rates, judge agreement metrics distorted by label skew, and zero-shot transferability of adversarial prompts from Llama 3.1 8B to Llama 3.3 70B. Code and dataset are publicly released.

Evaluation and Benchmarking AI Safety Research Llama 3.1 70B AdversaBench Meta +1 more

5arXiv · cs.CL·17d ago·source ↗

Post-hoc falsification operators for frozen small code models fail to beat Best-of-N in leakage-free evaluation

A measurement study evaluates 26 post-hoc operators (selection, verification, repair, elimination, portfolios) applied to frozen small code models (≤1.5B parameters) against a Best-of-N baseline under a strict leakage-free, matched-compute protocol. None of the semantic operators improves held-out accuracy over BoN, with the failure traced to three structural mechanisms: a coverage wall, a capability scissors, and a near-empty consensus trap. Two non-semantic operators do provide value: an expression-layer recovery method (M1) lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4), and an adaptive consensus early-stop saves ~19% compute with no accuracy harm. The paper's core lesson is that harness quality and coverage measurement should precede investment in semantic post-hoc reasoning.

Evaluation and Benchmarking Inference Economics Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models deepseek-coder Best-of-N +2 more

6arXiv · cs.AI·16d ago·source ↗

VERITAS: Visual verification enables inference-time steering and autonomous improvement for robot policies

Researchers introduce VERITAS, a generator-verifier framework pairing a pre-trained generalist robot policy with a gradient-free visual verifier to steer actions at inference time without additional training. Verified rollouts are also used for offline self-improvement via fine-tuning, achieving performance gains comparable to expert demonstrations but without human intervention. The work demonstrates that inference-time verification is a scalable mechanism for autonomous policy improvement during deployment.

Inference Economics Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement VERITAS