5 · arXiv cs.AI (Artificial Intelligence) · 11d ago

FASE: Fast Adaptive Semantic Entropy for uncertainty quantification in multi-agent code generation

Researchers introduce Fast Adaptive Semantic Entropy (FASE), a metric for approximating functional correctness in LLM-generated code using minimum spanning trees of structural and semantic dissimilarity graphs, replacing costly LLM-driven equivalence checks. Evaluated on HumanEval and BigCodeBench with Qwen3-Embedding-8B, FASE achieves a 25% improvement in Spearman correlation and a 19% increase in ROC AUC over prior semantic entropy methods. It requires only ~0.3% of the runtime of traditional semantic entropy approaches, making it practical for real-world multi-agent workflows.
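The summary does not give FASE's formula, so the following is only a rough illustration of the general idea of scoring disagreement with a minimum spanning tree, not the paper's method. It assumes each sampled program has already been turned into an embedding vector (toy hand-written vectors here, not Qwen3-Embedding-8B outputs), weights a complete graph by cosine dissimilarity, and sums the MST edge weights as a dispersion score. The function names and the choice of cosine dissimilarity are assumptions made for this sketch.

```python
import math


def cosine_dissimilarity(u, v):
    """1 - cosine similarity; 0.0 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mst_dispersion(vectors):
    """Total edge weight of a minimum spanning tree (Prim's algorithm)
    over the complete graph of pairwise cosine dissimilarities.
    Hypothetical proxy: a larger total means the samples disagree more."""
    n = len(vectors)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge linking each node to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the cheapest node that is not yet in the tree.
        i = min((j for j in range(n) if not in_tree[j]), key=lambda j: best[j])
        in_tree[i] = True
        total += best[i]
        # Relax the edges from the newly added node.
        for j in range(n):
            if not in_tree[j]:
                d = cosine_dissimilarity(vectors[i], vectors[j])
                if d < best[j]:
                    best[j] = d
    return total


if __name__ == "__main__":
    agreeing = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    split = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(mst_dispersion(agreeing), mst_dispersion(split))
```

Because the score needs only pairwise distances and one MST pass, it involves no LLM calls once the embeddings exist, which is the kind of saving the summary attributes to replacing LLM-driven equivalence checks.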

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen3 Embedding · Fast Adaptive Semantic Entropy · BigCodeBench · HumanEval

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 11d ago

Three-axis uncertainty estimation framework for code generation outperforms NL-derived baselines

A new arXiv preprint argues that uncertainty estimation (UE) for code generation requires code-specific design rather than methods ported from natural language. The authors propose three orthogonal uncertainty axes—lexical (token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency)—grounded in properties unique to code: token fragility, intent-code gap, and executability. Evaluated across five code LLMs, their ensemble improves average AUROC from 0.696 to 0.776 (+8.1 points) over the strongest NL-derived baseline, with a single-pass token entropy method on Qwen3-14B matching multi-pass baselines at 3x lower cost. The work is directly relevant to safe deployment of LLMs in agentic coding pipelines.
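The lexical axis is described only as "token entropy". As a small sketch of what a single-pass score of that kind might look like, the code below computes the Shannon entropy of each next-token distribution and averages it over the generated sequence. Averaging is an assumption; the summary does not say how the paper aggregates per-token values, and the function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(distributions):
    """Average per-token entropy across a generated sequence.
    Assumed aggregation: higher values mean the model was less certain."""
    if not distributions:
        return 0.0
    return sum(token_entropy(d) for d in distributions) / len(distributions)


if __name__ == "__main__":
    confident = [[0.99, 0.01], [0.98, 0.02]]
    unsure = [[0.5, 0.5], [0.6, 0.4]]
    print(mean_token_entropy(confident), mean_token_entropy(unsure))
```

A score like this needs only the probabilities from the one forward pass that produced the code, which is why a single-pass method can cost less than approaches that sample several completions.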

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen3-14B · Code Is More Than Text: Uncertainty Estimation for Code Generation

5 · arXiv · cs.CL · 4d ago

Post-hoc falsification operators for frozen small code models fail to beat Best-of-N in leakage-free evaluation

A measurement study evaluates 26 post-hoc operators (selection, verification, repair, elimination, portfolios) applied to frozen small code models (≤1.5B parameters) against a Best-of-N baseline under a strict leakage-free, matched-compute protocol. None of the semantic operators improves held-out accuracy over BoN, with the failure traced to three structural mechanisms: a coverage wall, a capability scissors, and a near-empty consensus trap. Two non-semantic operators do provide value: an expression-layer recovery method (M1) lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4), and an adaptive consensus early-stop saves ~19% compute with no accuracy harm. The paper's core lesson is that harness quality and coverage measurement should precede investment in semantic post-hoc reasoning.
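The summary does not describe the stopping rule behind the adaptive consensus early-stop. One common rule, shown here purely as an assumed illustration, is to keep sampling until the leading answer is so far ahead that the remaining budget could not change the majority. `sampler`, `min_samples`, and the function name are hypothetical; the sampled values stand in for any hashable summary of a candidate's behavior.

```python
from collections import Counter


def sample_until_consensus(sampler, max_samples, min_samples=3):
    """Draw candidates one at a time and stop once the leader cannot be
    overtaken with the samples left. Returns (winning_answer, samples_used)."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    counts = Counter()
    for drawn in range(1, max_samples + 1):
        counts[sampler()] += 1
        ranked = counts.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        lead = ranked[0][1] - runner_up
        remaining = max_samples - drawn
        # Even if every remaining sample went to the runner-up,
        # the leader would still win, so further sampling is wasted.
        if drawn >= min_samples and lead > remaining:
            return ranked[0][0], drawn
    return counts.most_common(1)[0][0], max_samples


if __name__ == "__main__":
    answers = iter(["a"] * 10)
    print(sample_until_consensus(lambda: next(answers), max_samples=10))
```

With a budget of 10 and unanimous samples, this rule stops after 6 draws; how much compute a real run saves depends on how often the samples agree.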

Evaluation and Benchmarking · Inference Economics · Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models · deepseek-coder · Best-of-N

5 · arXiv · cs.CL · 17d ago

Clustered Self-Assessment: LLM uncertainty quantification via semantic clustering and multiple-choice self-evaluation

A new arXiv preprint proposes Clustered Self-Assessment, a method for uncertainty quantification in LLMs that groups sampled generations into semantically distinct clusters, reformats them as multiple-choice options, and uses the model's own probability assignments as confidence estimates. The approach outperforms entropy-based baselines across multiple models and datasets, achieving competitive performance with as few as two additional samples. The method is notable for directly leveraging the model's self-assessment capability rather than relying on indirect distributional signals.

Evaluation and Benchmarking · AI Safety Research · Clustered Self-Assessment

6 · arXiv · cs.CL · 25d ago

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency ~19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across the GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (Qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture in which semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and its pre-registered replication.

Evaluation and Benchmarking · AI Safety Research · Qwen2.5-7B-Instruct-1M · ReAct · stealth-divergence

5 · arXiv · cs.CL · 9d ago

Information-theoretic metric for measuring semantic progress in multi-turn dialogue

A new arXiv preprint formalizes 'semantic progress' in multi-turn dialogue as question-conditioned uncertainty reduction and introduces an information-theoretic metric approximated in embedding space using a Gaussian formulation with closed-form updates. The metric has desirable theoretical properties (monotonicity, additive decomposition, diminishing returns) and requires no autoregressive inference at evaluation time, making it reproducible and lightweight. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show competitive or improved agreement with human judgments compared to several LLM-as-a-judge baselines. The approach works with lightweight embedding models under CPU-only execution.

Evaluation and Benchmarking · Chatbot Arena · MT-Bench · UltraFeedback

6 · arXiv · cs.CL · 5d ago

AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds

AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components, instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeliveryBench · AgentSpec · MiniGrid

7 · arXiv · cs.AI · 26d ago

Agentic Proving for Program Verification: Claude Code Achieves 98.1% on CLEVER Benchmark

Researchers evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation, achieving 98.1% end-to-end success on program generation and verification over self-consistent entries. The system generates valid specifications for 98.8% of problems and certifies implementations against ground-truth specifications for 87.5% of problems. The results reveal a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, motivating calls for more rigorous evaluation methodologies. The findings support compiler-in-the-loop agentic paradigms as the current state of the art for foundational program verification.

Evaluation and Benchmarking · AI Safety Research · CLEVER · isomorphism-based scoring · agentic proving

6 · arXiv · cs.CL · 24d ago

SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering

SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs), a mechanistic interpretability tool, to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves a 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.

Training Infrastructure · Evaluation and Benchmarking · mechanistic interpretability · GRPO · Reinforcement Learning from Human Feedback