Entity · benchmark

MBPP

benchmark · active · mbpp-bfa40e60 · 6 events · first seen Jun 1, 2026

Aliases: MBPP, MBPP+

More like this (12)

MMDP · MPP-AViT · MMBench2 · MMMU-Pro · MA²P · ENPMR-Bench · MM-EPC · WPP · MP3D · DDPM · MCP · ClaMPAPP

Recent events (6)

5 · arXiv · cs.AI · 5d ago

MineValiCoder: Bipartite graph mutual validation improves LLM-based test-driven code generation

MineValiCoder is a closed-loop test-driven development framework that addresses LLM stochasticity in automated code generation by combining test-case quality mining, parallel TDD refinement, and bipartite graph-based code-test mutual validation. The system filters faulty auto-generated test cases and uses validated feedback to iteratively optimize code candidates before selecting the best via mutual validation scoring. Evaluated across four LLMs, it achieves 96.34% Pass@1 on HumanEval, 87.40% on MBPP, 64.00% on APPS, and 51.33% on LiveCodeBench, outperforming prior state-of-the-art methods.

Evaluation and Benchmarking · Agent and Tool Ecosystem · APPS · MineValiCoder · LiveCodeBench · +2 more
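The summary above does not give MineValiCoder's scoring rule, so the following is only a minimal sketch of one way code-test mutual validation on a bipartite graph could work. It swaps in a HITS-style iteration: a test is trusted in proportion to the candidates that pass it, and a candidate is trusted in proportion to the tests it passes. The function names, the pass matrix layout, and the iteration count are all hypothetical.

```python
from typing import List, Tuple


def mutual_validation_scores(
    passes: List[List[bool]], iterations: int = 20
) -> Tuple[List[float], List[float]]:
    """HITS-style scoring on a code-test bipartite graph (illustrative only).

    passes[i][j] is True when code candidate i passes test case j.
    Assumes at least one candidate and one test.
    Returns (code_scores, test_scores), each normalised to sum to 1.
    """
    n_codes, n_tests = len(passes), len(passes[0])
    code_scores = [1.0 / n_codes] * n_codes
    test_scores = [1.0 / n_tests] * n_tests
    for _ in range(iterations):
        # A test gains weight from the candidates that pass it.
        test_scores = [
            sum(code_scores[i] for i in range(n_codes) if passes[i][j])
            for j in range(n_tests)
        ]
        total = sum(test_scores) or 1.0
        test_scores = [s / total for s in test_scores]
        # A candidate gains weight from the tests it passes.
        code_scores = [
            sum(test_scores[j] for j in range(n_tests) if passes[i][j])
            for i in range(n_codes)
        ]
        total = sum(code_scores) or 1.0
        code_scores = [s / total for s in code_scores]
    return code_scores, test_scores


def select_best(passes: List[List[bool]]) -> int:
    """Return the index of the highest-scoring candidate (first one on ties)."""
    code_scores, _ = mutual_validation_scores(passes)
    return max(range(len(code_scores)), key=code_scores.__getitem__)
```

In this sketch, a test that only an outlier candidate passes loses weight over the iterations, which mirrors the idea of filtering faulty auto-generated tests before they influence candidate selection.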

5 · arXiv · cs.CL · Jul 17, 2026

Mask-Aware Policy Gradients improve RL training for Masked Diffusion Language Models

A new arXiv preprint introduces a two-stage action MDP formalization for applying reinforcement learning to Masked Diffusion Language Models (MDLMs), decomposing the policy gradient into a token prediction term and a masking order term. Prior approaches ignored the position-unmasking decision, leading to intractable log-likelihood estimates; the proposed method optimizes both terms jointly. The approach achieves 87.1% on GSM8K and 53.4% on MBPP, claiming state-of-the-art results for MDLM-based reasoning and coding.

Evaluation and Benchmarking · Alignment and RLHF · Diffusion Language Models · Mask-Aware Policy Gradients for Diffusion Language Models · MBPP · +1 more
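The decomposition described in the summary can be written out as follows, under the assumption that each unmasking step factorises into a position choice followed by a token choice. The notation (i_t, x_t, A_t, and the "pos"/"tok" superscripts) is illustrative and not taken from the paper, whose exact estimator may differ.

```latex
% Assumed factorisation of one unmasking step (illustrative notation):
%   i_t : position chosen to unmask,  x_t : token written at that position
\pi_\theta(a_t \mid s_t)
  = \pi_\theta^{\mathrm{pos}}(i_t \mid s_t)\,
    \pi_\theta^{\mathrm{tok}}(x_t \mid s_t, i_t)

% The log-policy splits into two terms, and so does a REINFORCE-style gradient
% with advantage estimate A_t:
\nabla_\theta J(\theta)
  = \mathbb{E}\Big[\sum_t A_t\,\big(
      \nabla_\theta \log \pi_\theta^{\mathrm{tok}}(x_t \mid s_t, i_t)
      + \nabla_\theta \log \pi_\theta^{\mathrm{pos}}(i_t \mid s_t)
    \big)\Big]
```

Dropping the second term corresponds to ignoring the position-unmasking decision, which is the gap the summary attributes to prior approaches.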

6 · arXiv · cs.CL · Jul 15, 2026

Double Ratchet: Co-evolving evaluation metrics and skills for self-improving LLM agents

A new arXiv preprint introduces Double Ratchet, a system that co-evolves both evaluation metrics and agent skills in settings where no reliable automatic verifier exists. The metric loop uses evolutionary search over small drawback detectors anchored to a small reference set, while the skill loop uses a lifecycle-managed approach; together they retain 88–110% of the performance lift achievable with ground-truth metrics across code generation (MBPP+), text-to-SQL (Spider 2.0-Snow), and report generation tasks. The paper also addresses safety, showing that anchor discipline and outer audits can catch and repair cases where evolved skills game the rubric. This work directly addresses a core bottleneck in self-improving agent systems: the chicken-and-egg problem of needing a reliable evaluator to improve.

Evaluation and Benchmarking · AI Safety Research · Double Ratchet · MBPP · Who Grades the Grader? Co-Evolving Evaluation Metrics and Skills for Self-Improving LLM Agents · +2 more

5 · arXiv · cs.CL · Jun 16, 2026

Post-hoc falsification operators for frozen small code models fail to beat Best-of-N in leakage-free evaluation

A measurement study evaluates 26 post-hoc operators (selection, verification, repair, elimination, portfolios) applied to frozen small code models (≤1.5B parameters) against a Best-of-N baseline under a strict leakage-free, matched-compute protocol. None of the semantic operators improves held-out accuracy over BoN, with the failure traced to three structural mechanisms: a coverage wall, a capability scissors, and a near-empty consensus trap. Two non-semantic operators do provide value: an expression-layer recovery method (M1) lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4), and an adaptive consensus early-stop saves ~19% compute with no accuracy harm. The paper's core lesson is that harness quality and coverage measurement should precede investment in semantic post-hoc reasoning.

Evaluation and Benchmarking · Inference Economics · Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models · deepseek-coder · Best-of-N · +2 more
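The adaptive consensus early-stop mentioned in the summary can be illustrated with a small sketch. The paper's actual stopping rule is not given here, so everything below is an assumption: the function name, the `sample` callback (which is taken to return a hashable signature of one sampled program, for example its outputs on public inputs), and the `min_n` and `agree` thresholds are all hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def consensus_early_stop(
    sample: Callable[[int], Hashable],
    max_n: int = 16,
    min_n: int = 4,
    agree: float = 0.75,
) -> Tuple[Hashable, int]:
    """Draw up to max_n samples, stopping once one signature dominates.

    sample(k) returns a hashable signature for the k-th draw.
    Returns (leading signature, number of samples actually drawn).
    Assumes max_n >= 1.
    """
    counts: Counter = Counter()
    leader: Hashable = None
    for drawn in range(1, max_n + 1):
        counts[sample(drawn - 1)] += 1
        leader, votes = counts.most_common(1)[0]
        # Stop early when enough samples exist and the leader's share is high.
        if drawn >= min_n and votes / drawn >= agree:
            return leader, drawn
    # No consensus reached: fall back to the plurality leader after max_n draws.
    return leader, max_n
```

When every draw agrees, the loop stops after `min_n` samples instead of `max_n`; when draws never agree, it spends the full budget, which is the same cost as a plain Best-of-N run.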

5 · arXiv · cs.CL · Jun 10, 2026

ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models

Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.

Frontier Model Releases · Inference Economics · LLaDA-8B-Base · MATH500 · EB-Sampler · +6 more
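To make the greedy, attention-discounted selection in the summary concrete, here is a rough sketch. It is not the ADAS rule itself, whose exact formula is not given above. The scoring function (confidence minus attention to already-selected positions, weighted by their uncertainty), the `penalty` weight, and the function name are hypothetical choices for illustration.

```python
from typing import List, Sequence


def attention_discounted_select(
    confidence: Sequence[float],
    attention: Sequence[Sequence[float]],
    k: int,
    penalty: float = 1.0,
) -> List[int]:
    """Greedily pick up to k masked positions to commit in one decoding step.

    confidence[i]   : model confidence for the top token at position i.
    attention[i][j] : how strongly position i attends to position j.
    A candidate is discounted for attending to positions that were already
    selected while still uncertain (a soft penalty, not a hard constraint).
    """
    selected: List[int] = []
    remaining = set(range(len(confidence)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            discount = sum(
                attention[i][j] * (1.0 - confidence[j]) for j in selected
            )
            return confidence[i] - penalty * discount

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `penalty=0.0` this reduces to plain top-k confidence selection, so the sketch behaves like a reranking layer on top of an existing confidence-based sampler.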

7 · Mistral AI News · Jun 1, 2026

Mistral AI Releases Codestral: 22B Open-Weight Code Generation Model

Mistral AI has released Codestral, a 22B open-weight model explicitly designed for code generation, supporting 80+ programming languages with a 32k context window. The model is available under a non-production license on HuggingFace, with commercial licenses available on request, and is accessible via a dedicated API endpoint (codestral.mistral.ai) free during an 8-week beta. Codestral claims state-of-the-art performance on RepoBench, HumanEval, and fill-in-the-middle benchmarks, outperforming DeepSeek Coder 33B and matching or exceeding GPT-4-Turbo on some language-specific evals. Integrations are available with LlamaIndex, LangChain, Continue.dev, and Tabnine for IDE-based developer workflows.

Frontier Model Releases · Evaluation and Benchmarking · Mistral AI · LlamaIndex · GPT-4 Turbo · +17 more