Entity · benchmark

HumanEval

benchmark · active · humaneval-ecd28450 · 17 events · first seen May 18, 2026

Aliases: HumanEval, HumanEval+

More like this (12)

HumanEvalFIM, Codex HumanEval, HypoEval, L-Eval, G-Eval, ValueEval, T-Eval, CharacterEval, AI-assisted human evaluation, OpenAI Evals, Human Engineering Research Laboratories, DeepEval

Recent events (17)

5 · arXiv · cs.CL · 3d ago

Typed loop contracts for agentic code repair: evidence that stale traces degrade correctness retention

A new arXiv preprint studies the reliability gap in generate-test-revise loops used by coding agents, finding that forced revision cycles cause current correctness to drop from 0.820 to 0.673 even as ever-correct rises to 0.847. Controlled experiments with 2,430 branches show that stale traces harm 34 of 135 correct starts versus 4 of 135 with current traces, a statistically significant 22.2-percentage-point increase in the harm rate. The authors formalize the problem by separating admission, preservation, and certification concerns, then derive a typed loop contract with a mechanically enforceable reference implementation that binds verifier evidence to exact code states. The work is framed explicitly as a specification artifact rather than a claim of improved repair competence.

Evaluation and Benchmarking AI Safety Research HumanEval Looping Is Not Reliability: State-Bound Evidence and Typed Revision Contracts for Agentic Code Repair +1 more
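The 22.2-point gap quoted in the entry above can be reproduced from the reported counts. The sketch below is only that arithmetic, assuming both arms start from the same 135 correct programs as the summary states; the variable names are illustrative and it does not reproduce the paper's significance test.

```python
from fractions import Fraction

# Counts as reported in the summary above.
correct_starts = 135   # correct starting programs per arm
harmed_stale = 34      # broken after revision with stale traces
harmed_current = 4     # broken after revision with current traces

# Exact fractions avoid rounding until the final step.
rate_stale = Fraction(harmed_stale, correct_starts)
rate_current = Fraction(harmed_current, correct_starts)

# Difference in harm rate, expressed in percentage points.
gap_points = float(rate_stale - rate_current) * 100

print(f"stale: {float(rate_stale):.1%}, "
      f"current: {float(rate_current):.1%}, "
      f"gap: {gap_points:.1f} points")
# prints "stale: 25.2%, current: 3.0%, gap: 22.2 points"
```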

5 · arXiv · cs.AI · 4d ago

MineValiCoder: Bipartite graph mutual validation improves LLM-based test-driven code generation

MineValiCoder is a closed-loop test-driven development framework that addresses LLM stochasticity in automated code generation by combining test-case quality mining, parallel TDD refinement, and bipartite graph-based code-test mutual validation. The system filters faulty auto-generated test cases and uses validated feedback to iteratively optimize code candidates before selecting the best via mutual validation scoring. Evaluated across four LLMs, it achieves 96.34% Pass@1 on HumanEval, 87.40% on MBPP, 64.00% on APPS, and 51.33% on LiveCodeBench, outperforming prior state-of-the-art methods.

Evaluation and Benchmarking Agent and Tool Ecosystem APPS MineValiCoder LiveCodeBench +2 more

6 · The Batch · Jul 16, 2026

Data Points: Apple sues OpenAI; Meta Muse Spark 1.1; ChatGPT Work; IBM CodeAlchemy; OpenAI Atlas shutdown

A multi-item digest covers five significant AI developments: Apple sued OpenAI alleging trade secret theft via former employees including hardware chief Tang Tan; Meta released Muse Spark 1.1, a multimodal agentic model with 1M-token context and strong tool-use capabilities; OpenAI launched ChatGPT Work, a cloud-based workplace agent competing with Anthropic's Claude Cowork; IBM released CodeAlchemy, a 500B+ token synthetic code dataset with execution traces showing smaller models trained on it outperform those trained on much larger real-code corpora; and OpenAI shut down its Atlas browser in favor of a Chrome extension and desktop integration. These items collectively reflect intensifying competition across agentic products, synthetic data strategies, and legal disputes between major AI players.

Training Infrastructure Frontier Model Releases CodeAlchemy IBM Fidji Simo +17 more

5 · arXiv · cs.CL · Jun 16, 2026

Post-hoc falsification operators for frozen small code models fail to beat Best-of-N in leakage-free evaluation

A measurement study evaluates 26 post-hoc operators (selection, verification, repair, elimination, portfolios) applied to frozen small code models (≤1.5B parameters) against a Best-of-N baseline under a strict leakage-free, matched-compute protocol. None of the semantic operators improves held-out accuracy over BoN, with the failure traced to three structural mechanisms: a coverage wall, a capability scissors, and a near-empty consensus trap. Two non-semantic operators do provide value: an expression-layer recovery method (M1) lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4), and an adaptive consensus early-stop saves ~19% compute with no accuracy harm. The paper's core lesson is that harness quality and coverage measurement should precede investment in semantic post-hoc reasoning.

Evaluation and Benchmarking Inference Economics Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models deepseek-coder Best-of-N +2 more

5 · arXiv · cs.CL · Jun 10, 2026

ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models

Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.

Frontier Model Releases Inference Economics LLaDA-8B-Base MATH500 EB-Sampler +6 more

5 · arXiv · cs.AI · Jun 9, 2026

FASE: Fast Adaptive Semantic Entropy for uncertainty quantification in multi-agent code generation

Researchers introduce Fast Adaptive Semantic Entropy (FASE), a metric for approximating functional correctness in LLM-generated code using minimum spanning trees of structural and semantic dissimilarity graphs, replacing costly LLM-driven equivalence checks. Evaluated on HumanEval and BigCodeBench with Qwen3-Embedding-8B, FASE achieves a 25% improvement in Spearman correlation and a 19% increase in ROC AUC over prior semantic entropy methods. It requires only ~0.3% of the runtime cost of traditional semantic entropy approaches, making it practical for real-world multi-agent workflows.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen3 Embedding Fast Adaptive Semantic Entropy BigCodeBench +1 more

5 · arXiv · cs.CL · Jun 5, 2026

Pre-registered study finds Popperian code-generation prompt skills add no benefit beyond structural scaffolding

A pre-registered two-tier ablation study tests whether 'Popperian falsificationist' prompt skills improve LLM code generation through their procedural content or merely through structural scaffolding. Using Claude Sonnet 4.6 and Qwen2.5-Coder-0.5B with execution-based evaluation (HumanEval+ unit tests) rather than LLM-as-judge, the authors find that on the small model, structured prompts lift correctness by 20-22 points but the full Popperian skill shows no separable benefit over a labels-only scaffold. The paper contributes a calibrated negative result and a reusable disambiguation protocol for evaluating prompt-skill families, while also documenting that LLM self-judges at 0.5B scale perform no better than random selection.

Evaluation and Benchmarking Claude Sonnet 4 Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill HumanEval +2 more

7 · Anthropic News · Jun 3, 2026

Claude 3.5 Sonnet begins rollout on GitHub Copilot via Amazon Bedrock

Anthropic's Claude 3.5 Sonnet is now rolling out on GitHub Copilot, available in public preview for all Copilot Chat users in Visual Studio Code and GitHub.com. Anthropic claims the model has the top SWE-bench Verified score among publicly available models and scores 93.7% on HumanEval. The integration runs via Amazon Bedrock's cross-region inference and reaches GitHub's community of over 100 million developers, representing a significant distribution milestone for Claude.

Frontier Model Releases Enterprise Deployment Patterns Amazon Bedrock Microsoft GitHub +7 more

8 · Anthropic News · Jun 2, 2026

Introducing Claude 3.5 Sonnet

Anthropic launches Claude 3.5 Sonnet, the first model in its Claude 3.5 family, claiming it outperforms Claude 3 Opus and competitor models on GPQA, MMLU, and HumanEval benchmarks while operating at twice the speed and mid-tier pricing ($3/$15 per million tokens). The model features a 200K context window, improved vision capabilities, and an internal agentic coding evaluation score of 64% versus 38% for Opus. Alongside the model, Anthropic introduces Artifacts on Claude.ai, a dedicated workspace for real-time editing of AI-generated content. The model underwent pre-deployment evaluation by the UK AI Safety Institute and was assessed at ASL-2.

Long Context Evolution Frontier Model Releases claude.ai Thorn Amazon Bedrock +16 more

8 · Mistral AI News · Jun 1, 2026

Mistral AI Releases Mixtral 8x22B Under Apache 2.0

Mistral AI has released Mixtral 8x22B, a sparse Mixture-of-Experts model with 141B total parameters but only 39B active parameters, under the permissive Apache 2.0 license. The model features a 64K token context window, native function calling, multilingual support across five European languages, and strong math and coding performance. Mistral claims it outperforms all other open-weight models on standard benchmarks while being faster than dense 70B models due to sparse activation. An instruction-tuned version achieves 90.8% on GSM8K maj@8.

Frontier Model Releases Open Weights Progress Mistral AI Llama 2 70B Apache 2.0 +10 more
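The `maj@8` figure in the Mixtral entry above is commonly read as majority voting over eight sampled answers per problem. A minimal sketch under that assumption follows; the function names and the sample answers are hypothetical, and this is not Mistral's evaluation harness.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]

def maj_at_k_accuracy(samples_per_problem, references):
    """Fraction of problems whose majority-voted answer equals the reference."""
    hits = sum(
        majority_vote(samples) == reference
        for samples, reference in zip(samples_per_problem, references)
    )
    return hits / len(references)

# Hypothetical data: two problems, eight sampled final answers each.
samples = [
    ["18", "18", "20", "18", "18", "17", "18", "18"],  # majority "18"
    ["5", "7", "7", "5", "5", "9", "7", "5"],          # majority "5"
]
references = ["18", "7"]

accuracy = maj_at_k_accuracy(samples, references)
print(accuracy)  # prints 0.5
```

The second problem shows why the metric differs from "any sample correct": the right answer "7" appears three times but loses the vote to "5".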

8 · Mistral AI News · Jun 1, 2026

Mistral AI Releases Mistral Large, Claims Second-Best API Model After GPT-4

Mistral AI has released Mistral Large, its most capable model to date, claiming second place among API-accessible models behind GPT-4 on standard benchmarks including MMLU, HellaSwag, and coding/math evals. The model features a 32K context window, native fluency in five European languages, function calling, and constrained output mode. Simultaneously, Mistral is launching a new Mistral Small optimized for latency, restructuring its endpoint lineup, and announcing Microsoft Azure as its first major distribution partner. This marks Mistral's first significant commercial partnership and expansion beyond its own infrastructure.

Long Context Evolution Frontier Model Releases Azure AI Studio Mistral AI Llama 2 70B +13 more

7 · Mistral AI News · Jun 1, 2026

Mistral AI Releases Codestral: 22B Open-Weight Code Generation Model

Mistral AI has released Codestral, a 22B open-weight model explicitly designed for code generation, supporting 80+ programming languages with a 32k context window. The model is available under a non-production license on HuggingFace, with commercial licenses available on request, and is accessible via a dedicated API endpoint (codestral.mistral.ai) free during an 8-week beta. Codestral claims state-of-the-art performance on RepoBench, HumanEval, and fill-in-the-middle benchmarks, outperforming DeepSeek Coder 33B and matching or exceeding GPT-4-Turbo on some language-specific evals. Integrations are available with LlamaIndex, LangChain, Continue.dev, and Tabnine for IDE-based developer workflows.

Frontier Model Releases Evaluation and Benchmarking Mistral AI LlamaIndex GPT-4 Turbo +17 more

7 · Mistral AI News · Jun 1, 2026

Codestral 25.01: Mistral AI Releases Updated Coding Model with 2x Speed and Improved FIM Performance

Mistral AI has released Codestral 25.01, a significant upgrade to its Codestral coding model featuring a more efficient architecture and improved tokenizer that generates code approximately 2x faster than its predecessor. The model claims state-of-the-art performance for fill-in-the-middle (FIM) tasks across sub-100B parameter models, with a 256k context window and support for 80+ programming languages. Benchmarks show improvements over Codestral 2405 and competitive or superior results against DeepSeek Coder V2 Lite and DeepSeek Coder 33B on HumanEval and FIM metrics. The model is available via Mistral's API, IDE plugins (VS Code, JetBrains via Continue), and for on-premises/VPC deployment, with cloud availability on Vertex AI and Azure AI Foundry.

Frontier Model Releases Evaluation and Benchmarking Mistral AI HumanEvalFIM Azure Foundry +12 more

7 · arXiv · cs.LG · May 29, 2026

Entropy-Cut Metropolis-Hastings: Sampling-Based Reasoning Without RL Training

This paper introduces Entropy-Cut Metropolis-Hastings (ECMH), an algorithm that samples from a 'power distribution' over base language model outputs to elicit strong reasoning without reinforcement learning post-training. Rather than cutting reasoning traces at uniformly random positions, ECMH uses next-token entropy as a proxy to identify consequential decision points (e.g., choice of proof strategy), then resamples from those positions. The authors prove that mixing time scales with the number of decisions rather than tokens, and demonstrate consistent improvements over RL-trained models on MATH500, HumanEval, GPQA Diamond, and AIME26.

Frontier Model Releases Evaluation and Benchmarking power distribution MATH500 Entropy-Cut Metropolis-Hastings +6 more
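The idea of using next-token entropy to locate decision points, as described in the ECMH entry above, can be illustrated with a small sketch. This is not the authors' algorithm: it only computes Shannon entropy per position and returns the highest-entropy positions as candidate cut points. The function names, the top-`num_cuts` selection rule, and the toy distributions are all assumptions made for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_cut_points(step_distributions, num_cuts):
    """Return the indices of the num_cuts highest-entropy positions, in order.

    Assumption: high entropy marks a consequential decision; the paper's
    actual selection rule may differ.
    """
    entropies = [token_entropy(dist) for dist in step_distributions]
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:num_cuts])

# Toy trace: one next-token distribution per generated position.
trace = [
    [1.0, 0.0],                 # deterministic step, entropy 0
    [0.5, 0.5],                 # two-way choice
    [0.9, 0.1],                 # mostly decided
    [0.25, 0.25, 0.25, 0.25],   # four-way choice, highest entropy
]

cuts = pick_cut_points(trace, num_cuts=2)
print(cuts)  # prints [1, 3]
```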

8 · OpenAI Blog · May 20, 2026

Evaluating Large Language Models Trained on Code

OpenAI published research on evaluating large language models trained on code, introducing the Codex model and the HumanEval benchmark for assessing code generation capabilities. The work established foundational methodology for measuring functional correctness of code produced by LLMs using a pass@k metric. This paper became a landmark reference for code-focused LLM evaluation and influenced subsequent code generation research across the field.

Frontier Model Releases Evaluation and Benchmarking GPT-3 pass@k OpenAI +3 more
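The pass@k metric named in the entry above is usually computed with the unbiased estimator from the Codex paper: generate n samples per problem, count the c that pass the unit tests, and estimate 1 - C(n-c, k) / C(n, k). The sketch below assumes that form; the function names and the example counts are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical results: (samples generated, samples passing) per problem.
results = [(10, 3), (10, 0), (10, 10)]

print(round(mean_pass_at_k(results, k=1), 4))  # prints 0.4333
```

With k=1 the estimator reduces to the fraction of passing samples, c / n, which is why Pass@1 figures elsewhere in this feed can be read as per-sample success rates.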

5 · Hugging Face Blog · May 19, 2026

BigCodeBench: The Next Generation of HumanEval

Hugging Face introduces BigCodeBench, a new code generation benchmark designed to succeed HumanEval by offering more challenging and diverse programming tasks. The benchmark aims to better evaluate LLMs on real-world coding scenarios involving complex function calls and library usage. A leaderboard accompanies the release to track model performance across the community.

Evaluation and Benchmarking Agent and Tool Ecosystem BigCodeBench Hugging Face HumanEval

6 · DeepSeek News · May 18, 2026

DeepSeek-V2.5: Merged Open-Source Model Combining General and Coding Capabilities

DeepSeek has released DeepSeek-V2.5, an open-source model that merges DeepSeek-V2-Chat-0628 and DeepSeek-Coder-V2-0724 into a single unified model. The release improves general conversational capabilities, coding performance, instruction-following, and writing tasks while also strengthening safety properties, raising the overall safety score from 74.4% to 82.6% and reducing the safety spillover rate from 11.3% to 4.6%. The model is available via backward-compatible API endpoints (deepseek-chat and deepseek-coder) and on HuggingFace, retaining features like Function Calling, FIM completion, and JSON output. Benchmark results show improvements on HumanEval Python and LiveCodeBench, though SWE-bench Verified performance remains an acknowledged weak area.

Frontier Model Releases Evaluation and Benchmarking DeepSeek-V2-Chat-0628 DeepSeek V4 SWE-Bench Verified +8 more