Entity · benchmark

LiveCodeBench

benchmark · active · livecodebench-dab5688b · 13 events · first seen May 18, 2026

Aliases: LiveCodeBench, LiveCodeBench Pro, LiveCodeBench v6, LiveCodeBench-Pro-Dafny, LiveCodeBench v5

Merged from

LiveCodeBench-Pro-Dafny

More like this (12)

LiveCodeBench Leaderboard, BigCodeBench, LiveBench, PowerCodeBench, LabBench, JailbreakBench, HealthBench, SpecBench, VitaBench, ChipBench, SlopCodeBench, SkillsBench

Recent events (13)

5 · arXiv · cs.CL · 37h ago

Lightning OPD 2.0 mitigates style bias in cross-teacher on-policy distillation for reasoning models

A new arXiv preprint introduces Lightning OPD 2.0, a method for on-policy distillation (OPD) that addresses style bias when the SFT data generator and distillation teacher are different models. The approach uses rollout-level cross-fitting to estimate and subtract a 'style residual' from teacher-reference disagreement before constructing token-level updates. Starting from Klear-Reasoner-8B-SFT, the method achieves 82.4% on AIME 2024 and 63.0% on LiveCodeBench v5, outperforming the original Lightning OPD in cross-teacher settings. The work relaxes a key practical constraint in distillation pipelines by decoupling SFT data generation from the distillation teacher.

Evaluation and Benchmarking · Alignment and RLHF · Lightning OPD 2.0 · AIME 2026 · LiveCodeBench · +1 more

5 · arXiv · cs.AI · 5d ago

MineValiCoder: Bipartite graph mutual validation improves LLM-based test-driven code generation

MineValiCoder is a closed-loop test-driven development framework that addresses LLM stochasticity in automated code generation by combining test-case quality mining, parallel TDD refinement, and bipartite graph-based code-test mutual validation. The system filters faulty auto-generated test cases and uses validated feedback to iteratively optimize code candidates before selecting the best via mutual validation scoring. Evaluated across four LLMs, it achieves 96.34% Pass@1 on HumanEval, 87.40% on MBPP, 64.00% on APPS, and 51.33% on LiveCodeBench, outperforming prior state-of-the-art methods.

Evaluation and Benchmarking · Agent and Tool Ecosystem · APPS · MineValiCoder · LiveCodeBench · +2 more
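The summary above does not spell out how MineValiCoder scores the code-test bipartite graph. As a rough illustration only, the sketch below uses a HITS-style mutual reinforcement over a pass matrix. The function names, the iteration count, and the scoring rule are all assumptions for illustration, not the paper's algorithm.

```python
from typing import List, Tuple


def _normalize(values: List[float]) -> List[float]:
    """Scale scores to sum to 1; leave an all-zero vector unchanged."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else values


def mutual_validation(
    passes: List[List[bool]], iters: int = 20
) -> Tuple[List[float], List[float]]:
    """Hypothetical mutual scoring: passes[i][j] is True when candidate i passes test j.

    A test gains weight from the credibility of the candidates that pass it,
    and a candidate gains credibility from the weight of the tests it passes.
    """
    n_code, n_test = len(passes), len(passes[0])
    code = [1.0] * n_code
    test = [1.0] * n_test
    for _ in range(iters):
        test = _normalize(
            [sum(code[i] for i in range(n_code) if passes[i][j]) for j in range(n_test)]
        )
        code = _normalize(
            [sum(test[j] for j in range(n_test) if passes[i][j]) for i in range(n_code)]
        )
    return code, test


if __name__ == "__main__":
    # Candidate 2 passes only test 3, which no other candidate passes.
    matrix = [
        [True, True, True, False],
        [True, True, False, False],
        [False, False, False, True],
    ]
    code_scores, test_scores = mutual_validation(matrix)
    best = max(range(len(code_scores)), key=lambda i: code_scores[i])
    print("best candidate:", best)
```

In this toy matrix the isolated test 3 loses weight on each round, so candidate 0 ranks first. A real system would also need a rule for discarding low-weight tests, which this sketch leaves out.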

6 · arXiv · cs.CL · Jul 15, 2026

Function-Aware Fill-in-the-Middle Mid-Training Improves Coding Agent Foundation Models

Researchers propose a self-supervised mid-training objective called function-aware fill-in-the-middle (FIM) that exploits the structural isomorphism between a coding agent's action-observation-continuation loop and function call sites in ordinary code. Applied to Qwen2.5-Coder-Instruct (7B/14B) and Qwen3-8B on a 2.6B-token GitHub corpus, the method yields +2.8 to +5.4 point gains on SWE-Bench-Verified and SWE-Bench-Lite across multiple post-training pipelines. Notably, the technique also mitigates capability erosion on non-agent coding and tool-use benchmarks, suggesting the function-call inductive bias generalizes beyond the training domain.

Frontier Model Releases · Evaluation and Benchmarking · SWE-Smith · SWE-Bench Lite · Qwen2.5-Coder-32B-Instruct · +8 more

7 · The Batch · Jul 3, 2026

Sakana AI releases Fugu and Fugu-Ultra orchestrator models that spawn Claude, Gemini, and GPT agents

Sakana AI, a Tokyo-based research lab, released two dedicated orchestrator models—Fugu and Fugu-Ultra—that dynamically delegate tasks to a pool of underlying LLMs including Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 under a single API. Fugu-Ultra achieves state-of-the-art results on SWE-Bench Pro, Humanity's Last Exam, LiveCodeBench Pro, and GPQA-Diamond, outperforming individual frontier models on several benchmarks. The models are trained via supervised fine-tuning plus sep-CMA-ES evolutionary optimization and GRPO reinforcement learning to select the best worker model per subtask, with Fugu-Ultra using a sub-component called Conductor to coordinate parallel agentic workflows. The approach represents a commercially available alternative to dependence on any single frontier model, with pricing available via Sakana API, OpenRouter, and Vercel.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.1 Pro · Fugu · GRPO · +17 more

5 · arXiv · cs.AI · Jul 1, 2026

AxDafny: Agentic verified code generation framework achieves 92.7% on DafnyBench

Researchers introduce AxDafny, a verifier-guided agentic repair framework for generating formally verified Dafny code, including implementations, invariants, assertions, and termination arguments. The system achieves 92.7% verification success on DafnyBench, outperforming the strongest prior proof-hint baseline by 6.5 percentage points. The authors also release LCB-Pro-Dafny, a new benchmark of 250 competition-style problems translated into Dafny with formal specifications. The paper additionally finds that verification success and runtime test performance capture distinct dimensions of code quality.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AxDafny · LiveCodeBench · DafnyBench · +1 more

6 · arXiv · cs.LG · Jun 26, 2026

RiVER framework enables RL training of LLMs on tasks without ground-truth solutions

Researchers introduce RiVER (Ranking-induced VERifiable framework), a reinforcement learning approach that trains LLMs on score-based optimization tasks using deterministic execution feedback as continuous rewards, without requiring ground-truth answers. The method addresses two failure modes in group-relative RL with continuous rewards—scale dominance and frequency dominance—via calibrated, instance-wise reward shaping. Applied to Qwen3-8B and GLM-Z1-9B-0414 on competitive programming tasks, RiVER improves ALE rating rank by ~9% and also transfers to exact-solution benchmarks (LiveCodeBench, USACO) with 2-4% absolute gains, unlike raw-score baselines. The result suggests score-based heuristic tasks can serve as general-purpose RL training environments for coding ability.

Evaluation and Benchmarking · Alignment and RLHF · USACO · Qwen3-4B · LiveCodeBench · +3 more

5 · arXiv · cs.AI · Jun 19, 2026

Multi-LCB extends LiveCodeBench to twelve programming languages for cross-language code evaluation

Researchers introduce Multi-LCB, a benchmark that extends the widely used LiveCodeBench (LCB) to twelve programming languages by transforming Python tasks into equivalent tasks in other languages while preserving LCB's contamination controls. The benchmark evaluates 24 LLMs and uncovers Python overfitting, language-specific contamination, and large performance disparities across languages. Multi-LCB is designed to auto-update with future LCB releases, making it a living benchmark for multilingual code generation assessment.

Frontier Model Releases · Evaluation and Benchmarking · LiveCodeBench · Multi-LCB

7 · Mistral AI News · Jun 1, 2026

Codestral 25.01: Mistral AI Releases Updated Coding Model with 2x Speed and Improved FIM Performance

Mistral AI has released Codestral 25.01, a significant upgrade to its Codestral coding model featuring a more efficient architecture and improved tokenizer that generates code approximately 2x faster than its predecessor. The model claims state-of-the-art performance for fill-in-the-middle (FIM) tasks across sub-100B parameter models, with a 256k context window and support for 80+ programming languages. Benchmarks show improvements over Codestral 2405 and competitive or superior results against DeepSeek Coder V2 lite and DeepSeek Coder 33B on HumanEval and FIM metrics. The model is available via Mistral's API, IDE plugins (VS Code, JetBrains via Continue), and for on-premises/VPC deployment, with cloud availability on Vertex AI and Azure AI Foundry.

Frontier Model Releases Evaluation and Benchmarking Mistral AI HumanEvalFIM Azure Foundry +12 more

6 · arXiv · cs.CL · May 27, 2026

Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference

PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.

Long Context Evolution · Inference Economics · On-Policy Distillation (OPD) · Multi-Token Prediction (MTP) · speculative decoding · +7 more
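Several entries in this feed report pass@k figures (Pass@1, pass@4). For readers unfamiliar with the metric, here is a minimal sketch of the commonly used unbiased estimator: with n samples per problem and c of them correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names are illustrative, and none of the papers above are confirmed to use exactly this procedure.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no problems given")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical counts: 10 samples per problem, with 3, 0, and 10 correct.
    print(mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=4))
```

Because the estimator averages over all size-k subsets, it has lower variance than drawing exactly k samples and checking whether any of them passes.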

5 · Hugging Face Blog · May 19, 2026

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Hugging Face introduces a leaderboard based on LiveCodeBench, a benchmark designed for holistic and contamination-free evaluation of code-generating large language models. The benchmark continuously collects new coding problems from competitive programming platforms to prevent data contamination that plagues static benchmarks. It evaluates models across multiple code-related tasks beyond just code generation, aiming to provide a more reliable signal of true model capability.

Evaluation and Benchmarking · Agent and Tool Ecosystem · LiveCodeBench · Hugging Face · LiveCodeBench Leaderboard

6 · The Batch · May 18, 2026

Data Points: Thinking Machines Interaction Model, ERNIE 5.1, Co-Mathematician, RL Conductor, and More

This edition of The Batch covers five notable AI developments: Thinking Machines' research preview of an 'interaction model' with a 200ms micro-turn multimodal architecture; Baidu's ERNIE 5.1, a compressed derivative of ERNIE 5.0 using only 6% of typical pre-training compute; Google DeepMind's Co-Mathematician collaborative workbench reaching 48% on FrontierMath Tier 4; a 7B RL Conductor model that orchestrates multi-agent workflows via reinforcement learning; and Google's Magic Pointer cursor system powered by Gemini. Secondary items include GitHub Copilot pricing restructuring ahead of usage-based billing.

Training Infrastructure · Frontier Model Releases · Thinking Machines · SGLang · GitHub · +21 more

6 · DeepSeek News · May 18, 2026

DeepSeek-V2.5: Merged Open-Source Model Combining General and Coding Capabilities

DeepSeek has released DeepSeek-V2.5, an open-source model that merges DeepSeek-V2-Chat-0628 and DeepSeek-Coder-V2-0724 into a single unified model. The release improves general conversational capabilities, coding performance, instruction-following, and writing tasks while also strengthening safety properties, raising the overall safety score from 74.4% to 82.6% and reducing the safety spillover rate from 11.3% to 4.6%. The model is available via backward-compatible API endpoints (deepseek-chat and deepseek-coder) and on Hugging Face, retaining features like Function Calling, FIM completion, and JSON output. Benchmark results show improvements on HumanEval Python and LiveCodeBench, though SWE-verified performance remains an acknowledged weak area.

Frontier Model Releases · Evaluation and Benchmarking · DeepSeek-V2-Chat-0628 · DeepSeek V4 · SWE-Bench Verified · +8 more

8 · Mistral AI News · May 18, 2026

Mistral Small 4: Unified Multimodal, Reasoning, and Coding MoE Model Released Under Apache 2.0

Mistral AI has released Mistral Small 4, a 119B-parameter Mixture-of-Experts model (6B active per token) that unifies capabilities previously split across Magistral (reasoning), Pixtral (multimodal), and Devstral (coding agents) into a single open-weights model. The model features a 256k context window, configurable reasoning effort via a `reasoning_effort` parameter, native text and image input support, and is released under Apache 2.0. Mistral claims 40% latency reduction and 3x throughput improvement over Mistral Small 3, with benchmark results showing competitive performance against GPT-OSS 120B and Qwen models while producing significantly shorter outputs. The release includes day-0 availability as an NVIDIA NIM and support across vLLM, llama.cpp, SGLang, and Transformers.

Long Context Evolution · Frontier Model Releases · Mistral AI · Mistral Small 4 · Pixtral · +14 more