8 · arXiv cs.AI (Artificial Intelligence) · 10d ago

ABC-Bench: Agentic biosecurity benchmark finds LLM agents surpass median expert humans on dual-use biology tasks

Researchers introduce ABC-Bench, a benchmark evaluating LLM agents on biosecurity-relevant biology tasks including liquid-handling robot programming, DNA fragment design, and evasion of DNA synthesis screening. All tested agents outperformed the median expert human baseline across all three tasks. Wet-lab validation confirmed that OpenAI's o4-mini-high produced scripts that successfully assembled DNA on an OpenTrons robot. The results highlight a meaningful shift in the biosecurity risk landscape as AI agents acquire practical wet-lab-adjacent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · AI Safety Research · ABC-Bench · OpenTrons · o4-mini-high · OpenAI

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Related events (8)

7 · OpenAI Blog · 1mo ago

Building an Early Warning System for LLM-Aided Biological Threat Creation

OpenAI published a blueprint for evaluating whether LLMs can meaningfully assist in biological threat creation. In a controlled study with biology experts and students, GPT-4 was found to provide at most mild uplift in biological threat creation accuracy. The results are inconclusive but are framed as a starting point for ongoing safety research and community deliberation on biosecurity risks from AI.

Evaluation and Benchmarking · AI Safety Research · biological threat creation evaluation · OpenAI · GPT-4

8 · Anthropic News · 17d ago

Anthropic Frontier Red Team reports early-warning signs of rapid AI progress in cybersecurity and biosecurity capabilities

Anthropic's Frontier Red Team published findings from a year of safety evaluations across four model releases, documenting rapid capability gains in dual-use domains. In cybersecurity, Claude 3.7 Sonnet now solves roughly a third of Cybench CTF challenges (up from ~5% a year ago), and with the Incalmo toolset was able to replicate a large-scale network attack in realistic cyber range environments. In biosecurity, Claude has moved from underperforming virology experts to exceeding them on the VCT benchmark within one year, and exceeds human expert baselines on cloning workflows. Anthropic assesses current models as showing 'early warning' signs but not yet crossing thresholds of substantially elevated national security risk.

Frontier Model Releases · Evaluation and Benchmarking · Intercode CTF · Carnegie Mellon University · LabBench · +7 more

6 · OpenAI Blog · 1mo ago

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

OpenAI introduces MLE-bench, a benchmark designed to measure AI agent performance on machine learning engineering tasks. The benchmark draws from Kaggle competitions to evaluate agents on realistic ML engineering workflows. Initial results show that current agents, including those powered by o1-preview, achieve competitive performance on a subset of tasks but fall well short of top human competitors. The benchmark is intended to track progress in agentic ML capabilities over time.

Frontier Model Releases · Evaluation and Benchmarking · Kaggle · o1-preview · MLE-bench · +2 more

6 · arXiv · cs.AI · 12d ago

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · SWE-bench · AARRI-Bench · +2 more

7 · OpenAI Blog · 1mo ago

OpenAI and Los Alamos National Laboratory Announce Research Partnership on Biosafety Evaluations

OpenAI and Los Alamos National Laboratory (LANL) have announced a research partnership focused on developing safety evaluations for frontier AI models. The collaboration specifically targets assessing and measuring biological capabilities and risks. LANL brings national-lab-level biosecurity expertise to the effort, which aligns with OpenAI's broader preparedness framework for catastrophic risk domains.

Evaluation and Benchmarking · AI Safety Research · Los Alamos National Laboratory · biological risk evaluation · Preparedness Framework · +2 more

6 · OpenAI Blog · 1mo ago

Preparing for future AI risks in biology

OpenAI has published a post outlining its proactive approach to assessing and mitigating biosecurity risks from advanced AI systems capable of biological applications. The piece describes capability evaluations and safeguards designed to prevent misuse of AI in biology and medicine. This reflects OpenAI's ongoing effort to get ahead of dual-use risks before capabilities reach dangerous thresholds.

Evaluation and Benchmarking · AI Safety Research · OpenAI · biology/medicine · dual-use AI · AI biosecurity risk assessment

6 · arXiv · cs.AI · 2d ago

TxBench-PP: New benchmark reveals AI agents struggle with preclinical pharmacology decisions

Researchers introduce TxBench-PP (TherapeuticsBench Preclinical Pharmacology), a 100-evaluation benchmark testing AI agents on realistic small-molecule drug discovery tasks including mechanism-of-action reasoning, compound-target engagement, and translational efficacy. Agents receive real workflow snapshots and are graded deterministically on structured answers. Across 16 model-harness configurations and 4,800 trajectories, no system reliably succeeded; the best performer, Claude Opus 4.8 with the Pi harness, passed only 59.3% of endpoint attempts. The results suggest current frontier models are not yet deployment-ready for autonomous preclinical pharmacology decision-making.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · OpenAI · TherapeuticsBench · +3 more

6 · arXiv · cs.AI · 15d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Benchmark Everything Everywhere All at Once · Benchmark Agent