Almanac

ABC-Bench

benchmark · active · provisional · abc-bench-14bd3fd7 · 1 event · first seen 6d ago

Aliases: ABC-Bench

Recent events (1)

8 · arXiv · cs.AI · 6d ago

ABC-Bench: Agentic biosecurity benchmark finds LLM agents surpass median expert humans on dual-use biology tasks

Researchers introduce ABC-Bench, a benchmark that evaluates LLM agents on three biosecurity-relevant biology tasks: liquid-handling robot programming, DNA fragment design, and evasion of DNA synthesis screening. All tested agents outperformed the median expert human baseline on all three tasks. Wet-lab validation confirmed that scripts produced by OpenAI's o4-mini-high successfully assembled DNA on an Opentrons robot. The results indicate a shift in the biosecurity risk landscape as AI agents acquire practical, wet-lab-adjacent capabilities.