benchmark

TxBench-PP

benchmark · active · provisional · txbench-pp-be119ae3 · 1 event · first seen 3d ago

Aliases: TxBench-PP

Co-occurring entities

Claude Opus 4.6 · OpenAI · TherapeuticsBench · GPT-5.5 · Anthropic

More like this (12)

SPBench · ITBench-AA · RepoBench · T1-Bench · BixBench · AdvBench · PseudoBench · ATE-Bench · MTBench · SupraBench · IFBench · RoleBench

Recent events (1)

arXiv · cs.AI · 3d ago

TxBench-PP: New benchmark reveals AI agents struggle with preclinical pharmacology decisions

Researchers introduce TxBench-PP (TherapeuticsBench Preclinical Pharmacology), a 100-evaluation benchmark testing AI agents on realistic small-molecule drug discovery tasks including mechanism-of-action reasoning, compound-target engagement, and translational efficacy. Agents receive real workflow snapshots and are graded deterministically on structured answers. Across 16 model-harness configurations and 4,800 trajectories, no system reliably succeeded; the best performer, Claude Opus 4.8 with the Pi harness, passed only 59.3% of endpoint attempts. The results suggest current frontier models are not yet deployment-ready for autonomous preclinical pharmacology decision-making.
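The figures in the summary can be related with a little arithmetic, and the idea of deterministic grading on structured answers can be illustrated with a toy grader. The sketch below is illustrative only: it assumes the 4,800 trajectories are split evenly across the 16 configurations and 100 evaluations (the summary does not say so), and the `grade` and `pass_rate` helpers and their field names are hypothetical, not the benchmark's actual grading code.

```python
# Figures quoted in the summary.
N_EVALUATIONS = 100
N_CONFIGURATIONS = 16
N_TRAJECTORIES = 4_800

# Assumption: an even split across configurations and evaluations.
per_configuration = N_TRAJECTORIES // N_CONFIGURATIONS  # trajectories per model-harness configuration
per_evaluation = per_configuration // N_EVALUATIONS     # repeated runs per evaluation


def grade(answer: dict, expected: dict) -> bool:
    """Hypothetical deterministic grader: pass only if every expected
    field is present in the structured answer with exactly the expected value."""
    return all(answer.get(key) == value for key, value in expected.items())


def pass_rate(results: list[bool]) -> float:
    """Percentage of graded attempts that passed (0.0 for an empty list)."""
    return 100.0 * sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    print(per_configuration, per_evaluation)
    expected = {"target": "EGFR", "engaged": True}  # made-up example fields
    attempts = [
        {"target": "EGFR", "engaged": True},
        {"target": "EGFR", "engaged": False},
    ]
    print(pass_rate([grade(a, expected) for a in attempts]))
```

Under the even-split assumption, each configuration would account for 300 trajectories, or 3 runs per evaluation. Exact-match grading of this kind involves no judge model, so the same answer always receives the same grade.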

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · OpenAI · TherapeuticsBench