TxBench-PP
benchmark · active · provisional
txbench-pp-be119ae3 · 1 event · first seen 3d ago
Aliases: TxBench-PP
Recent events (1)
TxBench-PP: New benchmark reveals AI agents struggle with preclinical pharmacology decisions
Researchers introduce TxBench-PP (TherapeuticsBench Preclinical Pharmacology), a 100-evaluation benchmark testing AI agents on realistic small-molecule drug discovery tasks including mechanism-of-action reasoning, compound-target engagement, and translational efficacy. Agents receive real workflow snapshots and are graded deterministically on structured answers. Across 16 model-harness configurations and 4,800 trajectories, no system reliably succeeded; the best performer, Claude Opus 4.8 with the Pi harness, passed only 59.3% of endpoint attempts. The results suggest current frontier models are not yet deployment-ready for autonomous preclinical pharmacology decision-making.
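To make "graded deterministically on structured answers" concrete, here is a minimal Python sketch of what exact-match grading and a pass-rate calculation might look like. It is not the benchmark's actual grader: the `Attempt` class, the `grade` and `pass_rate` functions, and the field names in the toy data are all hypothetical. The three-attempts-per-evaluation figure is inferred from the reported totals (4,800 trajectories across 16 configurations and 100 evaluations) and is not stated in the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One agent trajectory's final structured answer (hypothetical shape)."""
    eval_id: str
    answer: dict


def grade(answer: dict, expected: dict) -> bool:
    """Deterministic check: pass only if every expected field matches exactly."""
    return all(answer.get(key) == value for key, value in expected.items())


def pass_rate(attempts: list, answer_key: dict) -> float:
    """Fraction of attempts whose structured answer matches the answer key."""
    graded = [grade(a.answer, answer_key[a.eval_id]) for a in attempts]
    return sum(graded) / len(graded)


# Figures reported in the summary.
CONFIGURATIONS = 16
EVALUATIONS = 100
TRAJECTORIES = 4800

# Inferred, not stated in the source: attempts per (configuration, evaluation).
attempts_per_pair = TRAJECTORIES // (CONFIGURATIONS * EVALUATIONS)  # 3

# Toy data with made-up field names, for illustration only.
answer_key = {"eval-001": {"target": "KINASE_X", "engaged": True}}
attempts = [
    Attempt("eval-001", {"target": "KINASE_X", "engaged": True}),
    Attempt("eval-001", {"target": "KINASE_X", "engaged": False}),
    Attempt("eval-001", {"target": "KINASE_X", "engaged": True}),
]
toy_rate = pass_rate(attempts, answer_key)  # 2 of 3 toy attempts pass
```

Exact-match grading of this kind needs no judge model, so the same trajectory always receives the same score; a real grader would likely also need tolerance rules for numeric fields.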