benchmark

TherapeuticsBench

benchmark · active · provisional · therapeuticsbench-fe17f017 · 1 event · first seen 2d ago

Aliases: TherapeuticsBench

Co-occurring entities

Claude Opus 4.6 · OpenAI · TxBench-PP · GPT-5.5 · Anthropic

More like this (12)

MedMisBench · LiveBench · MedAgentBench · LifeSciBench · T1-Bench · HealthBench · BixBench · AdvBench · ATE-Bench · LabBench · TokenBench · SupraBench

Recent events (1)

6 · arXiv · cs.AI · 2d ago

TxBench-PP: New benchmark reveals AI agents struggle with preclinical pharmacology decisions

Researchers introduce TxBench-PP (TherapeuticsBench Preclinical Pharmacology), a 100-evaluation benchmark testing AI agents on realistic small-molecule drug discovery tasks including mechanism-of-action reasoning, compound-target engagement, and translational efficacy. Agents receive real workflow snapshots and are graded deterministically on structured answers. Across 16 model-harness configurations and 4,800 trajectories, no system reliably succeeded; the best performer, Claude Opus 4.8 with the Pi harness, passed only 59.3% of endpoint attempts. The results suggest current frontier models are not yet deployment-ready for autonomous preclinical pharmacology decision-making.
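The counts quoted in the summary can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes the 4,800 trajectories are spread evenly across every configuration and evaluation, which the summary does not state, and the variable names are made up for this example.

```python
# Figures quoted in the summary above.
evaluations = 100
configurations = 16
trajectories = 4_800

# Assumption (not stated in the source): trajectories are distributed
# evenly over every (configuration, evaluation) pair.
runs_per_pair = trajectories / (configurations * evaluations)
trajectories_per_configuration = trajectories // configurations

print(runs_per_pair)                   # 3.0
print(trajectories_per_configuration)  # 300
```

Under that assumption, each configuration would have attempted each evaluation three times, for 300 trajectories per configuration.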

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · OpenAI · TherapeuticsBench