Almanac
← Events
7arXiv cs.CL (Computation and Language)·3d ago

PseudoBench: Benchmark reveals agentic AI research systems readily produce pseudoscientific outputs

PseudoBench is a new adversarial benchmark evaluating whether agentic auto-research systems can identify and resist pseudoscientific narratives, containing 200 curated claim-evidence pairs across five domains. Testing seven state-of-the-art agents, the authors find near-zero refusal rates and a maximum resistance rate of only 27.4%, meaning current systems readily generate persuasive pseudoscientific reports. A notable finding is that stronger agents package pseudoscience in more sophisticated language, increasing its apparent credibility rather than reducing harm. The authors call for 'scientific alignment' as a prerequisite for deploying autonomous research agents.

Related guides (3)

Related events (8)

6arXiv · cs.AI·15d ago·source ↗

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

7Openai Blog·1mo ago·source ↗

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.

6arXiv · cs.AI·2d ago·source ↗

TxBench-PP: New benchmark reveals AI agents struggle with preclinical pharmacology decisions

Researchers introduce TxBench-PP (TherapeuticsBench Preclinical Pharmacology), a 100-evaluation benchmark testing AI agents on realistic small-molecule drug discovery tasks including mechanism-of-action reasoning, compound-target engagement, and translational efficacy. Agents receive real workflow snapshots and are graded deterministically on structured answers. Across 16 model-harness configurations and 4,800 trajectories, no system reliably succeeded; the best performer, Claude Opus 4.8 with the Pi harness, passed only 59.3% of endpoint attempts. The results suggest current frontier models are not yet deployment-ready for autonomous preclinical pharmacology decision-making.

6arXiv · cs.AI·12d ago·source ↗

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

7arXiv · cs.CL·25d ago·source ↗

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

6Ai Snake Oil·1mo ago·source ↗

New paper: AI agents that matter

A paper from the AI Snake Oil / Normal Tech group critiques current AI agent benchmarking and evaluation practices. The work argues that existing agent benchmarks are poorly designed for assessing real-world utility, and calls for rethinking how agent performance is measured. The commentary targets the gap between benchmark scores and practical deployment value.

8arXiv · cs.AI·10d ago·source ↗

ABC-Bench: Agentic biosecurity benchmark finds LLM agents surpass median expert humans on dual-use biology tasks

Researchers introduce ABC-Bench, a benchmark evaluating LLM agents on biosecurity-relevant biology tasks including liquid-handling robot programming, DNA fragment design, and evasion of DNA synthesis screening. All tested agents outperformed the median expert human baseline across all three tasks. Wet-lab validation confirmed that OpenAI's o4-mini-high produced scripts that successfully assembled DNA on an OpenTrons robot. The results highlight a meaningful shift in the biosecurity risk landscape as AI agents acquire practical wet-lab-adjacent capabilities.

7Openai Blog·1mo ago·source ↗

OpenAI Introduces FrontierScience Benchmark for Scientific Research Tasks

OpenAI has released FrontierScience, a new benchmark designed to evaluate AI reasoning capabilities across physics, chemistry, and biology. The benchmark is intended to measure progress toward AI systems capable of performing real scientific research tasks. This represents OpenAI's effort to establish a rigorous evaluation framework for frontier-level scientific reasoning, going beyond standard academic problem sets.