6·AI Snake Oil·1mo ago

New paper: AI agents that matter

A paper from the AI Snake Oil / Normal Tech group critiques current AI agent benchmarking and evaluation practices. The work argues that existing agent benchmarks are poorly designed for assessing real-world utility, and calls for rethinking how agent performance is measured. The commentary targets the gap between benchmark scores and practical deployment value.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AI agents that matter · Normal Tech · AI Snake Oil

Related guides (2)

Agent and Tool Ecosystem·Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking·Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5·AI Snake Oil·1mo ago

New Paper: Towards a Science of AI Agent Reliability

A new paper proposes a framework for quantifying the gap between AI agent capability and reliability, aiming to establish a more rigorous science of agent dependability. The work addresses the observation that agents may demonstrate high capability on benchmarks while failing unpredictably in deployment. The piece is published via the normaltech.ai newsletter, associated with the AI Snake Oil research commentary tradition.

Evaluation and Benchmarking · AI Safety Research · Towards a Science of AI Agent Reliability · normaltech.ai · AI Snake Oil

4·One Useful Thing·1mo ago

Real AI Agents and Real Work

A commentary piece from One Useful Thing examining the practical deployment of AI agents in real work contexts, framing the tension between human-centered work and AI-generated productivity outputs. The piece appears to analyze how autonomous AI agents are changing knowledge work workflows. Published by a Tier 2 source known for applied AI analysis aimed at practitioners and researchers.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · One Useful Thing

7·OpenAI Blog·1mo ago

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.

Evaluation and Benchmarking · AI Safety Research · OpenAI · PaperBench

4·AI Snake Oil·1mo ago

AI as Normal Technology

A paper by the AI Snake Oil authors argues that AI should be understood as 'normal technology' rather than as something categorically unprecedented, a framing they plan to expand into a book. The piece appears to challenge dominant narratives about AI exceptionalism. The body is minimal, suggesting this is a teaser or announcement for forthcoming work.

AI Safety Research · Regulatory Developments · AI as Normal Technology · normaltech.ai · AI Snake Oil

7·arXiv · cs.CL·26d ago

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains, including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases · Evaluation and Benchmarking · NeurIPS · Auto Benchmark Audit (ABA) · SWE-Bench Verified

4·AI Snake Oil·10d ago

Why AI hasn't replaced software engineers, and won't

A commentary piece from the AI Snake Oil / Normal Tech newsletter argues that coding agents should be understood as normal technology rather than transformative replacements for software engineers. The piece examines why AI has not displaced software engineering roles despite significant capability advances. This is a skeptical industry analysis relevant to ongoing debates about AI's impact on software development labor.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · AI Snake Oil

4·Import AI·1mo ago

Import AI 441: My agents are working. Are yours?

Import AI issue 441 covers developments in AI agents and AI system security, including a discussion of agent reliability and a segment on corrupting AI systems via 'poison fountain' attacks. As a tier-2 newsletter commentary, it synthesizes recent developments across the AI/ML landscape. The dual focus on agent deployment status and adversarial data poisoning reflects two active research and deployment concerns.

AI Safety Research · Agent and Tool Ecosystem · poison fountain attack · Jack Clark · Import AI

4·Hugging Face Blog·1mo ago

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

IBM Research introduces AssetOpsBench, a benchmark designed to evaluate AI agents on industrial asset operations tasks, hosted on Hugging Face. The benchmark targets the gap between existing general-purpose agent benchmarks and real-world industrial deployment scenarios. It provides a playground environment for testing agent capabilities in enterprise/industrial contexts.

Evaluation and Benchmarking · Enterprise Deployment Patterns · IBM Research · AssetOpsBench · Hugging Face