Almanac
5 · AI Snake Oil · 1mo ago

New Paper: Towards a Science of AI Agent Reliability

A new paper proposes a framework for quantifying the gap between AI agent capability and reliability, aiming to establish a more rigorous science of agent dependability. The work addresses the observation that agents may demonstrate high capability on benchmarks while failing unpredictably in deployment. The piece is published via the normaltech.ai newsletter, associated with the AI Snake Oil research commentary tradition.
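As a rough illustration of how capability and reliability can diverge, the sketch below compares two hypothetical measures on made-up data. These are assumptions for illustration only, not the metrics defined in the paper: `capability` is the mean per-run success rate, and `reliability` is the fraction of tasks solved on every repeated run.

```python
def capability(runs):
    """Mean per-run success rate, averaged over tasks (hypothetical measure)."""
    per_task = [sum(outcomes) / len(outcomes) for outcomes in runs.values()]
    return sum(per_task) / len(per_task)


def reliability(runs):
    """Fraction of tasks solved on every repeated run (hypothetical measure)."""
    solved_every_time = [all(outcomes) for outcomes in runs.values()]
    return sum(solved_every_time) / len(solved_every_time)


# Made-up outcomes: 1 = success, 0 = failure, four repeated runs per task.
runs = {
    "task_a": [1, 1, 1, 1],
    "task_b": [1, 0, 1, 1],
    "task_c": [1, 1, 0, 1],
    "task_d": [0, 1, 1, 1],
}

gap = capability(runs) - reliability(runs)
print(f"capability={capability(runs):.4f} reliability={reliability(runs):.4f} gap={gap:.4f}")
```

On this invented data the agent succeeds on most individual runs, yet only one task in four is solved consistently, which is the kind of gap the summary describes.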

Related events (8)

6 · AI Snake Oil · 1mo ago

New paper: AI agents that matter

A paper from the AI Snake Oil / Normal Tech group critiques current AI agent benchmarking and evaluation practices. The work argues that existing agent benchmarks are poorly designed for assessing real-world utility, and calls for rethinking how agent performance is measured. The commentary targets the gap between benchmark scores and practical deployment value.

4 · Import AI · 1mo ago

Import AI 441: My agents are working. Are yours?

Import AI issue 441 covers developments in AI agents and AI system security, including a discussion of agent reliability and a segment on corrupting AI systems via 'poison fountain' attacks. As a tier-2 newsletter commentary, it synthesizes recent developments across the AI/ML landscape. The dual focus on agent deployment status and adversarial data poisoning reflects two active research and deployment concerns.

4 · One Useful Thing · 1mo ago

Giving your AI a Job Interview

This commentary piece argues that as AI-generated advice becomes more consequential, users need systematic methods to evaluate AI reliability and quality—analogous to a job interview process. The author proposes frameworks for assessing AI outputs before trusting them for important decisions. The piece addresses the practical challenge of calibrating trust in AI systems across different use cases.

7 · OpenAI Blog · 1mo ago

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.

4 · One Useful Thing · 1mo ago

Real AI Agents and Real Work

A commentary piece from One Useful Thing examines the practical deployment of AI agents in real work contexts, framing the tension between human-centered work and AI-generated productivity outputs. The piece appears to analyze how autonomous AI agents are changing knowledge work workflows. It is published by a tier-2 source known for applied AI analysis aimed at practitioners and researchers.

4 · AI Snake Oil · 1mo ago

Can AI automate computational reproducibility?

This commentary introduces a new benchmark aimed at measuring AI's ability to automate computational reproducibility in scientific research. The piece examines whether AI systems can reliably re-execute and validate scientific computations, a key bottleneck in research integrity. It frames reproducibility automation as a concrete, measurable capability for evaluating AI's impact on science.

6 · OpenAI Blog · 1mo ago

AI Safety via Debate

OpenAI proposes a safety technique in which two AI agents debate a topic and a human judge determines the winner, with the goal of making it easier for humans to supervise AI systems that may be more capable than they are. The core intuition is that it is easier to verify a correct argument than to generate one, so a dishonest agent can be caught by an honest opponent. The paper introduces debate as a scalable oversight mechanism applicable to complex tasks where direct human evaluation is infeasible.
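The "easier to verify than to generate" intuition can be sketched with a toy example. This is not OpenAI's protocol: the human judge is replaced by a simple function, the task (deciding whether a number is prime) is chosen only because a claimed divisor is cheap to check, and all names (`Argument`, `honest_debater`, `dishonest_debater`, `judge`) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Argument:
    claim: str              # "prime" or "composite"
    witness: Optional[int]  # a claimed divisor backing a "composite" claim


def honest_debater(n: int) -> Argument:
    """Does the expensive work: searches for a divisor of n."""
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return Argument("composite", d)
    return Argument("prime", None)


def dishonest_debater(n: int) -> Argument:
    """Always asserts 'prime', whatever the truth is."""
    return Argument("prime", None)


def judge(n: int, first: Argument, second: Argument) -> str:
    """Cheap check: side with 'composite' only if its witness really divides n."""
    for arg in (first, second):
        w = arg.witness
        if arg.claim == "composite" and w is not None and 1 < w < n and n % w == 0:
            return "composite"
    return "prime"


# 91 = 7 * 13, so the dishonest 'prime' claim is refuted by one division.
verdict = judge(91, honest_debater(91), dishonest_debater(91))
print(verdict)
```

The judge never searches for divisors itself. It only performs one division per argument, which is why an honest debater holding a valid witness can expose the false claim at little cost to the judge.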

3 · Import AI · 1mo ago

Import AI 447: The AGI Economy, AI-Generated Game Testing, and Agent Ecologies

Import AI issue 447 covers speculative analysis of AGI economic structures, including the concept of a 'superintelligence arcology,' alongside coverage of using procedurally generated games to evaluate AI capabilities and discussion of emergent agent ecologies. The newsletter synthesizes recent developments across frontier AI, evaluation methodology, and multi-agent systems. As a tier-2 commentary source, it provides synthesis and framing rather than primary research.