Almanac
← Events
5Hugging Face Blog·1mo ago

The Open Agent Leaderboard

IBM Research and Hugging Face have launched the Open Agent Leaderboard, a public benchmark for evaluating AI agents across standardized tasks. The leaderboard aims to provide transparent, reproducible comparisons of open and proprietary agent systems. This initiative addresses the growing need for rigorous evaluation infrastructure as the agent ecosystem matures.

Related guides (3)

Related events (8)

4Hugging Face Blog·1mo ago·source ↗

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

IBM Research introduces AssetOpsBench, a benchmark designed to evaluate AI agents on industrial asset operations tasks, hosted on Hugging Face. The benchmark targets the gap between existing general-purpose agent benchmarks and real-world industrial deployment scenarios. It provides a playground environment for testing agent capabilities in enterprise/industrial contexts.

5Hugging Face Blog·1mo ago·source ↗

Building the Open Agent Ecosystem Together: Introducing OpenEnv

Hugging Face has announced OpenEnv, an initiative aimed at building an open ecosystem for AI agents. The project appears to focus on standardizing and sharing environments for agent training and evaluation. As a tier-2 source commentary piece, it signals Hugging Face's continued investment in the agent tooling space and open-source agent infrastructure.

6arXiv · cs.AI·15d ago·source ↗

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

5Hugging Face Blog·1mo ago·source ↗

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

This Hugging Face blog post introduces OpenEnv, a framework for evaluating tool-using AI agents in real-world environments. The piece appears to address the challenge of benchmarking agentic systems that interact with external tools and environments, moving beyond static benchmarks toward dynamic, practical evaluation settings. As a tier-2 commentary piece, it likely discusses methodology, design choices, and results from applying OpenEnv to assess agent capabilities.

4Hugging Face Blog·1mo ago·source ↗

Introducing AI vs. AI: A Deep Reinforcement Learning Multi-Agent Competition System

Hugging Face has launched 'AI vs. AI', a competition framework for evaluating deep reinforcement learning agents through head-to-head multi-agent matchups. The system is designed to benchmark RL agents against each other in competitive environments rather than static benchmarks. This represents a new evaluation paradigm for RL research hosted on the Hugging Face platform.

6Openai Blog·1mo ago·source ↗

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

OpenAI introduces MLE-bench, a benchmark designed to measure AI agent performance on machine learning engineering tasks. The benchmark draws from Kaggle competitions to evaluate agents on realistic ML engineering workflows. Initial results show that current agents, including those powered by o1-preview, achieve competitive performance on a subset of tasks but fall well short of top human competitors. The benchmark is intended to track progress in agentic ML capabilities over time.

6Hugging Face Blog·1mo ago·source ↗

Hugging Face Transformers Code Agent Beats GAIA Benchmark

Hugging Face reports that their Transformers-based code agent has achieved a top score on the GAIA benchmark, a challenging evaluation for general AI assistants requiring multi-step reasoning and tool use. The result positions Hugging Face's open agent framework competitively against proprietary systems. The post details the agent architecture and tooling approach used to achieve the result.

7Openai Blog·1mo ago·source ↗

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.