4 · Hugging Face Blog · 1mo ago

Trace & Evaluate your Agent with Arize Phoenix

This Hugging Face blog post describes integrating Arize Phoenix with the smolagents framework to enable tracing and evaluation of AI agents. The post covers how to instrument agent runs, capture traces, and assess agent behavior using Phoenix's observability tooling. It targets developers building and debugging agentic pipelines who need visibility into multi-step reasoning and tool use.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Arize Phoenix · Hugging Face · smolagents · Arize AI
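To make the tracing idea in the item above concrete, here is a minimal, standard-library-only sketch of recording nested spans for an agent run. It is not Phoenix's or smolagents' API: the `Tracer` and `Span` classes, the span names, and the toy `run_agent` function are all hypothetical, and a real setup would export spans to an observability backend instead of keeping them in a list.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    """One timed unit of work, e.g. an agent step or a tool call."""
    name: str
    parent: Optional[str]
    start: float
    end: float = 0.0
    attributes: dict = field(default_factory=dict)


class Tracer:
    """Hypothetical in-memory tracer (assumption: not a real library class)."""

    def __init__(self) -> None:
        self.spans: List[Span] = []    # finished spans, in completion order
        self._stack: List[Span] = []   # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, **attributes) -> Iterator[Span]:
        parent = self._stack[-1].name if self._stack else None
        current = Span(name=name, parent=parent,
                       start=time.perf_counter(), attributes=dict(attributes))
        self._stack.append(current)
        try:
            yield current
        finally:
            # Close the span even if the wrapped step raises.
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)


def run_agent(tracer: Tracer, question: str) -> str:
    """Toy two-step agent: one stubbed tool call, then a final answer."""
    with tracer.span("agent.run", input=question):
        with tracer.span("tool.search", query=question) as tool_span:
            result = "stub search result"  # stand-in for a real tool
            tool_span.attributes["output"] = result
        with tracer.span("agent.final_answer") as answer_span:
            answer = f"Answer based on: {result}"
            answer_span.attributes["output"] = answer
    return answer


tracer = Tracer()
answer = run_agent(tracer, "What is tracing?")
for s in tracer.spans:
    print(s.name, "parent =", s.parent)
```

Because each span records its parent, the flat list can be rebuilt into a tree of steps and tool calls, which is the kind of structure a trace viewer displays and an evaluator can inspect.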

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · Hugging Face Blog · 1mo ago

Gaia2 and ARE: Empowering the community to study agents

Hugging Face has released Gaia2 and the Agents Research Environments (ARE) framework, aimed at enabling the research community to study and benchmark AI agents. The post describes new tools and datasets for evaluating agent capabilities, building on the original GAIA benchmark. The release expands the agent evaluation ecosystem with community-oriented tooling.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GAIA2 · GAIA · Hugging Face · +1 more

6 · arXiv · cs.AI · 9d ago

AgentBeats: Standardized Agent Evaluation via A2A and MCP Protocols

A new arXiv preprint proposes Agentified Agent Assessment (AAA), a framework where evaluation is performed by judge agents interacting through standardized protocols—A2A for task management and MCP for tool access—rather than bespoke benchmark harnesses. The authors introduce AgentBeats as a concrete implementation, validated through a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, plus a coding-agent case study. The work addresses fragmentation in agent evaluation by decoupling assessment logic from agent implementation, enabling reproducible and interoperable benchmarking.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility · AgentBeats · MCP · +1 more

6 · Hugging Face Blog · 1mo ago

Hugging Face Transformers Code Agent Beats GAIA Benchmark

Hugging Face reports that its Transformers-based code agent has achieved a top score on the GAIA benchmark, a challenging evaluation for general AI assistants that requires multi-step reasoning and tool use. The result positions Hugging Face's open agent framework competitively against proprietary systems. The post details the agent architecture and tooling approach used to achieve the result.

Evaluation and Benchmarking · Open Weights Progress · Transformers Code Agent · GAIA · Hugging Face · +1 more

4 · Hugging Face Blog · 1mo ago

Tiny Agents: an MCP-powered agent in 50 lines of code

Hugging Face published a blog post demonstrating how to build a minimal AI agent using the Model Context Protocol (MCP) in approximately 50 lines of code. The post showcases how MCP enables agents to discover and invoke tools dynamically, reducing the boilerplate required for agentic workflows. This serves as both a tutorial and a commentary on MCP's role in simplifying agent-tool integration in the current ecosystem.

Agent and Tool Ecosystem · Hugging Face · Tiny Agents · Model Context Protocol
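The discover-then-invoke pattern described in the Tiny Agents item can be sketched in a few lines. This is an illustration only, not the MCP SDK or the code from the post: `ToolServer`, `fake_model`, and the two arithmetic tools are hypothetical, everything runs in one process with no network transport, and the "model" is a keyword matcher with fixed arguments standing in for an LLM.

```python
from typing import Any, Callable, Dict, List


class ToolServer:
    """Hypothetical in-process stand-in for a server that advertises tools."""

    def __init__(self) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, description: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = {"description": description, "fn": fn}

    def list_tools(self) -> List[Dict[str, str]]:
        # Discovery: the client learns tool names and descriptions at runtime.
        return [{"name": n, "description": t["description"]}
                for n, t in self._tools.items()]

    def call_tool(self, name: str, arguments: Dict[str, Any]) -> Any:
        # Invocation: the client calls a tool by name with keyword arguments.
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name]["fn"](**arguments)


def fake_model(task: str, tools: List[Dict[str, str]]) -> Dict[str, Any]:
    """Stand-in for an LLM: picks the first tool whose name appears in the task.

    The arguments are hard-coded for the demo; a real model would produce them.
    """
    for tool in tools:
        if tool["name"] in task:
            return {"tool": tool["name"], "arguments": {"a": 2, "b": 3}}
    return {"tool": None, "arguments": {}}


def run_agent(task: str, server: ToolServer) -> Any:
    tools = server.list_tools()          # 1. discover what is available
    decision = fake_model(task, tools)   # 2. let the "model" choose a tool
    if decision["tool"] is None:
        return "no suitable tool"
    return server.call_tool(decision["tool"], decision["arguments"])  # 3. invoke


server = ToolServer()
server.register("add", "Add two integers a and b.", lambda a, b: a + b)
server.register("multiply", "Multiply two integers a and b.", lambda a, b: a * b)

print(run_agent("please multiply the numbers", server))
print(run_agent("please add the numbers", server))
```

The agent code never names a specific tool; it only depends on the `list_tools` and `call_tool` interface, so registering a new tool on the server side requires no change to `run_agent`.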

7 · Mistral AI News · 1mo ago

Mistral AI Launches Agents API with Built-in Connectors, MCP Tools, and Persistent Memory

Mistral AI has released a dedicated Agents API that extends beyond chat completion by providing built-in connectors for code execution, web search, image generation, and document retrieval, alongside support for Model Context Protocol (MCP) tools. The API features stateful conversation management with branching, streaming output, and multi-agent orchestration capabilities. Reported benchmark results show large gains from web search: on SimpleQA, Mistral Large rises from 23% to 75% and Mistral Medium from 22% to 82% with search enabled. The release targets enterprise-grade agentic workflows and is accompanied by cookbooks covering GitHub coding assistants, financial analysis, and travel planning use cases.

Frontier Model Releases · Inference Economics · Mistral AI · GitHub · Devstral 2 · +9 more

4 · Hugging Face Blog · 1mo ago

AI Agents Are Here. What Now?

A Hugging Face Ethics and Society blog post examines the current state of AI agents and the ethical, safety, and societal questions they raise. The piece likely covers concerns around autonomous decision-making, accountability, and deployment risks as agentic systems become more prevalent. Published in January 2025, it reflects growing institutional attention to agent-specific risks beyond general AI safety.

AI Safety Research · Agent and Tool Ecosystem · AI Agents · Hugging Face Ethics and Society Team · Hugging Face

4 · Hugging Face Blog · 1mo ago

Introducing Agents.js: Give tools to your LLMs using JavaScript

Hugging Face released Agents.js, a JavaScript library that enables developers to equip large language models with tools and build agent workflows in a JS/TS environment. The library brings tool-use and agent orchestration capabilities—previously more common in Python ecosystems—to the JavaScript developer community. It integrates with Hugging Face's model hub and inference APIs.

Agent and Tool Ecosystem · JavaScript · Agents.js · Hugging Face

5 · Hugging Face Blog · 1mo ago

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

This Hugging Face blog post introduces OpenEnv, a framework for evaluating tool-using AI agents in real-world environments. The piece appears to address the challenge of benchmarking agentic systems that interact with external tools and environments, moving beyond static benchmarks toward dynamic, practical evaluation settings. It likely discusses methodology, design choices, and results from applying OpenEnv to assess agent capabilities.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Hugging Face · OpenEnv