Trace & Evaluate your Agent with Arize Phoenix
This Hugging Face blog post describes integrating Arize Phoenix with the smolagents framework to enable tracing and evaluation of AI agents. The post covers how to instrument agent runs, capture traces, and assess agent behavior using Phoenix's observability tooling. It targets developers building and debugging agentic pipelines who need visibility into multi-step reasoning and tool use.
Related events (8)
Gaia2 and ARE: Empowering the community to study agents
A Hugging Face blog post introduces Gaia2 and the Agents Research Environments (ARE) framework, aimed at enabling the research community to study and benchmark AI agents. The post describes new tools and datasets for evaluating agent capabilities, building on the original GAIA benchmark. This represents an expansion of the agent evaluation ecosystem with community-oriented tooling.
AgentBeats: Standardized Agent Evaluation via A2A and MCP Protocols
A new arXiv preprint proposes Agentified Agent Assessment (AAA), a framework where evaluation is performed by judge agents interacting through standardized protocols—A2A for task management and MCP for tool access—rather than bespoke benchmark harnesses. The authors introduce AgentBeats as a concrete implementation, validated through a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, plus a coding-agent case study. The work addresses fragmentation in agent evaluation by decoupling assessment logic from agent implementation, enabling reproducible and interoperable benchmarking.
Hugging Face Transformers Code Agent Beats GAIA Benchmark
Hugging Face reports that its Transformers-based code agent has achieved a top score on the GAIA benchmark, a challenging evaluation for general AI assistants that requires multi-step reasoning and tool use. The result positions Hugging Face's open agent framework competitively against proprietary systems. The post details the agent architecture and tooling approach used to achieve the result.
Tiny Agents: an MCP-powered agent in 50 lines of code
Hugging Face published a blog post demonstrating how to build a minimal AI agent using the Model Context Protocol (MCP) in approximately 50 lines of code. The post showcases how MCP enables agents to discover and invoke tools dynamically, reducing the boilerplate required for agentic workflows. This serves as both a tutorial and a commentary on MCP's role in simplifying agent-tool integration in the current ecosystem.
Mistral AI Launches Agents API with Built-in Connectors, MCP Tools, and Persistent Memory
Mistral AI has released a dedicated Agents API that extends beyond chat completion by providing built-in connectors for code execution, web search, image generation, and document retrieval, alongside support for Model Context Protocol (MCP) tools. The API features stateful conversation management with branching, streaming output, and multi-agent orchestration capabilities. Reported benchmark results show large gains from enabling web search: on SimpleQA, Mistral Large rises from 23% to 75% and Mistral Medium from 22% to 82%. The release targets enterprise-grade agentic workflows and is accompanied by cookbooks covering GitHub coding assistants, financial analysis, and travel planning use cases.
AI Agents Are Here. What Now?
A Hugging Face Ethics and Society blog post examines the current state of AI agents and the ethical, safety, and societal questions they raise. The piece likely covers concerns around autonomous decision-making, accountability, and deployment risks as agentic systems become more prevalent. Published in January 2025, it reflects growing institutional attention to agent-specific risks beyond general AI safety.
Introducing Agents.js: Give tools to your LLMs using JavaScript
Hugging Face released Agents.js, a JavaScript library that enables developers to equip large language models with tools and build agent workflows in a JS/TS environment. The library brings tool-use and agent orchestration capabilities—previously more common in Python ecosystems—to the JavaScript developer community. It integrates with Hugging Face's model hub and inference APIs.
OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments
This Hugging Face blog post introduces OpenEnv, a framework for evaluating tool-using AI agents in real-world environments. The piece addresses the challenge of benchmarking agentic systems that interact with external tools and environments, moving beyond static benchmarks toward dynamic, practical evaluation settings. It likely discusses methodology, design choices, and results from applying OpenEnv to assess agent capabilities.


