IBM Research

organization · active · ibm-research-3e0c08ff · 9 events · first seen May 18, 2026

Aliases: IBM Research

Co-occurring entities

Hugging Face · CUGA · ScarfBench · Artificial Analysis · ITBench-AA · AssetOpsBench · Open Agent Leaderboard · UC Berkeley · IT-Bench · VAKRA

More like this (12)

Microsoft Research · IBM · Berkeley Artificial Intelligence Research · Facebook AI Research · Machine Intelligence Research Institute · Berkeley AI Research (BAIR) · National AI Research Lab · Salesforce Research · Apollo Research · Apart Research · Allen Institute for AI · National AI Research Resource

Recent events (9)

4 · Hugging Face Blog · Jul 15, 2026

IBM Research on the hidden complexity of model routing in production AI systems

IBM Research published a blog post on Hugging Face exploring the practical challenges of model routing — the problem of directing inference requests to the most appropriate model given cost, latency, and capability tradeoffs. The piece argues that while routing appears straightforward in principle, production deployments surface significant complexity. This is relevant to practitioners building multi-model inference pipelines and cost-optimization strategies.

Inference Economics · Enterprise Deployment Patterns · IBM Research · Hugging Face

5 · Hugging Face Blog · Jun 30, 2026

IBM Research introduces ScarfBench for evaluating AI agents on enterprise Java framework migration

IBM Research published ScarfBench, a benchmark designed to evaluate AI agents on the task of migrating enterprise Java frameworks. The benchmark targets a concrete, high-value software engineering task relevant to large-scale enterprise codebases. It appears on the Hugging Face blog, suggesting an open release or community-facing artifact.

Evaluation and Benchmarking · Agent and Tool Ecosystem · IBM Research · ScarfBench · Hugging Face

4 · Hugging Face Blog · Jun 23, 2026

IBM Research releases CUGA harness with two dozen agentic app examples

IBM Research published a Hugging Face blog post introducing CUGA, a lightweight harness for building agentic applications, accompanied by approximately two dozen working example implementations. The post appears to be a practical demonstration of agent tooling patterns rather than a model release. This is relevant to the agent-tool ecosystem as it provides concrete reference implementations for practitioners building agentic systems.

Agent and Tool Ecosystem · IBM Research · CUGA

5 · Hugging Face Blog · May 27, 2026

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks

IBM Research and Artificial Analysis have released ITBench-AA, a benchmark targeting agentic AI performance on enterprise IT operations tasks. Frontier models evaluated on the benchmark score below 50%, indicating significant capability gaps in real-world IT automation scenarios. The benchmark appears to be the first of its kind focused specifically on agentic enterprise IT workflows, covering tasks relevant to site reliability engineering and IT operations.

Evaluation and Benchmarking · Enterprise Deployment Patterns · IBM Research · Artificial Analysis · ITBench-AA · +2 more

4 · Hugging Face Blog · May 19, 2026

CUGA on Hugging Face: Democratizing Configurable AI Agents

IBM Research has released CUGA (Configurable Universal Generative Agent) on Hugging Face, positioning it as a framework for building configurable AI agents. The announcement appears on the Hugging Face blog as a tier-2 commentary piece from IBM Research. The source text gives no details on architecture, benchmarks, or specific capabilities.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · IBM Research · Hugging Face · CUGA

4 · Hugging Face Blog · May 19, 2026

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

IBM Research introduces AssetOpsBench, a benchmark designed to evaluate AI agents on industrial asset operations tasks, hosted on Hugging Face. The benchmark targets the gap between existing general-purpose agent benchmarks and real-world industrial deployment scenarios. It provides a playground environment for testing agent capabilities in enterprise/industrial contexts.

Evaluation and Benchmarking · Enterprise Deployment Patterns · IBM Research · AssetOpsBench · Hugging Face · +1 more

5 · Hugging Face Blog · May 18, 2026

The Open Agent Leaderboard

IBM Research and Hugging Face have launched the Open Agent Leaderboard, a public benchmark for evaluating AI agents across standardized tasks. The leaderboard aims to provide transparent, reproducible comparisons of open and proprietary agent systems. This initiative addresses the growing need for rigorous evaluation infrastructure as the agent ecosystem matures.

Evaluation and Benchmarking · Agent and Tool Ecosystem · IBM Research · Hugging Face · Open Agent Leaderboard

5 · Hugging Face Blog · May 18, 2026

IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

IBM Research and UC Berkeley have released IT-Bench and MAST, a benchmark suite and diagnostic framework aimed at evaluating why AI agents fail in enterprise IT environments. The work targets realistic IT operations tasks such as incident response, service management, and infrastructure automation. By categorizing failure modes systematically, MAST provides a structured taxonomy for understanding agent shortcomings beyond simple pass/fail metrics. This addresses a gap in enterprise-focused agent evaluation, where general benchmarks often fail to capture domain-specific complexity.

IBM Research · UC Berkeley · IT-Bench

5 · Hugging Face Blog · May 18, 2026

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

IBM Research presents an analysis of VAKRA, a benchmark designed to evaluate agentic AI systems on reasoning and tool use capabilities. The post examines how agents fail across different task categories, surfacing systematic failure modes in multi-step reasoning and tool invocation. The analysis provides diagnostic insights into where current agent architectures break down under realistic task conditions.

Evaluation and Benchmarking · AI Safety Research · IBM Research · Hugging Face · VAKRA · +1 more