VAKRA
benchmark · active
vakra-3a9922fc · 1 event · first seen 1mo ago
Aliases: VAKRA
Recent events (1)
Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
IBM Research presents an analysis of VAKRA, a benchmark designed to evaluate the reasoning and tool-use capabilities of agentic AI systems. The post examines how agents fail across different task categories, surfacing systematic failure modes in multi-step reasoning and tool invocation, and offers diagnostic insight into where current agent architectures break down under realistic task conditions.