Almanac
5 · arXiv cs.AI (Artificial Intelligence) · 22d ago

RoboWits: Benchmark for Robotic Creative Problem Solving Under Unexpected Conditions

RoboWits is a new bi-manual robotic benchmark designed to evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions in robotics. The authors introduce an automated multi-agent task generation pipeline that produces 30 seed tasks and 208 mutated tasks spanning geometry, material, and assembly-based reasoning. Benchmarking results show that pre-trained Vision-Language-Action models (VLAs) achieve limited success on seed tasks after fine-tuning but fail on mutated variants, exposing brittleness in reasoning and strategy adaptation. The benchmark highlights a significant gap between skill-level execution and genuine cognitive reasoning in current robotic systems.

Related events (8)

6 · arXiv · cs.AI · 15d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

6 · arXiv · cs.CL · 11d ago

SpatialWorld benchmark evaluates interactive spatial reasoning of multimodal agents in real-world tasks

Researchers introduce SpatialWorld, a benchmark for evaluating interactive spatial understanding of multimodal agents across 760 human-annotated tasks spanning household, travel, and social domains. The benchmark integrates eight simulation backends under a shared protocol, requiring agents to operate under vision-only partial observability with egocentric inputs. Evaluation of 15 agents reveals that even the strongest model, GPT-5, achieves only a 17.4% task success rate, exposing significant gaps in active exploration and long-horizon planning. The work highlights a mismatch between task success and execution efficiency as a key bottleneck for spatial agents.

5 · Hugging Face Blog · 1mo ago

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

IBM Research presents an analysis of VAKRA, a benchmark designed to evaluate agentic AI systems on reasoning and tool use capabilities. The post examines how agents fail across different task categories, surfacing systematic failure modes in multi-step reasoning and tool invocation. The analysis provides diagnostic insights into where current agent architectures break down under realistic task conditions.

7 · arXiv · cs.CL · 25d ago

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains, including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

7 · OpenAI Blog · 1mo ago

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.

4 · arXiv · cs.CL · 10d ago

T1-Bench: Multi-scenario agent benchmark across 25 real-world domains

T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.

6 · arXiv · cs.CL · 8d ago

LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics

Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address the data scarcity problem, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe combining FAST action token pretraining on a Qwen3-VL-4B-Instruct backbone with flow matching posttraining via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.

7 · arXiv · cs.AI · 16d ago

AutoLab benchmark evaluates frontier models on ultra long-horizon iterative research and engineering tasks

AutoLab is a new benchmark of 36 expert-curated tasks across system optimization, puzzle-solving, model development, and CUDA kernel optimization, designed to test agents on sustained closed-loop improvement under wall-clock budgets rather than single-turn or short-horizon settings. Evaluation of 17 frontier models finds that persistence in iterative benchmarking and feedback incorporation, rather than initial attempt quality, is the dominant success predictor. Claude Opus 4.6 stands out as the strongest performer, while most models, including proprietary ones, either terminate early or exhaust their budgets with minimal progress. The benchmark, harness, and task artifacts are open-sourced.