Entity · benchmark

multi-turn agent benchmarks

benchmark · active · multi-turn-agent-benchmarks-8a1b610c · 1 event · first seen May 22, 2026

Aliases: multi-turn agent benchmarks

Co-occurring entities

tool-calling agents · SynAE · synthetic data evaluation

More like this (12)

Super-Agent benchmark · multi-level agent evaluation · Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback · Benchmark Agent · multi-agent cooperative framework · Do Agent Optimizers Compound? A Continual-Learning Evaluation on Terminal-Bench 2.0 · Vals AI Finance Agent Benchmark · multi-agent systematizer · agent-to-agent evaluation protocol · multi-turn language models · Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability · Reward Modeling for Multi-Agent Orchestration

Recent events (1)

arXiv · cs.CL · May 22, 2026

SynAE: Framework for Evaluating Synthetic Data Quality in Tool-Calling Agent Benchmarks

SynAE is a proposed framework for measuring how well synthetic datasets replicate and augment real trajectories used to test multi-turn, tool-calling agents. It assesses validity, fidelity, and diversity across four metric categories: task instructions, tool calls, final outputs, and downstream evaluation. The paper shows that no single metric suffices to characterize synthetic data quality, which motivates evaluating along multiple axes. A demo and code are publicly available.
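Why a single metric can mislead is easiest to see with a toy example. The sketch below is not SynAE's implementation: the tool schema, the trajectory format (a list of `{"name", "args"}` dicts), and the three scoring functions (schema-based validity, fidelity as one minus the total variation distance between tool-name distributions, and diversity as the ratio of distinct tool-call sequences) are hypothetical stand-ins, and they cover only the tool-call category.

```python
from collections import Counter

# Hypothetical tool schema: tool name -> (required params, allowed params).
SCHEMA = {
    "search_flights": ({"origin", "dest"}, {"origin", "dest", "date"}),
    "book_flight": ({"flight_id"}, {"flight_id", "seat"}),
    "get_weather": ({"city"}, {"city", "units"}),
}

def call(name, **args):
    """Build one hypothetical tool-call record."""
    return {"name": name, "args": args}

def call_is_valid(c):
    """Valid if the tool exists and its argument keys fit the schema."""
    if c["name"] not in SCHEMA:
        return False
    required, allowed = SCHEMA[c["name"]]
    return required <= set(c["args"]) <= allowed

def validity(trajs):
    """Fraction of all tool calls that are schema-valid."""
    calls = [c for t in trajs for c in t]
    return sum(call_is_valid(c) for c in calls) / len(calls) if calls else 0.0

def tool_distribution(trajs):
    """Relative frequency of each tool name across all trajectories."""
    counts = Counter(c["name"] for t in trajs for c in t)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def fidelity(real, synth):
    """1 minus total variation distance between tool-name distributions."""
    p, q = tool_distribution(real), tool_distribution(synth)
    names = set(p) | set(q)
    tvd = 0.5 * sum(abs(p.get(n, 0.0) - q.get(n, 0.0)) for n in names)
    return 1.0 - tvd

def diversity(trajs):
    """Distinct tool-name sequences divided by the number of trajectories."""
    seqs = {tuple(c["name"] for c in t) for t in trajs}
    return len(seqs) / len(trajs) if trajs else 0.0

S = call("search_flights", origin="SFO", dest="JFK")
B = call("book_flight", flight_id="UA1")
W = call("get_weather", city="NYC")
B_BAD = call("book_flight", seat="12A")  # missing required flight_id

real = [[S, B], [W, S]]
# Three identical trajectories whose tool mix matches the real data exactly.
synth_copy = [[S, W, S, B] for _ in range(3)]
# Varied trajectories, one of which contains an invalid call.
synth_varied = [[S, B_BAD], [W], [S, W]]

if __name__ == "__main__":
    for label, s in [("copy", synth_copy), ("varied", synth_varied)]:
        print(label,
              round(validity(s), 2),
              round(fidelity(real, s), 2),
              round(diversity(s), 2))
```

In this toy setup the copied set scores 1.0 on validity and fidelity but only 1/3 on diversity, while the varied set reaches full diversity at the cost of validity (0.8) and fidelity (0.85). Each metric alone would rank the two sets differently, which is the kind of disagreement that motivates multi-axis evaluation.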

Evaluation and Benchmarking · Agent and Tool Ecosystem · multi-turn agent benchmarks · tool-calling agents · SynAE