multi-turn agent benchmarks
benchmark · active
multi-turn-agent-benchmarks-8a1b610c · 1 event · first seen 25d ago
Aliases: multi-turn agent benchmarks
More like this (12)
Super-Agent benchmark
multi-level agent evaluation
Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
Benchmark Agent
multi-agent cooperative framework
Vals AI Finance Agent Benchmark
multi-agent systematizer
agent-to-agent evaluation protocol
multi-turn language models
Reward Modeling for Multi-Agent Orchestration
MemoryAgentBench
Legal Agent Benchmark
Recent events (1)
SynAE: Framework for Evaluating Synthetic Data Quality in Tool-Calling Agent Benchmarks
SynAE is a proposed evaluation framework for measuring how well synthetic datasets replicate and augment real data trajectories for multi-turn, tool-calling agent testing. It assesses validity, fidelity, and diversity across four metric categories: task instructions, tool calls, final outputs, and downstream evaluation. The paper demonstrates that no single metric suffices to characterize synthetic data quality, motivating multi-axis evaluation. A demo and code are publicly available.
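To make the multi-axis idea concrete, here is a minimal sketch of scoring a synthetic set of tool-calling trajectories against a real one along three axes. It is not SynAE's implementation: the trajectory representation (a list of tool names), the function names, and the specific metrics (known-tool rate for validity, Jensen-Shannon similarity of tool usage for fidelity, distinct-sequence ratio for diversity) are all assumptions chosen for illustration.

```python
from collections import Counter
from math import log2

# Hypothetical simplification: a trajectory is just the ordered list of tool names called.
Trajectory = list[str]


def tool_distribution(trajectories):
    """Relative frequency of each tool across all trajectories."""
    counts = Counter(tool for traj in trajectories for tool in traj)
    total = sum(counts.values())
    return {tool: c / total for tool, c in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); terms with a[k] == 0 contribute nothing.
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def validity(synthetic, known_tools):
    """Assumed validity proxy: share of synthetic tool calls that name a known tool."""
    calls = [tool for traj in synthetic for tool in traj]
    return sum(tool in known_tools for tool in calls) / len(calls) if calls else 0.0


def diversity(synthetic):
    """Assumed diversity proxy: distinct tool sequences divided by total trajectories."""
    return len({tuple(traj) for traj in synthetic}) / len(synthetic) if synthetic else 0.0


def evaluate(real, synthetic, known_tools):
    """Report all three axes side by side instead of collapsing them into one score."""
    fidelity = 1.0 - js_divergence(tool_distribution(real), tool_distribution(synthetic))
    return {
        "validity": validity(synthetic, known_tools),
        "fidelity": fidelity,
        "diversity": diversity(synthetic),
    }


if __name__ == "__main__":
    real = [["search", "book"], ["search", "pay"]]
    # A degenerate synthetic set: every call is valid, but all trajectories are the same.
    synthetic = [["search", "book"]] * 4
    print(evaluate(real, synthetic, known_tools={"search", "book", "pay"}))
```

In the degenerate example, validity stays at 1.0 while diversity drops to 0.25, which is the kind of disagreement between axes that motivates reporting several metrics rather than one.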