Almanac
benchmark

multi-turn agent benchmarks

benchmark · active · multi-turn-agent-benchmarks-8a1b610c · 1 event · first seen 25d ago

Aliases: multi-turn agent benchmarks

Recent events (1)

5 · arXiv · cs.CL · 25d ago

SynAE: Framework for Evaluating Synthetic Data Quality in Tool-Calling Agent Benchmarks

SynAE is a proposed evaluation framework for measuring how well synthetic datasets replicate and augment real data trajectories for multi-turn, tool-calling agent testing. It assesses validity, fidelity, and diversity across four metric categories: task instructions, tool calls, final outputs, and downstream evaluation. The paper demonstrates that no single metric suffices to characterize synthetic data quality, motivating multi-axis evaluation. A demo and code are publicly available.
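To make the multi-axis idea concrete, here is a minimal sketch in Python. It is not SynAE's implementation: the three functions, the reduction of a trajectory to a list of tool names, and the toy data are all assumptions chosen for illustration, and they cover only the tool-call category. The sketch scores a synthetic set on validity (calls name a known tool), fidelity (tool-usage distribution matches the real data, measured here as one minus total variation distance), and diversity (share of distinct call sequences).

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical simplification: a trajectory is just its sequence of tool names.
Trajectory = List[str]


def validity(synth: List[Trajectory], known_tools: Set[str]) -> float:
    """Fraction of synthetic tool calls that name a known tool."""
    calls = [tool for traj in synth for tool in traj]
    if not calls:
        return 0.0
    return sum(tool in known_tools for tool in calls) / len(calls)


def tool_distribution(trajs: List[Trajectory]) -> Dict[str, float]:
    """Relative frequency of each tool across all trajectories."""
    counts = Counter(tool for traj in trajs for tool in traj)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}


def fidelity(real: List[Trajectory], synth: List[Trajectory]) -> float:
    """One minus the total variation distance between tool-usage distributions."""
    p, q = tool_distribution(real), tool_distribution(synth)
    tv = 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in set(p) | set(q))
    return 1.0 - tv


def diversity(synth: List[Trajectory]) -> float:
    """Fraction of synthetic trajectories with a distinct tool-call sequence."""
    if not synth:
        return 0.0
    return len({tuple(traj) for traj in synth}) / len(synth)


def report(real: List[Trajectory], synth: List[Trajectory],
           known_tools: Set[str]) -> Dict[str, float]:
    """Collect all three axes so they can be read side by side."""
    return {
        "validity": validity(synth, known_tools),
        "fidelity": fidelity(real, synth),
        "diversity": diversity(synth),
    }


# Toy data (invented for this example, not taken from the paper).
real = [["search", "book"], ["search", "cancel"]]
synth = [["search", "book"], ["search", "book"]]
scores = report(real, synth, {"search", "book", "cancel"})
print(scores)
```

On this toy input, every synthetic call is valid, yet the synthetic set never uses `cancel` and repeats a single sequence, so fidelity and diversity are both below 1. That is the kind of gap a single score would hide.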