6 · arXiv cs.AI (Artificial Intelligence) · 11d ago

EvalCards: A unified reporting layer for AI evaluation results with interpretive signals

Researchers introduce EvalCards, an operational schema and tooling layer that composes benchmark metadata, evaluation run data, and model metadata into a unified, interpretable record for AI evaluation reporting. The system derives a reporting schema from 52 papers and 10 stakeholder interviews, implements four interpretive signals (reproducibility, documentation completeness, provenance/risk, score comparability), and deploys a monitoring tool across 5,816 models, 635 benchmarks, and 101,843 results. The work targets the widespread inconsistency in how evaluation results are reported across leaderboards, model cards, and company blogs, which makes cross-source comparison unreliable. By providing extraction infrastructure rather than only a proposal, it addresses a structural gap in the evaluation ecosystem.

Evaluation and Benchmarking · AI Safety Research · Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting · EvalCards
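As a rough illustration of the composition step described in the EvalCards summary, the sketch below merges model, benchmark, and run metadata into one record and computes a single signal, documentation completeness. Everything here is an assumption made for illustration: the `EvalRecord` class, the field names, the `REQUIRED_FIELDS` list, and the scoring rule are hypothetical and are not taken from the EvalCards schema or tooling.

```python
from dataclasses import dataclass, field


@dataclass
class EvalRecord:
    """Hypothetical unified record; not the actual EvalCards schema."""
    model: dict
    benchmark: dict
    run: dict
    signals: dict = field(default_factory=dict)


# Fields this sketch treats as required documentation (an assumption).
REQUIRED_FIELDS = {
    "model": ["name", "developer"],
    "benchmark": ["name", "version", "metric"],
    "run": ["score", "source", "harness", "prompt_template"],
}


def documentation_completeness(record: EvalRecord) -> float:
    """Return the fraction of required fields that are present and non-empty."""
    total = 0
    filled = 0
    for part, keys in REQUIRED_FIELDS.items():
        values = getattr(record, part)
        for key in keys:
            total += 1
            if values.get(key) not in (None, ""):
                filled += 1
    return filled / total


def compose(model: dict, benchmark: dict, run: dict) -> EvalRecord:
    """Combine the three metadata sources and attach the derived signal."""
    record = EvalRecord(model=model, benchmark=benchmark, run=run)
    record.signals["documentation_completeness"] = documentation_completeness(record)
    return record


# Made-up example values: the run omits its harness and prompt template.
card = compose(
    model={"name": "example-model", "developer": "Example Lab"},
    benchmark={"name": "example-bench", "version": "1.0", "metric": "accuracy"},
    run={"score": 0.71, "source": "leaderboard", "harness": None, "prompt_template": ""},
)
print(card.signals)
```

In this toy case seven of the nine required fields are filled, so the signal flags the missing harness and prompt template rather than hiding them behind the score.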

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 5d ago

Every Eval Ever: unified schema and community repository for AI evaluation results

Researchers introduce Every Eval Ever, a shared schema and crowdsourced repository designed to standardize AI evaluation results across incompatible formats, frameworks, and sources. The system ingests results from evaluation harnesses, papers, leaderboards, and custom repositories into a single JSON document format, with optional per-instance output storage. The repository, hosted on Hugging Face, currently covers 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats. The work addresses a persistent infrastructure problem in AI evaluation science: divergent scores for nominally identical evaluations and scattered, incomparable metadata.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Hugging Face · Every Eval Ever
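To make the idea of ingesting incompatible formats into a single JSON document concrete, here is a minimal sketch. The two input shapes, the adapter functions `from_harness` and `from_leaderboard`, and the output field names are all hypothetical; they do not reproduce the Every Eval Ever schema, which is not described in this summary.

```python
import json


def from_harness(raw: dict) -> dict:
    """Adapt a hypothetical harness output that reports accuracy in [0, 1]."""
    return {
        "model": raw["model"],
        "benchmark": raw["task"],
        "metric": "accuracy",
        "score": raw["acc"],
        "source_type": "harness",
    }


def from_leaderboard(row: dict) -> dict:
    """Adapt a hypothetical leaderboard row that reports a percentage."""
    return {
        "model": row["Model"],
        "benchmark": row["Benchmark"],
        "metric": "accuracy",
        "score": row["Score (%)"] / 100.0,  # rescale to [0, 1]
        "source_type": "leaderboard",
    }


# Made-up inputs: the same model and benchmark reported by two sources.
records = [
    from_harness({"model": "example-model", "task": "example-bench", "acc": 0.71}),
    from_leaderboard({"Model": "example-model", "Benchmark": "example-bench", "Score (%)": 68.0}),
]

# One JSON document in a shared shape, regardless of where each result came from.
document = json.dumps(records, indent=2, sort_keys=True)
print(document)

# With a shared shape, divergent scores for the same evaluation become easy to spot.
gap = abs(records[0]["score"] - records[1]["score"])
print(f"score gap between sources: {gap:.2f}")
```

Keeping `source_type` on every record is one way to preserve where a number came from, so a gap between sources can be traced rather than averaged away.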

5 · Hugging Face Blog · 1mo ago

Community Evals: Because we're done trusting black-box leaderboards over the community

Hugging Face introduces Community Evals, a framework aimed at replacing or supplementing opaque black-box leaderboards with community-driven model evaluations. The initiative reflects growing skepticism about the reliability and transparency of existing benchmark leaderboards. By crowdsourcing evaluations, Hugging Face seeks to make model assessment more transparent, diverse, and resistant to gaming. This represents a structural shift in how the open-source AI community approaches model comparison and trust.

Evaluation and Benchmarking · Open Weights Progress · Open LLM Leaderboard · Hugging Face · Community Evals

3 · Hugging Face Blog · 1mo ago

Hugging Face Blog: Model Cards

This Hugging Face blog post discusses model cards as a documentation standard for machine learning models, covering their purpose, structure, and adoption within the ML community. Model cards provide structured metadata and transparency information about a model's intended use, limitations, training data, and evaluation results. The post likely outlines best practices and tooling support for creating and maintaining model cards on the Hugging Face Hub.

Evaluation and Benchmarking · AI Safety Research · Model Cards · Hugging Face

6 · arXiv · cs.CL · 25d ago

AI-Assisted Systematization for Evaluating GenAI Systems

This paper addresses a foundational gap in GenAI evaluation: the underspecification of broad, contested concepts like 'reasoning,' 'fairness,' or 'creativity.' The authors introduce a structured artifact called a 'concept spec' and a validation worksheet, then build two AI-assisted systematizers—a zero-shot approach and a multi-agent approach—to convert vague evaluation targets into measurable, structured accounts. They apply these tools to hate-based rhetoric and digital empathy, assessing the resulting specs on content validity and information recoverability. The work positions AI assistance as a scalable aid for the cognitively demanding process of evaluation design.

Evaluation and Benchmarking · AI Safety Research · hate-based rhetoric · concept spec · digital empathy

5 · arXiv · cs.CL · 29d ago

SynAE: Framework for Evaluating Synthetic Data Quality in Tool-Calling Agent Benchmarks

SynAE is a proposed evaluation framework for measuring how well synthetic datasets replicate and augment real data trajectories for multi-turn, tool-calling agent testing. It assesses validity, fidelity, and diversity across four metric categories: task instructions, tool calls, final outputs, and downstream evaluation. The paper demonstrates that no single metric suffices to characterize synthetic data quality, motivating multi-axis evaluation. A demo and code are publicly available.

Evaluation and Benchmarking · Agent and Tool Ecosystem · multi-turn agent benchmarks · tool-calling agents · SynAE

6 · arXiv · cs.AI · 4d ago

Bayesian audit framework for public AI evaluation archives challenges frontier model claims

A new arXiv preprint proposes a Bayesian inference and decision-audit framework for interpreting public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper demonstrates that a single terminal snapshot is compatible with multiple distinct performance histories, yielding ambiguous timing estimates for reaching capability ceilings. A candidate selection-aware frontier model is shown to fail synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, with fixed audit gates rejecting its stronger claims. The work proposes an archive-and-adjudication protocol to reconstruct evaluation histories and falsify unsupported frontier capability claims.

Evaluation and Benchmarking · AI Safety Research · Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations · GAIA · Open LLM Leaderboard

7 · arXiv · cs.CL · 25d ago

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases · Evaluation and Benchmarking · NeurIPS · Auto Benchmark Audit (ABA) · SWE-Bench Verified
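The claim that removing flawed tasks can change both average scores and model rankings can be shown with a toy calculation. The sketch below uses invented pass/fail results for two hypothetical models and an invented set of flagged tasks; it does not use ABA's tool, annotations, or any real benchmark data.

```python
# Made-up per-task results (1 = pass, 0 = fail) for two hypothetical models.
results = {
    "model_a": {"t1": 1, "t2": 1, "t3": 1, "t4": 0, "t5": 0},
    "model_b": {"t1": 1, "t2": 1, "t3": 0, "t4": 1, "t5": 1},
}

# Tasks an audit is assumed to have flagged as flawed (illustrative only).
flagged = {"t4", "t5"}


def mean_score(scores: dict, exclude=frozenset()) -> float:
    """Average pass rate over tasks that are not excluded."""
    kept = [value for task, value in scores.items() if task not in exclude]
    return sum(kept) / len(kept)


def ranking(all_results: dict, exclude=frozenset()) -> list:
    """Model names ordered from highest to lowest mean score."""
    return sorted(
        all_results,
        key=lambda name: mean_score(all_results[name], exclude),
        reverse=True,
    )


def overall_average(all_results: dict, exclude=frozenset()) -> float:
    """Mean of the per-model mean scores."""
    means = [mean_score(scores, exclude) for scores in all_results.values()]
    return sum(means) / len(means)


print("ranking, all tasks:     ", ranking(results))
print("ranking, flagged removed:", ranking(results, flagged))
print("average, all tasks:     ", round(overall_average(results), 3))
print("average, flagged removed:", round(overall_average(results, flagged), 3))
```

In this toy data the ranking flips once the two flagged tasks are dropped and the average across models rises, which is the kind of shift the summary describes, though the magnitudes here are arbitrary.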

6 · arXiv · cs.AI · 8d ago

AgentBeats: Standardized Agent Evaluation via A2A and MCP Protocols

A new arXiv preprint proposes Agentified Agent Assessment (AAA), a framework where evaluation is performed by judge agents interacting through standardized protocols—A2A for task management and MCP for tool access—rather than bespoke benchmark harnesses. The authors introduce AgentBeats as a concrete implementation, validated through a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, plus a coding-agent case study. The work addresses fragmentation in agent evaluation by decoupling assessment logic from agent implementation, enabling reproducible and interoperable benchmarking.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility · AgentBeats · MCP