5arXiv cs.AI (Artificial Intelligence)·22h ago

Grading the Grader: Evaluating automated graders for agentic data analysis systems

A preprint from arXiv investigates the reliability of automated graders for evaluating agentic data analysis systems, which produce complex multi-modal outputs (code, numerical results, diagnostics) that are harder to assess than single-turn LLM responses. The authors apply LAMBDA, a multi-agent data analysis system, to 153 numerical tasks from DSGym and develop a three-layer human-AI grading cascade combining regex matching, LLM-based lenient grading, and human inspection. Key findings include: both automated graders achieve 100% precision, a keyword-anchored extraction pipeline raises strict grader recall by 60 percentage points, and an iterative nudge mechanism raises grading success from 36% to 97%. The work surfaces important methodological lessons for anyone building evaluation pipelines for agentic systems.

Evaluation and Benchmarking Agent and Tool Ecosystem Grading the Grader: Lessons from Evaluating an Agentic Data Analysis System DSGym LAMBDA QRData

Related guides (2)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·1mo ago·source ↗

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Agentic CLEAR is an automatic evaluation framework for LLM-based agentic systems that analyzes behavior at three granularity levels: system, trace, and node. Unlike existing tools that rely on static error taxonomies or focus only on observability, it dynamically generates textual insights and integrates above the observability layer with an accessible UI. Experiments across four benchmarks and seven agentic settings demonstrate strong alignment with human-annotated errors and predictive accuracy for task success rates.

Evaluation and Benchmarking AI Safety Research Agentic CLEAR multi-level agent evaluation LLM agents +1 more

4Hugging Face Blog·1mo ago·source ↗

Expert Support Case Study: Bolstering a RAG App with LLM-as-a-Judge

Hugging Face published a case study describing how Digital Green used an LLM-as-a-Judge approach to evaluate and improve a retrieval-augmented generation (RAG) application. The post covers the methodology for using LLMs to score and validate RAG outputs, providing a practical deployment pattern for quality assurance in production AI systems. It serves as a concrete example of enterprise-grade evaluation pipelines built on top of RAG architectures.

Evaluation and Benchmarking Enterprise Deployment Patterns LLM-as-a-Judge Digital Green Hugging Face +2 more

4Mistral Ai News·1mo ago·source ↗

Mistral AI: Using LLM-as-a-Judge with Structured Outputs for RAG Evaluation

Mistral AI published a technical guide on evaluating Retrieval-Augmented Generation (RAG) systems using the 'LLM as a Judge' paradigm combined with their structured outputs API feature. The approach implements the RAG Triad framework—context relevance, groundedness, and answer relevance—using Pydantic schemas to enforce machine-readable evaluation outputs. Mistral models serve as both the generator and judge components, enabling scalable automated evaluation without human annotators.

Evaluation and Benchmarking Enterprise Deployment Patterns Mistral AI RAG Triad Mistral Structured Outputs API +4 more

7arXiv · cs.CL·29d ago·source ↗

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA)SWE-Bench Verified +2 more

6arXiv · cs.AI·12d ago·source ↗

AgentBeats: Standardized Agent Evaluation via A2A and MCP Protocols

A new arXiv preprint proposes Agentified Agent Assessment (AAA), a framework where evaluation is performed by judge agents interacting through standardized protocols—A2A for task management and MCP for tool access—rather than bespoke benchmark harnesses. The authors introduce AgentBeats as a concrete implementation, validated through a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, plus a coding-agent case study. The work addresses fragmentation in agent evaluation by decoupling assessment logic from agent implementation, enabling reproducible and interoperable benchmarking.

Evaluation and Benchmarking Agent and Tool Ecosystem AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility AgentBeats MCP +1 more

6arXiv · cs.CL·7d ago·source ↗

RubricsTree: Scalable hierarchical rubric framework for evaluating personal health AI agents

RubricsTree is a new evaluation framework for LLM-powered personal health agents, built around a hierarchical taxonomy of over 100 clinically-verifiable Boolean rubrics derived from 4,000 real user queries and curated with physician oversight. A context-aware router activates only relevant rubrics per query, enabling scalable yet expert-aligned evaluation. The framework outperforms strong LLM-as-a-judge baselines on expert alignment and, when used as training signal, yields up to ~66% relative gains on HealthBench across Gemini, GPT, and Qwen model families. The work addresses a concrete bottleneck in clinical deployment of health AI: the cost-quality tradeoff in evaluation.

Evaluation and Benchmarking AI Safety Research HealthBench RubricsTree Qwen +2 more

6arXiv · cs.AI·16d ago·source ↗

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

Evaluation and Benchmarking Agent and Tool Ecosystem Claude Opus 4.6 SWE-bench AARRI-Bench +2 more

4arXiv · cs.CL·5d ago·source ↗

Mechanistic analysis of how LLMs encode essay quality in internal representations

Researchers systematically probe the hidden representations of eight LLMs across three essay datasets (ASAP++, CSEE, ENEM) to understand how automated essay scoring (AES) works internally. Using linear probing, dimensionality reduction, and neuron-level analysis, they find essay quality is encoded in a linearly accessible form that emerges progressively across layers and partially transfers across prompts. Individual 'essay scoring neurons' are identified whose activations correlate with scores and respond to targeted interventions, with longer essays relying more on deeper layers. The work contributes to mechanistic interpretability of LLM-based scoring systems.

Evaluation and Benchmarking From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models CSEE ENEM +1 more