Almanac
product

LAMBDA

product · active · provisional · lambda-64783dfd · 1 event · first seen 22h ago

Aliases: LAMBDA

Recent events (1)

5 · arXiv · cs.AI · 22h ago

Grading the Grader: Evaluating automated graders for agentic data analysis systems

An arXiv preprint investigates the reliability of automated graders for agentic data analysis systems, whose complex multi-modal outputs (code, numerical results, diagnostics) are harder to assess than single-turn LLM responses. The authors apply LAMBDA, a multi-agent data analysis system, to 153 numerical tasks from DSGym and develop a three-layer human-AI grading cascade that combines regex matching, LLM-based lenient grading, and human inspection. Key findings: both automated graders achieve 100% precision, a keyword-anchored extraction pipeline raises the strict grader's recall by 60 percentage points, and an iterative nudge mechanism raises grading success from 36% to 97%. The work offers methodological lessons for anyone building evaluation pipelines for agentic systems.
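
To make the three-layer cascade concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is not the authors' implementation: the function names, the "last keyword occurrence" anchoring rule, the tolerance, and the escalation policy (a layer's accept is trusted, anything else falls through to the next layer) are all assumptions made for illustration. The lenient grader is passed in as a plain callable so that any LLM client could be substituted.

```python
import math
import re
from typing import Callable, Optional, Tuple

# Matches integers, decimals, and scientific notation such as -3, 0.25, 1e-4.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def extract_after_keyword(output: str, keyword: str) -> Optional[float]:
    """Return the first number after the last occurrence of `keyword`, if any.

    Anchoring on the last occurrence is an assumption of this sketch.
    """
    anchors = list(re.finditer(re.escape(keyword), output, re.IGNORECASE))
    if not anchors:
        return None
    match = NUMBER.search(output, anchors[-1].end())
    return float(match.group()) if match else None


def strict_grade(output: str, expected: float, keyword: str,
                 rel_tol: float = 1e-6) -> bool:
    """Layer 1: regex extraction plus numeric comparison."""
    value = extract_after_keyword(output, keyword)
    return value is not None and math.isclose(value, expected, rel_tol=rel_tol)


def grade_cascade(output: str, expected: float, keyword: str,
                  lenient_grader: Callable[[str, float], bool]) -> Tuple[str, str]:
    """Run the layers in order and report (verdict, deciding layer).

    `lenient_grader` is a hypothetical stand-in for an LLM-based judge.
    """
    if strict_grade(output, expected, keyword):
        return ("correct", "regex")
    if lenient_grader(output, expected):
        return ("correct", "llm")
    # Neither automated layer accepted: escalate to human inspection.
    return ("needs_review", "human")


def toy_lenient_grader(output: str, expected: float) -> bool:
    """Toy judge for demonstration only: accepts if the expected value
    appears anywhere in the text as a number."""
    return any(math.isclose(float(m), expected, rel_tol=1e-6)
               for m in NUMBER.findall(output))
```

Trusting accepts and escalating everything else is one plausible reading of the reported 100% precision: if an automated layer's "correct" verdicts are reliable, only the items it fails to accept need a more expensive look.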