6 · arXiv cs.LG (Machine Learning) · 8d ago

Task exchangeability framework enables statistically valid inference from synthetic data

A new arXiv preprint proposes a statistical framework for using synthetic data in scientific research with provable validity guarantees, centered on a condition called 'task exchangeability.' The framework requires identifying historical tasks with real data that are exchangeable with the current task of interest, enabling valid inference even when synthetic data is biased or misspecified. The authors demonstrate the approach on LLM-generated 'silicon samples' for public opinion surveys and LLM-as-a-judge AI evaluation settings. This addresses a foundational concern about the reliability of synthetic data pipelines increasingly used across AI evaluation and scientific research.
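
One way to see why exchangeability across tasks can yield valid inference from biased synthetic data is a conformal-style calibration over tasks. The sketch below is an illustration under stated assumptions, not the preprint's actual procedure: the function name `task_conformal_interval`, the absolute-gap score, and the example numbers are all hypothetical. It assumes each historical task supplies both a synthetic estimate and a real-data estimate, and that the synthetic-versus-real gaps are exchangeable with the gap on the current task.

```python
import math

def task_conformal_interval(historical, synthetic_current, alpha=0.1):
    """Interval for the current task's real-data quantity (illustrative sketch).

    historical: list of (synthetic_estimate, real_estimate) pairs, one per
    past task assumed exchangeable with the current task (hypothetical input).
    """
    # Score each past task by how far the synthetic estimate missed the real one.
    scores = sorted(abs(real - synth) for synth, real in historical)
    n = len(scores)
    # Standard split-conformal rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few historical tasks for this alpha: only a trivial interval is valid.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (synthetic_current - q, synthetic_current + q)

# Hypothetical survey example: estimates are percentages agreeing with an item.
past_tasks = [(50, 55), (40, 38), (62, 70), (30, 33), (45, 41),
              (58, 52), (35, 36), (48, 58), (52, 49)]
interval = task_conformal_interval(past_tasks, synthetic_current=50, alpha=0.2)
```

The interval width is driven by how badly synthetic estimates missed on past tasks, so a biased generator produces wider intervals rather than invalid ones, provided the exchangeability assumption holds.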

Evaluation and Benchmarking AI Safety Research Valid Inference with Synthetic Data via Task Exchangeability task exchangeability

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · arXiv · cs.CL · 29d ago

SynAE: Framework for Evaluating Synthetic Data Quality in Tool-Calling Agent Benchmarks

SynAE is a proposed evaluation framework for measuring how well synthetic datasets replicate and augment real data trajectories for multi-turn, tool-calling agent testing. It assesses validity, fidelity, and diversity across four metric categories: task instructions, tool calls, final outputs, and downstream evaluation. The paper demonstrates that no single metric suffices to characterize synthetic data quality, motivating multi-axis evaluation. A demo and code are publicly available.

Evaluation and Benchmarking Agent and Tool Ecosystem multi-turn agent benchmarks tool-calling agents SynAE +1 more

6 · arXiv · cs.AI · 4d ago

Causal auditing framework detects privacy disclosures in synthetic data without model access

A new arXiv preprint introduces a model-agnostic empirical framework for auditing synthetic data generated by LLMs and generative AI systems for privacy leakage. The framework distinguishes 'true disclosures' (direct reproduction of user data) from 'phantom disclosures' (incidental generation), using held-out control sets and statistical hypothesis testing without requiring model access, canary insertion, or shadow model training. It functions as a membership inference attack and provides empirical lower bounds on privacy leakage that are tighter than prior data-based auditing methods. The approach is computationally lightweight and applicable to any synthetic data generation mechanism.
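
The role of a held-out control set can be illustrated with a small exact test. The sketch below is a simplified stand-in, not the preprint's method: the function `audit_disclosures`, the exact-match criterion, and the toy records are hypothetical. It assumes the training and control sets are the same size and drawn from the same distribution, so that without leakage a synthetic record matching exactly one of the two sets is equally likely to match either.

```python
from math import comb

def audit_disclosures(synthetic, train, control):
    """Compare synthetic-record matches against training vs. control data.

    Returns (train_only_matches, control_only_matches, one_sided_p_value).
    Assumes train and control are equal-sized, identically distributed sets
    (an assumption of this sketch).
    """
    train_set, control_set = set(train), set(control)
    # Matches to training data only: candidate true disclosures.
    t = sum(1 for r in synthetic if r in train_set and r not in control_set)
    # Matches to control data only: incidental ("phantom") matches.
    c = sum(1 for r in synthetic if r in control_set and r not in train_set)
    n = t + c
    if n == 0:
        return t, c, 1.0
    # Under no leakage, t ~ Binomial(n, 0.5); p-value is P(X >= t).
    p = sum(comb(n, k) for k in range(t, n + 1)) / 2 ** n
    return t, c, p

# Toy records standing in for user data (hypothetical).
synthetic = ["a", "b", "c", "d", "x", "q"]
train = ["a", "b", "c", "d", "e"]
control = ["x", "y", "z", "w", "v"]
t, c, p = audit_disclosures(synthetic, train, control)
```

A small p-value suggests the generator reproduces training records more often than chance overlap with unseen data would explain. The test needs only the generated data and the two record sets, not access to the model.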

Evaluation and Benchmarking AI Safety Research Differential Privacy Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data

4 · arXiv · cs.CL · 5d ago

Synthetic data generation method enables small LLMs to match large models on Text-To-Cypher tasks

A new arXiv paper presents an automatic synthetic data generation method for fine-tuning small LLMs on Text-To-Cypher (Text2Cypher) parsing, enabling natural language interfaces to property graph databases. Experiments across major Text-To-Cypher benchmarks show that small fine-tuned models can compete with much larger proprietary models. The approach is positioned as a solution for local deployment scenarios requiring data sovereignty without expensive annotation.

Evaluation and Benchmarking Enterprise Deployment Patterns Cypher Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

4 · AI Snake Oil · 1mo ago

Can AI automate computational reproducibility?

This commentary introduces a new benchmark aimed at measuring AI's ability to automate computational reproducibility in scientific research. The piece examines whether AI systems can reliably re-execute and validate scientific computations, a key bottleneck in research integrity. It frames reproducibility automation as a concrete, measurable capability for evaluating AI's impact on science.

Evaluation and Benchmarking Agent and Tool Ecosystem Normal Tech / AI Snake Oil AI Reproducibility Benchmark

6 · arXiv · cs.AI · 4d ago

Bayesian audit framework for public AI evaluation archives challenges frontier model claims

A new arXiv preprint proposes a Bayesian inference and decision-audit framework for interpreting public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper demonstrates that a single terminal snapshot is compatible with multiple distinct performance histories, yielding ambiguous timing estimates for reaching capability ceilings. A candidate selection-aware frontier model is shown to fail synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, with fixed audit gates rejecting its stronger claims. The work proposes an archive-and-adjudication protocol to reconstruct evaluation histories and falsify unsupported frontier capability claims.

Evaluation and Benchmarking AI Safety Research Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations GAIA Open LLM Leaderboard +3 more

7 · arXiv · cs.LG · 1mo ago

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

This paper introduces Equilibrium Reasoners (EqR), a framework that formalizes test-time compute scaling through learned task-conditioned attractors in latent space, where stable fixed points correspond to valid solutions. EqR scales along two axes—depth (more iterations) and breadth (aggregating stochastic trajectories)—without requiring external verifiers or task-specific priors. On Sudoku-Extreme, unrolling up to 40,000 equivalent layers boosts accuracy from 2.6% (feedforward baseline) to over 99%. The work provides a mechanistic lens for understanding why iterative latent models generalize beyond memorized patterns.

Long Context Evolution Evaluation and Benchmarking task-conditioned attractors latent dynamical systems Sudoku-Extreme +3 more

3 · arXiv · cs.LG · 11d ago

LLM-augmented XAI framework with mutual feature interactions for network operations

A new arXiv paper proposes a framework combining LLMs with SHAP-based explainability, augmented by mutual feature interaction data, to generate natural language explanations for AI/ML models used in network operations. The approach is validated on an optical quality-of-transmission estimation task with human evaluators, showing 12.2% and 6.2% improvements in explanation usefulness and scope over a SHAP-only baseline, with 97.5% correctness. The work targets the gap between technical XAI outputs and actionable insights for non-specialist network operators.

Evaluation and Benchmarking Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions SHapley Additive exPlanations

5 · arXiv · cs.CL · 3d ago

Benchmark gap paper: EU AI Act requires doctrinal legal reasoning evals that don't yet exist

A new arXiv preprint identifies a critical measurement gap in legal AI evaluation: existing benchmarks test paralegal and ancillary tasks rather than doctrinal legal reasoning, which is the interpretive core of legal work. The authors argue this gap is not merely methodological but legally significant, because the EU AI Act's 'appropriate accuracy' requirement for high-risk AI in the judicial domain cannot be operationalized without a doctrinal-reasoning benchmark. The paper proposes a benchmark framework aimed at filling this gap under EU AI Act compliance requirements.

Evaluation and Benchmarking Regulatory Developments The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act EU AI Act