Entity · benchmark

all-edge F1

benchmark · active · all-edge-f1-413c0941 · 1 event · first seen May 26, 2026

Aliases: all-edge F1

Co-occurring entities

GPT-5.2-high · causal discovery · CausaLab · Structural Causal Model

More like this (12)

Evidence-Grounded F1 · BERT-F1 · Falcon-Edge · RadGraph F1 · CheXbert F1 · MiniF2F · FSE · MiniF2F-Test · Macro-F1 · Cosmos 3 Edge · alignment faking · FastMCP

Recent events (1)

arXiv · cs.CL · May 26, 2026

CausaLab: Scalable Benchmark for Interactive Causal Discovery by LLM Agents

CausaLab is a new evaluation environment that tests LLM agents on interactive causal discovery tasks, requiring them to recover both causal graphs and structural equations from synthetic laboratory episodes governed by randomly sampled structural causal models (SCMs). The benchmark separates predictive accuracy from genuine causal understanding, revealing a persistent gap: GPT-5.2-high achieves 92% task accuracy in a 6-node observational setting but only 0.471 all-edge F1 for mechanism recovery. Mixed observation-intervention strategies improve structural fidelity, while pure intervention strategies underperform on both metrics. Premature stopping is identified as a key agent weakness, partially mitigated by prompting models to verify hypothesis-data consistency.
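The summary above does not spell out how all-edge F1 is computed. As a rough illustration only, the sketch below assumes it is the ordinary F1 score taken over the set of directed edges in the true and recovered causal graphs. The function name `edge_f1` and the toy graphs are hypothetical and are not taken from CausaLab.

```python
def edge_f1(true_edges, pred_edges):
    """F1 over directed edges, each given as a (cause, effect) pair.

    Assumption: an edge counts as correct only if both endpoints and
    the direction match. CausaLab's actual metric may differ.
    """
    true_set, pred_set = set(true_edges), set(pred_edges)
    if not true_set and not pred_set:
        return 1.0  # both graphs empty: treated as perfect agreement (a convention)
    tp = len(true_set & pred_set)  # edges present in both graphs
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)  # share of predicted edges that are real
    recall = tp / len(true_set)     # share of real edges that were recovered
    return 2 * precision * recall / (precision + recall)


# Toy 4-node example: two of three predicted edges are correct,
# and two of the four true edges are recovered.
true_graph = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
pred_graph = [("A", "B"), ("C", "B"), ("A", "C")]

score = edge_f1(true_graph, pred_graph)
print(round(score, 3))  # 0.571
```

Under this assumed definition, a reversed edge such as C → B counts as both a false positive and a missed true edge. That is one way an agent could predict outcomes well while still scoring low on structure recovery.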

Evaluation and Benchmarking · AI Safety Research · all-edge F1 · GPT-5.2-high · causal discovery