Entity · benchmark

CAGE

benchmark · active · cage-42950a3d · 1 event · first seen May 28, 2026

Aliases: CAGE

Co-occurring entities

Pearl's Causal Hierarchy · Text-Only Probe · Chain-Text Probe · Abstraction Gap

More like this (12)

CRAG · CATT · CybORG CAGE-2 · CADE · CAISI · CUGA · COGENT · CARV · CISPO · CEM · CRAM · CAVE-ABSA

Recent events (1)

6 · arXiv · cs.CL · May 28, 2026

The Abstraction Gap in Vision-Language Causal Reasoning

Researchers introduce a dual-probe methodology and the CAGE benchmark (49,500 questions across 5,500 images) to distinguish linguistic plausibility from faithful causal reasoning in vision-language models (VLMs). An Abstraction Gap (AG) metric quantifies the normalized performance difference between text-only and chain-of-reasoning probes. Of eight VLMs evaluated, seven exhibit an AG above 0.50: they generate fluent causal text but fail structured causal-chain tasks. One model achieves near-zero AG, suggesting that architectural and pretraining choices are decisive. Fine-tuning on 45,000 chain-annotated examples fails to close the gap, pointing to a fundamental capability distinction.
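
The summary describes AG only as a "normalized performance difference" and does not give the formula. As a rough sketch, one plausible reading is the relative drop from text-only probe accuracy to chain probe accuracy. The function name `abstraction_gap`, the choice of normalizer, and the accuracy values below are all assumptions made for illustration; they are not the paper's definition or its results. The only figures taken from the summary are the benchmark sizes (49,500 questions, 5,500 images) and the 0.50 threshold.

```python
def abstraction_gap(text_only_acc: float, chain_acc: float) -> float:
    """Hypothetical AG: relative drop from text-only to chain probe accuracy.

    ASSUMPTION: normalizing by text-only accuracy is one plausible reading of
    "normalized performance difference"; the paper may define it differently.
    """
    if text_only_acc <= 0:
        raise ValueError("text-only accuracy must be positive")
    return (text_only_acc - chain_acc) / text_only_acc


# Made-up accuracies for illustration only (not results from the paper).
fluent_but_unfaithful = abstraction_gap(text_only_acc=0.80, chain_acc=0.30)
consistent = abstraction_gap(text_only_acc=0.80, chain_acc=0.78)

# Benchmark sizes from the summary: 49,500 questions across 5,500 images.
questions_per_image = 49_500 / 5_500

print(f"AG (large gap):  {fluent_but_unfaithful:.3f}")
print(f"AG (small gap):  {consistent:.3f}")
print(f"questions/image: {questions_per_image:.0f}")
```

Under this reading, a model that scores well on the text-only probe but poorly on the chain probe lands above the 0.50 threshold, while a model whose two scores nearly match lands near zero. The benchmark sizes work out to nine questions per image.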

Evaluation and Benchmarking · Agent and Tool Ecosystem · Pearl's Causal Hierarchy · CAGE · Text-Only Probe · +3 more