Mirzadeh et al. 2025
Recent events (1)
Statistical Re-evaluation of GSM-Symbolic Finds Benchmark Confounds and Overstated Reasoning Conclusions
A re-evaluation of the GSM-Symbolic benchmark (Mirzadeh et al., 2025) challenges its conclusion that LLMs lack genuine reasoning capabilities. Using Generalised Linear Mixed Models on 20 open-weight models, the authors find that only half show statistically significant performance drops. They also identify a previously unacknowledged distributional shift toward larger integers in GSM-Symbolic relative to GSM8K, which accounts for significance in roughly half the remaining cases. After controlling for this confound, model-specific failure profiles emerge, including variable binding fragility, arithmetic limitations, and dual-task interference. This suggests the original blanket claims about LLM reasoning were both statistically premature and mechanistically misleading.
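To make the confound concrete, here is a minimal sketch of how a shift toward larger integers alone could produce an apparent accuracy drop between two benchmarks. It is not the authors' Generalised Linear Mixed Model analysis; it swaps in a much simpler stratified comparison. All counts, the "small"/"large" strata, and the function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical counts (NOT from the paper):
# (benchmark, integer-magnitude stratum) -> (n_items, n_correct).
# The toy model's accuracy depends only on magnitude (90% small, 60% large);
# the "symbolic" benchmark simply contains more large-integer items.
counts = {
    ("original", "small"): (80, 72),
    ("original", "large"): (20, 12),
    ("symbolic", "small"): (40, 36),
    ("symbolic", "large"): (60, 36),
}

def accuracy(benchmark, stratum=None):
    """Accuracy on a benchmark, optionally restricted to one stratum."""
    cells = [v for (b, s), v in counts.items()
             if b == benchmark and (stratum is None or s == stratum)]
    n_items = sum(c[0] for c in cells)
    n_correct = sum(c[1] for c in cells)
    return Fraction(n_correct, n_items)

def raw_drop():
    """Unadjusted accuracy difference, ignoring integer magnitude."""
    return accuracy("original") - accuracy("symbolic")

def stratified_drop():
    """Within-stratum accuracy difference, weighted by items per stratum."""
    strata = sorted({s for _, s in counts})
    total = sum(n for n, _ in counts.values())
    drop = Fraction(0)
    for s in strata:
        n_in_stratum = sum(n for (_, st), (n, _) in counts.items() if st == s)
        weight = Fraction(n_in_stratum, total)
        drop += weight * (accuracy("original", s) - accuracy("symbolic", s))
    return drop

print(float(raw_drop()), float(stratified_drop()))
```

In this toy setup the unadjusted comparison shows a 12-point drop, while the comparison within magnitude strata shows none, because the entire gap comes from the different mix of small and large integers. A mixed model with integer magnitude as a covariate addresses the same question in a more principled way.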