GSM-Symbolic

benchmark · active · provisional · gsm-symbolic-27106630 · 1 event · first seen 20d ago

Aliases: GSM-Symbolic

Recent events (1)

6 · arXiv · cs.CL · 20d ago

Statistical Re-evaluation of GSM-Symbolic Finds Benchmark Confounds and Overstated Reasoning Conclusions

A re-evaluation of the GSM-Symbolic benchmark (Mirzadeh et al., 2025) challenges its conclusion that LLMs lack genuine reasoning capabilities. Fitting Generalised Linear Mixed Models to results from 20 open-weight models, the authors find that only half show statistically significant performance drops. They also identify a previously unacknowledged distributional shift toward larger integers in GSM-Symbolic relative to GSM8K, which accounts for the significance in roughly half of the remaining cases. After controlling for this confound, model-specific failure profiles emerge, including variable binding fragility, arithmetic limitations, and dual-task interference. This suggests that the original blanket claims about LLM reasoning were both statistically premature and mechanistically misleading.
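The integer-magnitude confound described above can be illustrated with a minimal sketch. This is not the authors' analysis (their Generalised Linear Mixed Models are not reproduced here). It only shows one simple way to check whether two problem sets differ in operand size. The function names and the four toy questions are hypothetical, invented for illustration, and are not drawn from GSM8K or GSM-Symbolic.

```python
import re
from statistics import median


def integers_in(text):
    """Return every run of digits in a problem statement as an int.

    Assumption: operands are plain non-negative integers. Numbers written
    with thousands separators or decimals would be split into pieces.
    """
    return [int(tok) for tok in re.findall(r"\d+", text)]


def median_operand(problems):
    """Median of all integers appearing across a list of problem texts."""
    values = [n for p in problems for n in integers_in(p)]
    return median(values)


# Hypothetical toy items standing in for the two benchmarks.
original_style = [
    "Ava has 3 apples and buys 4 more. How many apples does she have?",
    "A box holds 12 pens. How many pens are in 5 boxes?",
]
templated_style = [
    "Ava has 47 apples and buys 86 more. How many apples does she have?",
    "A box holds 64 pens. How many pens are in 29 boxes?",
]

# A positive shift means the templated set uses larger integers.
shift = median_operand(templated_style) - median_operand(original_style)
print(f"median operand shift: {shift}")
```

If such a shift is present, an accuracy drop on the templated set may partly reflect harder arithmetic and not weaker reasoning, which is why a magnitude covariate would need to be controlled for before comparing the two sets.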