Entity · benchmark

GSM-Symbolic

benchmark · active · gsm-symbolic-27106630 · 1 event · first seen May 28, 2026

Aliases: GSM-Symbolic

Co-occurring entities

Mirzadeh et al. 2025 · Generalised Linear Mixed Models · GSM8K

More like this (12)

GSM8K · GSME · SimSD · SymbolicLight V1 · SD-GPS · FigSIM · SGSD · Symbolic Geometric Agent · symbolic verifier outputs · Ericsson · System Level Synthesis · MobileWorld

Recent events (1)

6 · arXiv · cs.CL · May 28, 2026

Statistical Re-evaluation of GSM-Symbolic Finds Benchmark Confounds and Overstated Reasoning Conclusions

A re-evaluation of the GSM-Symbolic benchmark (Mirzadeh et al., 2025) challenges its conclusion that LLMs lack genuine reasoning capabilities. Fitting Generalised Linear Mixed Models to results from 20 open-weight models, the authors find that only half show statistically significant performance drops. They also identify a previously unacknowledged distributional shift toward larger integers in GSM-Symbolic relative to GSM8K, which accounts for significance in roughly half the remaining cases. After controlling for this confound, model-specific failure profiles emerge, including variable binding fragility, arithmetic limitations, and dual-task interference. This suggests that the original blanket claims about LLM reasoning were both statistically premature and mechanistically misleading.
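The confound described above can be illustrated with a minimal sketch. It uses plain stratification by integer size, not the Generalised Linear Mixed Models reported in the paper, and everything in it is an assumption made for illustration: the helper names `max_integer` and `accuracy_by_stratum`, the threshold of 100, and the invented `toy_records` (which are not data from either benchmark).

```python
import re
from collections import defaultdict


def max_integer(question: str) -> int:
    """Return the largest integer literal in the text, or 0 if there is none.

    Crude by design: decimals such as "3.5" are read as two integers.
    """
    numbers = [int(n) for n in re.findall(r"\d+", question)]
    return max(numbers, default=0)


def accuracy_by_stratum(records, threshold=100):
    """Compute accuracy per (benchmark, stratum) cell.

    records: iterable of (benchmark, question, correct) tuples.
    threshold: hypothetical cut-off between "small" and "large" integers.
    """
    cells = defaultdict(lambda: [0, 0])  # (benchmark, stratum) -> [correct, total]
    for benchmark, question, correct in records:
        stratum = "small" if max_integer(question) < threshold else "large"
        cell = cells[(benchmark, stratum)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: c / t for key, (c, t) in cells.items()}


# Invented toy data, for illustration only.
toy_records = [
    ("GSM8K", "Ann has 5 apples and buys 7 more.", True),
    ("GSM8K", "A shop sells 12 pens for 3 dollars.", True),
    ("GSM8K", "A truck carries 250 boxes of 40 items.", False),
    ("GSM-Symbolic", "Bo has 8 apples and buys 9 more.", True),
    ("GSM-Symbolic", "A truck carries 480 boxes of 35 items.", False),
    ("GSM-Symbolic", "A farm ships 920 crates of 60 eggs.", False),
]

result = accuracy_by_stratum(toy_records)
```

In this toy set the raw accuracy is lower on the GSM-Symbolic rows, yet within each size stratum the two benchmarks score the same, which is the pattern one would expect if integer magnitude, not the symbolic templating, drove the gap.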

Frontier Model Releases · Evaluation and Benchmarking · Mirzadeh et al. 2025 · Generalised Linear Mixed Models · GSM-Symbolic