Entity · paper

Mirzadeh et al. 2025

paper · active · mirzadeh-et-al-2025-c079887e · 1 event · first seen May 28, 2026

Aliases: Mirzadeh et al. 2025

Co-occurring entities

Generalised Linear Mixed Models · GSM-Symbolic · GSM8K

More like this (12)

Huang et al. 2025 · Dravid et al., 2023 · Wang et al. 2024 · Eloundou et al. · Yasmin Razavi · Bechiri and Lanasri [2026] · MediaEval Medico 2025 · Sepahr-Danesh · Alireza Rezvani · EC 2025 · HIPE-2026 · NeurIPS 2025

Recent events (1)

6 · arXiv · cs.CL · May 28, 2026

Statistical Re-evaluation of GSM-Symbolic Finds Benchmark Confounds and Overstated Reasoning Conclusions

A re-evaluation of the GSM-Symbolic benchmark (Mirzadeh et al., 2025) challenges its conclusion that LLMs lack genuine reasoning capabilities. Fitting Generalised Linear Mixed Models to results from 20 open-weight models, the authors find that only half show statistically significant performance drops. They also identify a previously unacknowledged distributional shift toward larger integers in GSM-Symbolic relative to GSM8K, which accounts for the significance in roughly half of the remaining cases. After controlling for this confound, model-specific failure profiles emerge, including variable-binding fragility, arithmetic limitations, and dual-task interference. This suggests that the original blanket claims about LLM reasoning were both statistically premature and mechanistically misleading.
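The integer-magnitude confound described above can be illustrated with a toy sketch. This is not the authors' GLMM analysis: it swaps in simple direct standardization over magnitude bins, and every number, bin label, and function name below is hypothetical and invented for illustration only. It shows how two benchmarks with identical per-bin accuracy can still differ in raw accuracy when one contains more large-integer items.

```python
# Toy illustration only: invented counts, NOT results from the paper.
# Each record is (benchmark, magnitude_bin, n_items, n_correct).
TOY = [
    ("GSM8K", "small", 80, 72),
    ("GSM8K", "large", 20, 12),
    ("GSM-Symbolic", "small", 40, 36),
    ("GSM-Symbolic", "large", 60, 36),
]


def raw_accuracy(records, benchmark):
    """Pooled accuracy, ignoring operand magnitude."""
    rows = [r for r in records if r[0] == benchmark]
    return sum(r[3] for r in rows) / sum(r[2] for r in rows)


def bin_weights(records, benchmark):
    """Share of items per magnitude bin in the reference benchmark."""
    rows = [r for r in records if r[0] == benchmark]
    total = sum(r[2] for r in rows)
    return {r[1]: r[2] / total for r in rows}


def standardized_accuracy(records, benchmark, weights):
    """Accuracy re-weighted to a common magnitude-bin distribution."""
    per_bin = {r[1]: r[3] / r[2] for r in records if r[0] == benchmark}
    return sum(weights[b] * per_bin[b] for b in weights)


if __name__ == "__main__":
    w = bin_weights(TOY, "GSM8K")  # use GSM8K's mix as the reference
    raw_gap = raw_accuracy(TOY, "GSM8K") - raw_accuracy(TOY, "GSM-Symbolic")
    adj_gap = (standardized_accuracy(TOY, "GSM8K", w)
               - standardized_accuracy(TOY, "GSM-Symbolic", w))
    print(f"raw gap: {raw_gap:.2f}, magnitude-adjusted gap: {adj_gap:.2f}")
```

In this toy setup the raw gap comes entirely from the different mix of small and large integers, so it vanishes after re-weighting. A mixed-model analysis of the kind the summary describes would presumably go further, for example by treating magnitude as a covariate and adding random effects, but that is an assumption about the method rather than something stated here.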

Frontier Model Releases · Evaluation and Benchmarking · Mirzadeh et al. 2025 · Generalised Linear Mixed Models · GSM-Symbolic