How reliable are LLMs when it comes to playing dice?
Recent events (1)
Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance
A new arXiv paper evaluates eight state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises, on which average accuracy is 0.96, and counterintuitive exercises designed to trigger heuristic reasoning, on which average accuracy falls to 0.59. The authors document token bias, which causes performance drops of 20% or more when canonical problem formulations are disguised, and degradation of up to 34% when misleading suggestions are embedded in prompts. They argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.
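As a rough illustration of the kind of discrete probability problem described above, the sketch below computes exact reference answers for dice questions by enumerating all outcomes, then compares a candidate answer against them. The two example questions, the function names `exact_probability` and `is_correct`, and the tolerance are assumptions made for illustration; they are not taken from the paper or its datasets.

```python
from fractions import Fraction
from itertools import product


def exact_probability(event, n_dice=2, sides=6):
    """Exact probability of `event` over all equally likely rolls of fair dice.

    `event` is a predicate that takes a tuple of face values.
    """
    outcomes = list(product(range(1, sides + 1), repeat=n_dice))
    favorable = sum(1 for roll in outcomes if event(roll))
    return Fraction(favorable, len(outcomes))


def is_correct(candidate, reference, tol=1e-6):
    """Hypothetical grading rule: accept answers within a small tolerance."""
    return abs(float(candidate) - float(reference)) <= tol


# Illustrative questions (assumed, not from the paper's datasets).
p_at_least_one_six = exact_probability(lambda roll: 6 in roll)
p_sum_is_seven = exact_probability(lambda roll: sum(roll) == 7)

print(p_at_least_one_six)  # 11/36
print(p_sum_is_seven)      # 1/6

# A tempting heuristic answer for "at least one six" is 2/6; it is wrong.
print(is_correct(Fraction(2, 6), p_at_least_one_six))    # False
print(is_correct(Fraction(11, 36), p_at_least_one_six))  # True
```

Enumeration is only practical for small outcome spaces, but it gives an exact reference that does not depend on how the question is worded.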