Entity · paper

How reliable are LLMs when it comes to playing dice?

paper · active · how-reliable-are-llms-when-it-comes-to-playing-dice--e576728e · 1 event · first seen Jun 8, 2026

More like this (12)

StreamingLLM · LLM-as-a-Verifier · LLM inference · Surrogate Fidelity: When Can Open LLMs Explain Closed Ones? · Can LLMs Reliably Self-Report Adversarial Prefills, and How? · LLM-judge scoring · LLM-judged explanation score · When the Judge Changes, So Does the Measurement: Auditing LLM-as-Judge Reliability · Will Scaling Improve Social Simulation with LLMs? · LLM evaluation · LLM-as-a-Judge · frontier LLMs

Recent events (1)

5 · arXiv · cs.AI · Jun 8, 2026

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates eight state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias that causes performance drops of 20% or more when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. They argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking · AI Safety Research