How reliable are LLMs when it comes to playing dice?

paper · active · provisional · how-reliable-are-llms-when-it-comes-to-playing-dice--e576728e · 1 event · first seen 9d ago

Recent events (1)

5 · arXiv · cs.AI · 9d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates eight state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors report that token bias causes performance drops of more than 20% when canonical problem formulations are disguised, and degradation of up to 34% when misleading suggestions are embedded in prompts. They argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.
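To make "counterintuitive" concrete, here is a small illustrative sketch in Python. The problem is a textbook example chosen for illustration and is not taken from the paper's datasets; the helper name `conditional_probability` and its parameters are likewise hypothetical. It computes an exact conditional probability by enumerating all equally likely rolls of two fair dice, and contrasts two phrasings that a quick heuristic tends to treat as the same question.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, condition, sides=6, dice=2):
    """Exact P(event | condition) by enumerating all equally likely rolls.

    Hypothetical helper for illustration; assumes fair, independent dice.
    """
    outcomes = list(product(range(1, sides + 1), repeat=dice))
    given = [roll for roll in outcomes if condition(roll)]
    if not given:
        raise ValueError("conditioning event has probability zero")
    hits = sum(1 for roll in given if event(roll))
    return Fraction(hits, len(given))


def both_sixes(roll):
    return all(d == 6 for d in roll)


def at_least_one_six(roll):
    return 6 in roll


def first_is_six(roll):
    return roll[0] == 6


# "At least one die shows a six": 11 rolls qualify, only one is a double six.
p_at_least_one = conditional_probability(both_sixes, at_least_one_six)

# "The first die shows a six": 6 rolls qualify, one is a double six.
p_first = conditional_probability(both_sixes, first_is_six)

print(p_at_least_one)  # 1/11
print(p_first)         # 1/6
```

The two conditions sound nearly identical, yet the exact answers differ (1/11 versus 1/6). Exact enumeration of this kind could serve as a ground-truth check when scoring model answers on small discrete problems, under the assumption that the sample space is small enough to list.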