paper

How reliable are LLMs when it comes to playing dice?

paper · active · provisional · how-reliable-are-llms-when-it-comes-to-playing-dice--c116aeb0 · 1 event · first seen 9d ago

Aliases: How reliable are LLMs when it comes to playing dice?

More like this (12)

How reliable are LLMs when it comes to playing dice? · LLM inference · LLM-judge scoring · LLM-judged explanation score · LLM evaluation · LLM-as-a-Judge · frontier LLMs · long-context LLMs · LLM Safety Leaderboard · LLM agents · Fast-dLLM · Flaws in the LLM Automation Narrative

Recent events (1)

5 · arXiv · cs.AI · 9d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document a token bias that causes performance drops of more than 20% when canonical problem formulations are disguised, and degradation of up to 34% when misleading suggestions are embedded in prompts. They conclude that current LLMs are not genuine probabilistic reasoners, despite their success on advanced math benchmarks.
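To make the idea of a "counterintuitive" discrete probability exercise concrete, here is a minimal Python sketch that checks one such problem by exhaustive enumeration. The dice problem and the helper `conditional_probability` are hypothetical illustrations, not items or code from the paper's datasets. The last lines only restate the gap between the two reported average accuracies.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, given, sides=6, dice=2):
    """Exact P(event | given) over all equally likely rolls of fair dice."""
    outcomes = list(product(range(1, sides + 1), repeat=dice))
    conditioned = [roll for roll in outcomes if given(roll)]
    favourable = [roll for roll in conditioned if event(roll)]
    return Fraction(len(favourable), len(conditioned))


# Hypothetical exercise (an assumption, not taken from the paper):
# two fair dice are rolled and at least one shows a six.
# What is the probability that both show a six?
def both_six(roll):
    return roll == (6, 6)


def at_least_one_six(roll):
    return 6 in roll


answer = conditional_probability(both_six, at_least_one_six)
print(answer)  # 1/11, whereas the intuitive guess is often 1/6

# Gap between the reported average accuracies (0.96 standard, 0.59 counterintuitive).
standard = Fraction(96, 100)
counterintuitive = Fraction(59, 100)
absolute_gap = standard - counterintuitive   # 37/100
relative_gap = absolute_gap / standard       # 37/96, roughly 0.385
print(float(absolute_gap), round(float(relative_gap), 3))
```

Enumeration is used here because the sample space is tiny (36 rolls), so the exact answer can be checked without any probabilistic shortcut that might itself be a heuristic.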

Evaluation and Benchmarking · AI Safety Research