The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans
Riddle riddle paradigm reveals LLMs rely on pattern matching rather than flexible reasoning
Researchers introduce the 'riddle riddle' paradigm, in which word problems mimic riddle structure but require only literal interpretation, to test whether LLMs reason flexibly or match surface patterns. Across nine state-of-the-art LLMs and 100 human participants, LLMs performed well on genuine riddles (84.9%) but poorly on riddle riddles (50.7%), while humans showed the reverse pattern. Error analysis found that 90.8% of LLM failures stemmed from inappropriate inventive reasoning, which suggests that LLM success on genuine riddles reflects memory retrieval rather than flexible strategy selection. The findings caution against conflating outputs that look like reasoning with genuine reasoning.