paper

The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans

paper · active · provisional · the-riddle-riddle-testing-flexible-reasoning-in-large-language-models-and-humans-372e1d11 · 1 event · first seen 3d ago


More like this (12)

Reasoning Language Models
Large Reasoning Models
Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models
Automated reproducibility assessments in the social and behavioral sciences using large language models
Single and Multi Truth Data Fusion using Large Language Models
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Civil Court Simulation with Large Language Models
Quantifying Faithful Confidence Expression in Large Reasoning Models
Words as Difference Makers: How Large Language Models Determine Causal Structure in Text
Long-context Reasoning Benchmarks
Multilingual Reasoning Cascades Need More Context

Recent events (1)

6 · arXiv · cs.CL · 3d ago

Riddle riddle paradigm reveals LLMs rely on pattern matching rather than flexible reasoning

Researchers introduce the 'riddle riddle' paradigm (word problems that mimic riddle structure but require only literal interpretation) to test whether LLMs reason flexibly or match surface patterns. Across nine state-of-the-art LLMs and 100 human participants, LLMs performed well on genuine riddles (84.9%) but poorly on riddle riddles (50.7%), while humans showed the reverse pattern. Error analysis found that 90.8% of LLM failures stemmed from inappropriate inventive reasoning, suggesting that LLM success on genuine riddles reflects memory retrieval rather than flexible strategy selection. The findings caution against conflating outputs that look like reasoning with genuine reasoning.

Evaluation and Benchmarking · AI Safety Research