5 · arXiv · cs.LG (Machine Learning) · 7d ago

Study finds sequence probability predicts correctness across datasets but not within decoding decisions

A new arXiv paper investigates when sequence probability — the conditional probability of a model's output given a prompt — actually correlates with correctness in LLMs. The authors analyze this relationship across decoding methods, hyperparameters, prompt-answer pairs, and repeated responses, finding that higher sequence probability predicts correctness across dataset items but does not reliably transfer to decoding decisions or same-prompt repeated sampling. The findings have direct implications for the validity of decoding strategies, self-consistency methods, and verifier-free self-improvement approaches.
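To make the quantity concrete: for an autoregressive model, the sequence probability of an output given a prompt is the product of the per-token conditional probabilities, usually handled as a sum of log-probabilities. The sketch below is illustrative only and is not taken from the paper; the function names and the per-token probabilities are hypothetical values chosen for the example.

```python
import math

def sequence_log_prob(token_log_probs):
    """Sum per-token log-probabilities: log p(y | x) = sum_t log p(y_t | x, y_<t)."""
    return sum(token_log_probs)

def sequence_prob(token_log_probs):
    """Exponentiate the summed log-probabilities to get p(y | x)."""
    return math.exp(sequence_log_prob(token_log_probs))

# Hypothetical per-token probabilities for two candidate answers to one prompt.
answer_a = [math.log(0.9), math.log(0.8), math.log(0.5)]  # three tokens
answer_b = [math.log(0.6), math.log(0.5)]                 # two tokens

prob_a = sequence_prob(answer_a)
prob_b = sequence_prob(answer_b)

# Picking the higher-probability candidate is the kind of decision rule
# whose link to correctness the paper examines.
more_likely = "a" if prob_a > prob_b else "b"
print(prob_a, prob_b, more_likely)
```

Working in log space avoids numerical underflow for long outputs, since the raw product of many probabilities below 1 shrinks quickly.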

Evaluation and Benchmarking · Inference Economics · When are likely answers right? On Sequence Probability and Correctness in LLMs

Related guides (2)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Inference Economics · Topic guide

Inference Economics: The Hidden Cost Battle Shaping AI

Related events (8)

5 · arXiv · cs.AI · 25d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking · AI Safety Research · How reliable are LLMs when it comes to playing dice?

5 · arXiv · cs.CL · 22d ago

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

Evaluation and Benchmarking · Alignment and RLHF · On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

6 · arXiv · cs.LG · 10d ago

PAC-Bayes analysis establishes formal expressivity and alignment floors for prompt-conditioned LLMs

A new arXiv preprint models user-LLM interaction as a bilevel cheap-talk game and derives PAC-Bayes bounds showing two irreducible limitations: an 'expressivity floor' where language's finite channel capacity makes distinct tasks indistinguishable, and an 'objective-misalignment floor' where alignment constraints prevent reaching user-ideal outputs. The authors prove that prompt-conditioned LLMs cannot be universal problem solvers, as correct behavior on certain task families is provably unattainable even with infinite data, optimal training, or model scaling. The work suggests multimodal inputs and external memory as potential mitigations by increasing task-relevant information bandwidth.

Evaluation and Benchmarking · Alignment and RLHF · PAC-Bayes · On the Limits of Prompt-Conditioned Language Models as General-Purpose Learners

6 · arXiv · cs.AI · 21d ago

Study finds shared pattern-matching mechanisms underlie both human and LLM everyday reasoning errors

A new arXiv paper evaluates human participants and 25 LLMs on commonsense causal reasoning tasks, finding similar error patterns in both groups. The authors identify specific attention heads driving LLM responses that implement pattern-matching, and show these heads can predict human reasoning errors caused by superficially irrelevant prompt details. The findings challenge the common assumption that human reasoning relies on principled abstract world models while LLMs merely pattern-match, suggesting both may share a more unified cognitive mechanism.

Evaluation and Benchmarking · AI Safety Research · Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

5 · arXiv · cs.CL · 28d ago

Pre-registered study finds Popperian code-generation prompt skills add no benefit beyond structural scaffolding

A pre-registered two-tier ablation study tests whether 'Popperian falsificationist' prompt skills improve LLM code generation through their procedural content or merely through structural scaffolding. Using Claude Sonnet 4.6 and Qwen2.5-Coder-0.5B with execution-based evaluation (HumanEval+ unit tests) rather than LLM-as-judge, the authors find that on the small model, structured prompts lift correctness by 20-22 points but the full Popperian skill shows no separable benefit over a labels-only scaffold. The paper contributes a calibrated negative result and a reusable disambiguation protocol for evaluating prompt-skill families, while also documenting that LLM self-judges at 0.5B scale perform no better than random selection.

Evaluation and Benchmarking · Claude Sonnet 4 · Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill · HumanEval

4 · arXiv · cs.CL · 10d ago

Variance-Calibrated Modulation (VCM): training-free decoding intervention to address LLM likelihood trap

Researchers propose Variance-Calibrated Modulation (VCM), a training-free pre-decoding method that reshapes LLM probability distributions before truncation to combat repetitive degeneration and vocabulary dullness. VCM combines two mechanisms: Contextual Searchlight via PMI (suppressing stopwords, elevating context-relevant tokens) and Adaptive Self-Debiasing (scale-invariant penalization using real-time logit standard deviation). Evaluated across open-ended generation, factual QA, and mathematical reasoning, VCM improves diversity, coherence, and reasoning accuracy at higher temperatures with negligible overhead. The method is compatible with existing decoding strategies like Top-p and Min-p.

Evaluation and Benchmarking · Inference Economics · Adaptive Self-Debiasing · Variance-Calibrated Modulation · Contextual Searchlight via PMI

5 · arXiv · cs.CL · 28d ago

PropMe framework distinguishes memorization capability from propensity in LLMs

A new arXiv preprint introduces PropMe, a framework that separates whether LLMs can be forced to reproduce training data (capability) from whether they do so under ordinary use (propensity). The authors also release SimpleTrace, a lightweight pipeline using infini-gram to attribute model outputs to training corpora. Evaluating two open models on Common Pile and Dynaword, they find a consistent gap: adversarial prefix attacks elicit strong memorization, but propensity scores remain low in non-adversarial settings. The paper argues memorization audits should report both worst-case extractability and ordinary leakage propensity.

Evaluation and Benchmarking · AI Safety Research · PropMe · SimpleTrace · Dynaword

6 · arXiv · cs.CL · 1mo ago

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

Frontier Model Releases · Evaluation and Benchmarking · large language models · Population-Matching Experiment · Pragmatic Reasoning