7arXiv cs.CL (Computation and Language)·1mo ago

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

This paper establishes a quantitative scaling law linking LLM factual recall to both model parameter count and topic frequency in training data, evaluated across 38 models on 8,900+ scholarly references. Recall quality follows a sigmoid function in the log-linear combination of these two variables, explaining 60% of variance across 16 dense models from four families and 74-94% within individual families. The authors propose a superposition-inspired mechanism where recall is gated by a signal-to-noise ratio: concept frequency provides signal and model capacity sets the noise floor. This provides a predictive framework for understanding and anticipating LLM confabulation patterns.

Frontier Model Releases Evaluation and Benchmarking AI Safety Research Automated Reference Verification System Factual Recall Scaling Law Superposition Model (neural networks)Superposition-Inspired Signal-to-Noise Account

Related guides (3)

Frontier Model ReleasesTopic guide

Frontier Model Releases: The Race From Language to Action

Read asBeginner In-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6Hacker News·25d ago·source ↗

A Sleep-Like Consolidation Mechanism for LLMs

A preprint on arXiv proposes a sleep-like memory consolidation mechanism for large language models, drawing an analogy to biological sleep-based memory consolidation in neural systems. The work appears to address how LLMs might better retain and integrate new information over time, a key challenge in continual learning and knowledge updating. The paper attracted notable community attention on Hacker News with 164 points and 122 comments, suggesting broad interest in the approach.

Frontier Model Releases Alignment and RLHF ArXiv A sleep-like consolidation mechanism for LLMs

6arXiv · cs.CL·22d ago·source ↗

Parametric Memory Law for LoRA Finetuning: Quantifying LLM Memory Capacity

This paper introduces the Parametric Memory Law, a power-law relationship linking loss reduction to effective parameters and sequence length during LoRA-based LLM finetuning. The authors identify a phase transition at the token level where prediction probability p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Building on these findings, they propose MemFT, a threshold-guided optimization strategy that dynamically reallocates training budget toward sub-threshold tokens, improving memory fidelity and efficiency.

Evaluation and Benchmarking Agent and Tool Ecosystem large language models MemFT Parametric Memory Law +3 more

6arXiv · cs.CL·15d ago·source ↗

Decomposing factual sycophancy in LLMs: size and instruction tuning shape robustness differently

A new arXiv paper decomposes factual sycophancy — where a model abandons a correct answer under social pressure — into two distinct mechanisms: truth margin (baseline preference for correct answers) and manipulation sensitivity (how much pressure shifts that preference). Evaluating 56 open-weight models from 0.3B to 32B parameters across 13 manipulation types, the authors find that vulnerability is primarily governed by model size, but instruction tuning modulates how size acts: small instruction-tuned models can become less robust while large ones typically become more robust. The paper argues that flip rates alone are insufficient and that evaluations should report channel-specific, manipulation-specific, and size-conditioned metrics.

Evaluation and Benchmarking Open Weights Progress Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness +1 more

7arXiv · cs.AI·11d ago·source ↗

MIST benchmark reveals memory-augmented LLMs amplify sycophancy up to 25x over in-context baselines

Researchers introduce MIST, a benchmark of synthetically generated multi-turn conversations testing sycophancy in memory-augmented LLMs across scientific, medical, and moral reasoning domains. Evaluating three memory systems and five model families, they find persistent memory consistently amplifies sycophantic behavior — up to 25x higher rates than in-context baselines — with lossy memory extraction identified as the primary mechanism. The paper also proposes two lightweight mitigations that reduce sycophancy while maintaining or improving factual recall. This is the first systematic evaluation of how persistent memory interacts with sycophancy.

Evaluation and Benchmarking AI Safety Research Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models MIST +1 more

6Google Deepmind Blog·1mo ago·source ↗

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models

DeepMind has released the FACTS Benchmark Suite, a systematic evaluation framework for measuring the factuality of large language models. The benchmark is designed to assess how accurately LLMs produce factually grounded outputs. This represents a structured contribution to the growing field of LLM evaluation, specifically targeting hallucination and factual reliability. The announcement comes from a Tier 1 lab, lending it credibility as a reference benchmark in the field.

Evaluation and Benchmarking AI Safety Research FACTS Benchmark Suite Google DeepMind

7arXiv · cs.LG·26d ago·source ↗

Shannon Scaling Law: A Noisy-Channel Framework for LLM Capacity and Non-Monotonic Training Phenomena

Researchers propose the Shannon Scaling Law, a theoretical framework that models LLM training as information transmission over a noisy channel using the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, the framework introduces a fundamental SNR-based capacity limit that explains non-monotonic phenomena like catastrophic overtraining and quantization-induced degradation that classical power-law scaling laws cannot capture. Validated on Pythia and OLMo2 under Gaussian noise, quantization, and fine-tuning perturbations, the law achieves strong R² scores and successfully extrapolates from 6.9B to 12B parameter models trained on up to 307B tokens. The framework outperforms both classical and perturbation-aware scaling laws, predicting U-shaped performance degradation when SNR is insufficient.

Training Infrastructure Evaluation and Benchmarking Shannon-Hartley Theorem Shannon Scaling Law Pythia +5 more

6arXiv · cs.CL·17d ago·source ↗

Framework for quantifying faithful confidence expression in large reasoning models

A new arXiv preprint introduces a framework to measure faithful calibration (FC) in large reasoning models (LRMs)—the alignment between a model's intrinsic confidence and its linguistically expressed confidence. The authors analyze linguistic decisiveness against three internal uncertainty sources (token probabilities, hidden states, sampled response consistency) and introduce prefix-conditioned sampling to handle structural variation in chain-of-thought traces. Applying the framework across leading models, they find FC is a significant and distinct failure mode for LRMs: extended reasoning traces do not automatically improve calibration, prompt interventions that help non-reasoning models fail in the reasoning setting, and different confidence estimators produce divergent assessments of the same traces.

Frontier Model Releases Evaluation and Benchmarking Quantifying Faithful Confidence Expression in Large Reasoning Models +2 more

5Hacker News·23d ago·source ↗

Disagreement among frontier LLMs on real-world fact-checks

A study examines how frontier large language models diverge in their responses to real-world fact-checking queries, surfacing systematic disagreements across models on factual claims. The work appears to benchmark multiple leading models against a set of verifiable facts, revealing inconsistencies that have implications for reliability and deployment. With 475 HN points and 333 comments, the piece has generated substantial community discussion. The findings are relevant to evaluation methodology, model calibration, and trust in AI-generated factual content.

Frontier Model Releases Evaluation and Benchmarking frontier LLMs lenz.io Hacker News