5 · arXiv cs.LG (Machine Learning) · 22d ago

Language Generation in the Limit with Bounded Memory: Characterization via Sperner's Theorem

This paper studies language generation in the limit under bounded memory constraints, extending classical learning theory to the generation setting. The authors characterize when memoryless generation is possible, derive minimax density bounds using Sperner's theorem and symmetric chain decompositions, and show that adaptively chosen memory outperforms sliding-window memory. They also revisit incremental identification in the limit, finding that exact identification fails for collections of three or more languages but an approximate relaxation is achievable for all finite collections.
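
For context on the combinatorial tool named above: Sperner's theorem says that the largest antichain (a family of sets in which none contains another) in the power set of an n-element set has size C(n, ⌊n/2⌋). The sketch below is not taken from the paper. It brute-force checks the statement for very small n, and the function names are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import comb


def is_antichain(family):
    """Return True if no set in the family is a proper subset of another."""
    return not any(a < b or b < a for a, b in combinations(family, 2))


def largest_antichain_size(n):
    """Brute-force the maximum antichain size in the power set of {0, ..., n-1}.

    Enumerates every family of subsets, so it is only feasible for n <= 4.
    """
    subsets = [
        frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)
    ]
    best = 0
    for mask in range(1 << len(subsets)):
        family = [s for i, s in enumerate(subsets) if (mask >> i) & 1]
        # Only run the pairwise check when the family could improve the best.
        if len(family) > best and is_antichain(family):
            best = len(family)
    return best


if __name__ == "__main__":
    for n in range(1, 5):
        # Sperner's theorem predicts the two printed numbers agree.
        print(n, largest_antichain_size(n), comb(n, n // 2))
```

The middle layer of the subset lattice (all subsets of size ⌊n/2⌋) attains the bound, which is why the binomial coefficient appears.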

Evaluation and Benchmarking · AI Safety Research · Sperner's Theorem · Language Generation in the Limit · Identification in the Limit · Minimax Density · Symmetric Chain Decompositions

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.LG · 25d ago

Self-Generated Replay Nearly Eliminates Catastrophic Forgetting in Language Models

This paper investigates catastrophic forgetting in language models during continual learning, finding that models can use self-generated samples from their own training distribution as effective replay data, nearly eliminating forgetting without requiring stored exemplars. The authors identify two key conditions where forgetting persists: when models are pretrained near capacity saturation (leaving no room for new knowledge), and when low learning rates are used to reduce forgetting at the cost of requiring far more training steps. Self-generated replay breaks this learning-rate/forgetting tradeoff, enabling fast high-learning-rate finetuning without degradation on prior tasks.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · catastrophic forgetting · Language Model Finetuning · Continual Learning · +2 more

6 · arXiv · cs.CL · 1mo ago

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking · Agent and Tool Ecosystem · web navigation benchmark · Mem-π · large language model agents · +3 more

6 · arXiv · cs.CL · 22d ago

Parametric Memory Law for LoRA Finetuning: Quantifying LLM Memory Capacity

This paper introduces the Parametric Memory Law, a power-law relationship linking loss reduction to effective parameters and sequence length during LoRA-based LLM finetuning. The authors identify a phase transition at the token level where prediction probability p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Building on these findings, they propose MemFT, a threshold-guided optimization strategy that dynamically reallocates training budget toward sub-threshold tokens, improving memory fidelity and efficiency.
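
Why p > 0.5 is sufficient follows from a short argument: token probabilities sum to 1, so a token above 0.5 must be the unique argmax, and greedy decoding then selects it at every step. The sketch below illustrates this with made-up probabilities. It is not the paper's code, and the helper names are hypothetical.

```python
def greedy_pick(probs):
    """Return the index of the highest-probability token (greedy decoding)."""
    return max(range(len(probs)), key=probs.__getitem__)


def recalled_verbatim(step_probs, target_ids):
    """Check whether greedy decoding reproduces the target sequence.

    step_probs[i] is the next-token distribution at step i given the
    correct prefix; if every step is decoded correctly, the prefix stays
    correct, so the per-step check is enough.
    """
    return all(greedy_pick(p) == t for p, t in zip(step_probs, target_ids))


def all_above_half(step_probs, target_ids):
    """The sufficient condition: every target token has probability > 0.5."""
    return all(p[t] > 0.5 for p, t in zip(step_probs, target_ids))


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    steps = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
    targets = [0, 1]
    print(all_above_half(steps, targets), recalled_verbatim(steps, targets))
```

The condition is sufficient but not necessary: a target token with probability 0.4 is still recalled when every competitor sits below 0.4.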

Evaluation and Benchmarking · Agent and Tool Ecosystem · large language models · MemFT · Parametric Memory Law · +3 more

5 · arXiv · cs.CL · 2d ago

Large Language Gibbs: MCMC-based structured probabilistic inference using LLM conditionals

Researchers propose Large Language Gibbs, a structured inference scheme that uses an LLM's conditional token distributions as transition operators in a Gibbs sampling (MCMC) loop, iteratively resampling individual variables rather than generating outputs in a single autoregressive pass. The approach targets order-dependent biases in standard generation and aims to produce a stationary distribution reflecting a coherent compromise across all local conditionals. It is evaluated on synthetic distributions, consistent reasoning tasks, and Bayesian structure learning, showing MCMC-based inference is a practical alternative to one-pass generation for structured probabilistic tasks.
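
As a rough illustration of the Gibbs loop described above, the sketch below resamples one variable at a time from its conditional given the others. Everything here is an assumption made for illustration: a small hand-written joint table stands in for the LLM's conditional distributions, and the names `JOINT`, `conditional`, and `gibbs` are hypothetical rather than drawn from the paper.

```python
import random
from collections import Counter

# Toy joint distribution over two binary variables (illustrative numbers).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def conditional(index, state):
    """P(x_index = 1 | the other variable), derived from JOINT."""
    s0 = list(state)
    s1 = list(state)
    s0[index], s1[index] = 0, 1
    p0, p1 = JOINT[tuple(s0)], JOINT[tuple(s1)]
    return p1 / (p0 + p1)


def gibbs(num_samples, burn_in=500, seed=0):
    """Run a systematic-scan Gibbs sampler and return empirical frequencies."""
    rng = random.Random(seed)
    state = [0, 0]
    counts = Counter()
    for step in range(burn_in + num_samples):
        # One sweep: resample each variable from its local conditional.
        for i in range(len(state)):
            state[i] = 1 if rng.random() < conditional(i, state) else 0
        if step >= burn_in:
            counts[tuple(state)] += 1
    return {k: v / num_samples for k, v in counts.items()}


if __name__ == "__main__":
    estimate = gibbs(20000)
    for outcome, prob in sorted(JOINT.items()):
        print(outcome, prob, round(estimate.get(outcome, 0.0), 3))
```

In this toy case the conditionals come from a single consistent joint, so the chain's stationary distribution is that joint. Conditionals read off an LLM need not be mutually consistent, which is the setting the summary's "coherent compromise" refers to.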

Evaluation and Benchmarking · Inference Economics · Gibbs Sampling · Large Language Gibbs

7 · arXiv · cs.LG · 26d ago

Shannon Scaling Law: A Noisy-Channel Framework for LLM Capacity and Non-Monotonic Training Phenomena

Researchers propose the Shannon Scaling Law, a theoretical framework that models LLM training as information transmission over a noisy channel using the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, the framework introduces a fundamental SNR-based capacity limit that explains non-monotonic phenomena like catastrophic overtraining and quantization-induced degradation that classical power-law scaling laws cannot capture. Validated on Pythia and OLMo2 under Gaussian noise, quantization, and fine-tuning perturbations, the law achieves strong R² scores and successfully extrapolates from 6.9B to 12B parameter models trained on up to 307B tokens. The framework outperforms both classical and perturbation-aware scaling laws, predicting U-shaped performance degradation when SNR is insufficient.
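
For reference, the Shannon-Hartley theorem that the framework builds on gives the capacity of a band-limited channel with Gaussian noise as C = B · log2(1 + S/N). The sketch below computes only that textbook formula. How the paper maps parameters to bandwidth and tokens to signal power is not specified in this summary, so no such mapping is attempted, and the function name is hypothetical.

```python
from math import log2


def channel_capacity(bandwidth, snr):
    """Shannon-Hartley capacity C = B * log2(1 + S/N).

    bandwidth: channel bandwidth B (Hz in the classical setting).
    snr: linear signal-to-noise power ratio S/N (not in decibels).
    """
    if bandwidth < 0 or snr < 0:
        raise ValueError("bandwidth and snr must be non-negative")
    return bandwidth * log2(1.0 + snr)


if __name__ == "__main__":
    # Capacity grows linearly in bandwidth but only logarithmically in SNR.
    for snr in (1.0, 3.0, 15.0):
        print(snr, channel_capacity(1.0, snr))
```

The logarithmic dependence on S/N is what makes capacity saturate as signal power grows at fixed bandwidth and noise.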

Training Infrastructure · Evaluation and Benchmarking · Shannon-Hartley Theorem · Shannon Scaling Law · Pythia · +5 more

6 · arXiv · cs.CL · 29d ago

Hyperfitting Explained: Terminal Geometric Expansion in Final Transformer Layers Drives Diversity Gains

This paper investigates the 'hyperfitting' phenomenon—where fine-tuning LLMs to near-zero loss on small datasets improves open-ended generation and reduces repetition—and demonstrates it is mechanistically distinct from temperature scaling. Entropy-matched control experiments falsify both the temperature-equivalence and static vocabulary reweighting hypotheses, instead localizing the effect to a 'Terminal Expansion' in the final transformer block where feature-space dimensionality expands by ~80.8 dimensions, enabling promotion of deep-tail tokens via context-dependent rank reordering. The authors introduce Late-Stage LoRA, a targeted fine-tuning strategy updating only the final 5 layers, achieving robust generation with minimal parameter updates.

Inference Economics · Alignment and RLHF · Terminal Expansion · large language models · temperature scaling · +3 more

6 · Hacker News · 25d ago

A Sleep-Like Consolidation Mechanism for LLMs

A preprint on arXiv proposes a sleep-like memory consolidation mechanism for large language models, drawing an analogy to biological sleep-based memory consolidation in neural systems. The work appears to address how LLMs might better retain and integrate new information over time, a key challenge in continual learning and knowledge updating. The paper attracted notable community attention on Hacker News with 164 points and 122 comments, suggesting broad interest in the approach.

Frontier Model Releases · Alignment and RLHF · ArXiv · A sleep-like consolidation mechanism for LLMs

5 · arXiv · cs.LG · 17d ago

Sleep paradigm for LLMs enables continual learning and memory consolidation via distillation and RL

A new arXiv preprint proposes a 'Sleep' paradigm for language models that enables continual learning by consolidating short-term in-context memories into long-term parameters. The framework has two stages: Knowledge Seeding (distilling a smaller model's memories into a larger network via on-policy distillation combined with RL-based imitation learning) and Dreaming (self-improvement via RL-generated synthetic curricula without human supervision). Experiments cover long-horizon tasks, continual learning, knowledge incorporation, and few-shot generalization, addressing a known weakness of current LLMs in retaining temporal knowledge across contexts.

Alignment and RLHF · Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories · Knowledge Seeding · Generalized Distillation