6arXiv cs.LG (Machine Learning)·17d ago

Rosetta Neurons follow sublinear power-law scaling with model size, becoming more monosemantic at scale

A new arXiv paper investigates how neuron populations evolve with scale in both language models (up to 30B parameters) and vision models (up to 5B parameters), focusing on 'Rosetta Neurons' — neurons with similar activation patterns across independently trained models. The authors find Rosetta Neurons grow in absolute count but shrink as a fraction of total neurons, and exhibit a 'Neuron Polarization Effect' where they become increasingly monosemantic while non-Rosetta neurons remain less selective. An analytical model explains the sublinear power-law scaling, and the paper demonstrates practical utility via a targeted data-filtering case study for continued pretraining. The results extend scaling laws to neuron-level interpretability structure, linking model size to systematic changes in universality and specialization.

Evaluation and Benchmarking AI Safety Research Rosetta Neurons Neuron Populations Exhibit Divergent Selectivity with Scale Dravid et al., 2023

Related guides (2)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

9Openai Blog·1mo ago·source ↗

Scaling Laws for Neural Language Models

OpenAI published foundational research establishing empirical scaling laws for neural language models, showing that model performance scales predictably with compute, data, and parameters. The work demonstrated power-law relationships between these factors and loss, providing a principled framework for allocating training resources. This paper became a cornerstone of modern large language model development strategy.

Training Infrastructure Frontier Model Releases Jared Kaplan Sam McCandlish OpenAI +3 more

6Openai Blog·1mo ago·source ↗

Language models can explain neurons in language models

OpenAI uses GPT-4 to automatically generate and score natural-language explanations for the behavior of individual neurons in large language models. The methodology is applied to all neurons in GPT-2, producing a public dataset of explanations and quality scores. The authors acknowledge the explanations are imperfect, framing this as an early step toward automated mechanistic interpretability. This work establishes a scalable pipeline for neuron-level analysis that could inform future interpretability and safety research.

Evaluation and Benchmarking AI Safety Research GPT-2 automated mechanistic interpretability neuron explanation dataset +2 more

6arXiv · cs.CL·29d ago·source ↗

Hyperfitting Explained: Terminal Geometric Expansion in Final Transformer Layers Drives Diversity Gains

This paper investigates the 'hyperfitting' phenomenon—where fine-tuning LLMs to near-zero loss on small datasets improves open-ended generation and reduces repetition—and demonstrates it is mechanistically distinct from temperature scaling. Entropy-matched control experiments falsify both the temperature-equivalence and static vocabulary reweighting hypotheses, instead localizing the effect to a 'Terminal Expansion' in the final transformer block where feature-space dimensionality expands by ~80.8 dimensions, enabling promotion of deep-tail tokens via context-dependent rank reordering. The authors introduce Late-Stage LoRA, a targeted fine-tuning strategy updating only the final 5 layers, achieving robust generation with minimal parameter updates.

Inference Economics Alignment and RLHF Terminal Expansion large language models temperature scaling +3 more

7arXiv · cs.LG·26d ago·source ↗

Shannon Scaling Law: A Noisy-Channel Framework for LLM Capacity and Non-Monotonic Training Phenomena

Researchers propose the Shannon Scaling Law, a theoretical framework that models LLM training as information transmission over a noisy channel using the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, the framework introduces a fundamental SNR-based capacity limit that explains non-monotonic phenomena like catastrophic overtraining and quantization-induced degradation that classical power-law scaling laws cannot capture. Validated on Pythia and OLMo2 under Gaussian noise, quantization, and fine-tuning perturbations, the law achieves strong R² scores and successfully extrapolates from 6.9B to 12B parameter models trained on up to 307B tokens. The framework outperforms both classical and perturbation-aware scaling laws, predicting U-shaped performance degradation when SNR is insufficient.

Training Infrastructure Evaluation and Benchmarking Shannon-Hartley Theorem Shannon Scaling Law Pythia +5 more

7Openai Blog·1mo ago·source ↗

Scaling Laws for Reward Model Overoptimization

OpenAI published research investigating how reward model overoptimization scales with policy and reward model size in RLHF pipelines. The work characterizes the relationship between KL divergence from the initial policy and gold-standard reward, finding predictable degradation patterns as optimization pressure increases. This provides empirical grounding for understanding Goodhart's Law dynamics in language model fine-tuning and has implications for designing safer, more robust RLHF training regimes.

Evaluation and Benchmarking AI Safety Research KL Divergence Goodhart's Law Scaling Laws for Reward Model Overoptimization +3 more

7arXiv · cs.LG·1mo ago·source ↗

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

This paper introduces Equilibrium Reasoners (EqR), a framework that formalizes test-time compute scaling through learned task-conditioned attractors in latent space, where stable fixed points correspond to valid solutions. EqR scales along two axes—depth (more iterations) and breadth (aggregating stochastic trajectories)—without requiring external verifiers or task-specific priors. On Sudoku-Extreme, unrolling up to 40,000 equivalent layers boosts accuracy from 2.6% (feedforward baseline) to over 99%. The work provides a mechanistic lens for understanding why iterative latent models generalize beyond memorized patterns.

Long Context Evolution Evaluation and Benchmarking task-conditioned attractors latent dynamical systems Sudoku-Extreme +3 more

7Openai Blog·1mo ago·source ↗

Deep Double Descent: Universal Phenomenon in CNNs, ResNets, and Transformers

OpenAI researchers demonstrate that the double descent phenomenon—where model performance improves, degrades, then improves again—occurs universally across CNNs, ResNets, and transformers as a function of model size, data size, or training time. The effect can often be masked by careful regularization, which may explain why it has been underappreciated. The underlying mechanism remains poorly understood, and the authors identify it as an important open research direction.

Frontier Model Releases Evaluation and Benchmarking Transformers Deep Double Descent CNN +2 more

6arXiv · cs.CL·17d ago·source ↗

Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers

A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.

Training Infrastructure Frontier Model Releases Triton Mamba Gated DeltaNet-2 +1 more