7OpenAI Blog·1mo ago

Deep Double Descent: Universal Phenomenon in CNNs, ResNets, and Transformers

OpenAI researchers demonstrate that the double descent phenomenon—where model performance improves, degrades, then improves again—occurs universally across CNNs, ResNets, and transformers as a function of model size, data size, or training time. The effect can often be masked by careful regularization, which may explain why it has been underappreciated. The underlying mechanism remains poorly understood, and the authors identify it as an important open research direction.

Frontier Model Releases · Evaluation and Benchmarking · Transformers · Deep Double Descent · CNN · OpenAI · ResNet

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6arXiv · cs.CL·17d ago·source ↗

Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers

A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.

Training Infrastructure Frontier Model Releases Triton Mamba Gated DeltaNet-2 +1 more

6arXiv · cs.CL·29d ago·source ↗

Hyperfitting Explained: Terminal Geometric Expansion in Final Transformer Layers Drives Diversity Gains

This paper investigates the 'hyperfitting' phenomenon—where fine-tuning LLMs to near-zero loss on small datasets improves open-ended generation and reduces repetition—and demonstrates it is mechanistically distinct from temperature scaling. Entropy-matched control experiments falsify both the temperature-equivalence and static vocabulary reweighting hypotheses, instead localizing the effect to a 'Terminal Expansion' in the final transformer block where feature-space dimensionality expands by ~80.8 dimensions, enabling promotion of deep-tail tokens via context-dependent rank reordering. The authors introduce Late-Stage LoRA, a targeted fine-tuning strategy updating only the final 5 layers, achieving robust generation with minimal parameter updates.

Inference Economics · Alignment and RLHF · Terminal Expansion · large language models · temperature scaling

6arXiv · cs.LG·17d ago·source ↗

Rosetta Neurons follow sublinear power-law scaling with model size, becoming more monosemantic at scale

A new arXiv paper investigates how neuron populations evolve with scale in both language models (up to 30B parameters) and vision models (up to 5B parameters), focusing on 'Rosetta Neurons' — neurons with similar activation patterns across independently trained models. The authors find Rosetta Neurons grow in absolute count but shrink as a fraction of total neurons, and exhibit a 'Neuron Polarization Effect' where they become increasingly monosemantic while non-Rosetta neurons remain less selective. An analytical model explains the sublinear power-law scaling, and the paper demonstrates practical utility via a targeted data-filtering case study for continued pretraining. The results extend scaling laws to neuron-level interpretability structure, linking model size to systematic changes in universality and specialization.

Evaluation and Benchmarking AI Safety Research Rosetta Neurons Neuron Populations Exhibit Divergent Selectivity with Scale Dravid et al., 2023

6OpenAI Blog·1mo ago·source ↗

AI and Efficiency: Algorithmic Progress Halving Training Compute Every 16 Months Since 2012

OpenAI released an analysis showing that compute required to match AlexNet-level ImageNet performance has decreased 44x since 2012, with algorithmic efficiency doubling every 16 months. This outpaces Moore's Law, which would have yielded only an 11x improvement over the same period. The findings suggest that for heavily-invested AI tasks, algorithmic progress is a larger driver of efficiency gains than hardware improvements alone.
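The two headline figures can be cross-checked with a little arithmetic. The sketch below assumes a measurement window of seven years (2012 to 2019) and a 24-month doubling period for Moore's Law; neither number is stated in the summary above, so treat both as assumptions. The function names are illustrative only.

```python
import math

def doubling_time_months(improvement_factor, months_elapsed):
    """Doubling period implied by a total improvement over a time window."""
    return months_elapsed / math.log2(improvement_factor)

def improvement_from_doubling(months_elapsed, doubling_months):
    """Total improvement implied by a fixed doubling period."""
    return 2 ** (months_elapsed / doubling_months)

# Assumption: the 44x figure spans seven years (84 months).
MONTHS = 7 * 12

# Doubling period implied by a 44x reduction in training compute.
algo_doubling = doubling_time_months(44, MONTHS)

# Assumption: Moore's Law modeled as one doubling every 24 months.
moore_gain = improvement_from_doubling(MONTHS, 24)

print(f"implied algorithmic doubling time: {algo_doubling:.1f} months")
print(f"Moore's Law gain over the same window: {moore_gain:.1f}x")
```

Under these assumptions the implied doubling time comes out near 15.4 months and the Moore's Law gain near 11.3x, in line with the rounded "16 months" and "11x" figures quoted in the summary.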

Training Infrastructure · Evaluation and Benchmarking · AlexNet · Moore's Law · OpenAI

5arXiv · cs.LG·11d ago·source ↗

Local linear structures in LLM weights and activations are dynamic, not fixed global directions

A new arXiv paper investigates the nature of linear structures in transformer weights and activations, finding strong local low-rank task-gradient structure but rejecting the hypothesis that fixed task planes exist. The authors show that useful bases drift substantially within 100 optimization steps, yet early recovery updates form a trajectory-prefix basis capturing 77% of LoRA recovery displacement. They also establish a formal connection between parameter perturbations and activation steering, finding a 0.58 cosine similarity between gradient-step-induced activation shifts and CAA steering vectors, suggesting linear structures are evolving local geometries rather than stable global task directions.

Evaluation and Benchmarking · Alignment and RLHF · CAA · Qwen-0.5B · LoRA

5arXiv · cs.CL·3d ago·source ↗

Variable-Width Transformers: X-shaped architecture outperforms uniform-width baselines with 22% fewer FLOPs

Researchers propose the ><former (X-shaped transformer), a decoder-only architecture that uses wider early and late layers with narrower middle layers, implemented via a parameter-free residual resizing mechanism. Evaluated on models from 200M to 2B dense parameters and 3B MoE, the architecture consistently outperforms parameter-matched uniform-width baselines on language modeling loss. The design yields a 22% reduction in FLOPs and 15% reduction in KV cache memory under fitted scaling curves, suggesting nonuniform width allocation is a viable path to more compute-efficient language models.

Frontier Model Releases Inference Economics Q-Former Variable-Width Transformers

6arXiv · cs.LG·26d ago·source ↗

Training-Free Looped Transformers: Inference-Time Recurrence via ODE-Motivated Layer Reapplication

The paper introduces a method to retrofit recurrence onto frozen pretrained transformer checkpoints at inference time by looping a contiguous mid-stack block of layers without any fine-tuning or architectural changes. Naive block reapplication degrades performance, so the authors motivate their approach by treating pre-norm transformer blocks as forward Euler ODE steps and replacing one large update with smaller damped sub-steps. Evaluated across seven model families including dense, sparse MoE, and MLA+MoE architectures, the method yields consistent benchmark improvements (e.g., +2.64 pp on MMLU-Pro for Qwen3-4B-Instruct) at no training cost.
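The ODE framing can be illustrated with a toy example. The summary does not give the paper's actual damping schedule, so the sketch below only shows the general idea: a residual update x + f(x) read as one forward Euler step of size 1, compared with several smaller sub-steps that cover the same total step. `residual_branch`, `n_sub`, and `total_step` are hypothetical names, and the linear `residual_branch` is a stand-in for a real transformer block, not the authors' method.

```python
def residual_branch(x):
    # Hypothetical stand-in for the residual branch f(x) of a pre-norm block.
    return [-0.5 * v for v in x]

def euler_step(x, f, step=1.0):
    # One forward Euler update: x <- x + step * f(x).
    # With step=1.0 this is the ordinary residual connection x + f(x).
    return [v + step * d for v, d in zip(x, f(x))]

def damped_substeps(x, f, n_sub=4, total_step=1.0):
    # Replace one large update with n_sub smaller ones of size total_step / n_sub,
    # re-evaluating f at each intermediate state.
    h = total_step / n_sub
    for _ in range(n_sub):
        x = euler_step(x, f, h)
    return x

x0 = [1.0, -2.0]
one_big_step = euler_step(x0, residual_branch)
four_small_steps = damped_substeps(x0, residual_branch, n_sub=4)
print(one_big_step, four_small_steps)
```

In this toy setting the single step scales each coordinate by 0.5, while four sub-steps scale it by 0.875 to the fourth power, so the sub-stepped trajectory moves less abruptly per application of the block.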

Frontier Model Releases · Inference Economics · CommonsenseQA · OpenBookQA · Forward Euler ODE

6OpenAI Blog·1mo ago·source ↗

How AI Training Scales: Gradient Noise Scale Predicts Batch Parallelizability

OpenAI researchers report that the gradient noise scale, a statistic that measures how much the gradient varies across training examples relative to the size of the mean gradient, reliably predicts the optimal batch size and degree of parallelizability across a wide range of neural network training tasks. The finding suggests that more complex tasks with noisier gradients can benefit from increasingly large batch sizes, removing a potential ceiling on scaling. The work frames training dynamics as a systematic, measurable process rather than an empirical art.
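As a rough illustration of the metric, the "simple" noise scale is commonly written as B_simple = tr(Σ) / |G|², where G is the mean gradient and Σ is the covariance of per-example gradients. The sketch below estimates it from a small list of per-example gradient vectors. The function name and the toy inputs are hypothetical, and using the sample mean for |G|² is a naive plug-in estimate, not the estimator used in the original work.

```python
def simple_noise_scale(per_example_grads):
    """Naive plug-in estimate of tr(Sigma) / |G|^2 from per-example gradients."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])

    # Mean gradient G, one entry per parameter.
    mean = [sum(g[j] for g in per_example_grads) / n for j in range(dim)]

    # Trace of the covariance: sum of unbiased per-coordinate variances.
    trace_cov = sum(
        sum((g[j] - mean[j]) ** 2 for g in per_example_grads) / (n - 1)
        for j in range(dim)
    )

    # Squared norm of the mean gradient.
    grad_norm_sq = sum(m * m for m in mean)
    return trace_cov / grad_norm_sq

# Toy per-example gradients (hypothetical values).
noisy = simple_noise_scale([[1.0, 0.0], [3.0, 0.0]])
clean = simple_noise_scale([[2.0, 2.0], [2.0, 2.0]])
print(noisy, clean)
```

A larger value means per-example gradients disagree more relative to their mean, which is the regime where the summary says larger batches remain useful.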

Training Infrastructure · Frontier Model Releases · large-batch training · OpenAI · gradient noise scale