FineWeb

dataset · active · fineweb-dc119a26 · 2 events · first seen Jun 3, 2026

Aliases: FineWeb

Co-occurring entities

GPT-2 · Mobius Learning · Muon · q0: Primitives for Hyper-Epoch Pretraining

More like this (12)

FineWeb-Edu FineVideo FNet Mind2Web fine-tuning TabFM TabFM UST FinX FINO VideoFDB finetuning MiniF2F

Recent events (2)

5 · arXiv · cs.CL · Jul 21, 2026

Mobius Learning: cyclic depth folding enables depth-role superposition in Transformers

Researchers introduce Mobius Learning, a training architecture in which different data streams pass through a Transformer's blocks in cyclically shifted orders, so that each block group is optimized in both shallow and deep representational roles, a property they call depth-role superposition. Experiments with a modified GPT-2 small (124M parameters) trained on 2.5B FineWeb tokens show lower validation loss than a fixed-order looped Transformer at larger numbers of block-sequence passes. The architecture is also well suited to memory-constrained distributed training, because each worker stores only one block group rather than the full model stack.

Training Infrastructure FineWeb GPT-2 Mobius Learning
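
To make the "cyclically shifted block orders" in the Mobius Learning summary concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes one data stream per block group and a shift of one position per stream, neither of which the summary specifies, and the function names `cyclic_block_orders` and `depth_roles` are hypothetical.

```python
from typing import Dict, List, Set


def cyclic_block_orders(num_groups: int) -> List[List[int]]:
    """Return one block-group order per stream.

    Assumption: stream s uses the base order 0..num_groups-1 rotated by s.
    """
    return [
        [(shift + i) % num_groups for i in range(num_groups)]
        for shift in range(num_groups)
    ]


def depth_roles(orders: List[List[int]]) -> Dict[int, Set[int]]:
    """Map each block group to the set of depth positions it occupies."""
    roles: Dict[int, Set[int]] = {g: set() for g in range(len(orders[0]))}
    for order in orders:
        for depth, group in enumerate(order):
            roles[group].add(depth)
    return roles


if __name__ == "__main__":
    orders = cyclic_block_orders(4)
    for stream, order in enumerate(orders):
        print(f"stream {stream}: {order}")
    print(depth_roles(orders))
```

Under these assumptions every block group appears at every depth position across the streams, which is one way to read "optimized in both shallow and deep representational roles".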

6 · arXiv · cs.LG · Jun 3, 2026

q0: Hyper-Epoch Pretraining turns multi-epoch budgets into diverse model populations for better generalization

A new arXiv preprint introduces hyper-epoch pretraining (q0), a framework that reframes multi-epoch training as exploration of a model population rather than refinement of a single model. The approach uses three primitives (cyclic schedules with anti-correlated learning rate and weight decay, chain distillation, and a learned prior for inference-time weighting) to achieve lower validation loss than single-model training. On a 1.8B-parameter model trained on FineWeb, q0 matches a 256-epoch ensemble baseline using only ~56 epochs (~4.6× fewer), with cumulative ~12.9× data efficiency under the Slowrun setting. The work directly addresses the emerging regime where compute scales faster than high-quality data supply.

Training Infrastructure Open Weights Progress FineWeb q0: Primitives for Hyper-Epoch Pretraining
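
The q0 summary mentions cyclic schedules with anti-correlated learning rate and weight decay without giving their form. The sketch below shows one possible reading: a cosine cycle in which weight decay is at its minimum when the learning rate is at its maximum, and vice versa. The cosine shape, the function name `cyclic_lr_wd`, and all numeric ranges are assumptions for illustration, not values from the preprint.

```python
import math
from typing import Tuple


def cyclic_lr_wd(
    step: int,
    cycle_len: int,
    lr_min: float,
    lr_max: float,
    wd_min: float,
    wd_max: float,
) -> Tuple[float, float]:
    """Return (learning_rate, weight_decay) for a given step.

    Assumption: a cosine cycle; w is 1 at the start of each cycle and 0 at
    its midpoint. The learning rate follows w and weight decay follows
    1 - w, so the two move in opposite directions.
    """
    phase = (step % cycle_len) / cycle_len
    w = 0.5 * (1.0 + math.cos(2.0 * math.pi * phase))
    lr = lr_min + (lr_max - lr_min) * w
    wd = wd_min + (wd_max - wd_min) * (1.0 - w)
    return lr, wd


if __name__ == "__main__":
    # Hypothetical ranges chosen only to make the printout readable.
    for step in range(0, 200, 25):
        lr, wd = cyclic_lr_wd(step, 100, 1e-4, 1e-3, 0.01, 0.1)
        print(f"step {step:3d}  lr={lr:.6f}  wd={wd:.4f}")
```

The returned values could be written into an optimizer's parameter groups at each step. As a separate arithmetic check on the summary, 256 / 56 ≈ 4.57, which is consistent with the reported ~4.6× reduction in epochs.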