Ulysses Sequence Parallelism: Training with Million-Token Contexts
Hugging Face published a blog post on Ulysses sequence parallelism, a technique for distributing long-context training across multiple devices by partitioning the sequence dimension. The post covers how Ulysses enables training with million-token context windows by reducing per-device memory requirements. This is relevant to the ongoing challenge of scaling transformer training to very long sequences efficiently.
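To make the idea of partitioning the sequence dimension concrete, here is a minimal pure-Python sketch. It is not taken from the post: the function names are hypothetical, and the all-to-all step reflects how the Ulysses design is commonly described (each device starts with a slice of the tokens for all attention heads and ends with all tokens for a slice of the heads), simulated here with plain lists instead of real collectives.

```python
def shard_sequence(tokens, world_size):
    """Split a token list into contiguous, equal-sized chunks, one per device."""
    assert len(tokens) % world_size == 0, "sketch assumes an evenly divisible sequence"
    chunk = len(tokens) // world_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_to_all_seq_to_heads(shards, num_heads):
    """Simulate the all-to-all exchange performed before attention.

    Input layout:  each rank holds its local tokens for every head.
    Output layout: each rank holds every token for its local heads only.
    """
    world_size = len(shards)
    assert num_heads % world_size == 0, "sketch assumes heads divide evenly"
    heads_per_rank = num_heads // world_size
    out = []
    for r in range(world_size):
        local_heads = range(r * heads_per_rank, (r + 1) * heads_per_rank)
        # Rank r gathers all tokens, but only for its own slice of heads.
        out.append([(tok, h) for shard in shards for tok in shard for h in local_heads])
    return out


tokens = list(range(8))                      # toy sequence of 8 token ids
shards = shard_sequence(tokens, world_size=4)
per_rank = all_to_all_seq_to_heads(shards, num_heads=4)
```

In this toy layout each device stores only a quarter of the sequence outside attention, which is where the per-device memory reduction would come from. Inside attention it sees the full sequence but only one of the four heads.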
Related events (8)
Test-Time Training End-to-End (TTT-E2E) Retrains Model Weights to Handle Long Inputs
Researchers from Astera Institute, Nvidia, Stanford, UC Berkeley, and UC San Diego introduced TTT-E2E, a method that compresses long context into transformer weights by training the model during inference via meta-learning. The approach uses sliding-window attention restricted to 8,000 tokens and updates only the fully connected layers of the last quarter of the network on each 1,000-token chunk at inference time, keeping per-token generation latency roughly constant as context scales to 128,000 tokens. TTT-E2E slightly outperforms vanilla transformers on next-token prediction loss across long contexts and matches efficient architectures like Mamba 2 and Gated DeltaNet on inference speed, but fails dramatically on Needle-in-a-Haystack retrieval beyond 8,000 tokens and incurs substantially higher training latency. The work reframes long-context handling as a training-inference trade-off rather than an architectural design problem.
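The two numeric constraints in this summary (an 8,000-token attention window and one weight update per 1,000-token chunk) can be sketched as follows. This is only an illustration of the bookkeeping, not the authors' code: the helper names are hypothetical, and whether the window counts the current token is an assumption.

```python
WINDOW = 8_000   # sliding-window attention span, from the summary
CHUNK = 1_000    # tokens consumed per test-time weight update, from the summary


def attention_span(position, window=WINDOW):
    """Return the (start, end) token range a query at `position` can attend to.

    `end` is exclusive; the window is assumed to include the current token.
    """
    start = max(0, position - window + 1)
    return start, position + 1


def update_chunks(context_len, chunk=CHUNK):
    """Yield (start, end) bounds of each chunk that would trigger one weight update."""
    for start in range(0, context_len, chunk):
        yield start, min(start + chunk, context_len)


chunks = list(update_chunks(128_000))   # one entry per update over a 128,000-token context
```

Under these assumptions a 128,000-token context triggers 128 updates, and a token more than 8,000 positions back is no longer directly attendable, so it can influence generation only through the updated weights.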
Language Models Need Sleep: Periodic Context Consolidation via Fast Weights and SSM Blocks
This paper proposes a sleep-like consolidation mechanism for transformer-based LLMs to address the quadratic scaling of attention with context length. During 'sleep' phases, the model performs N offline recurrent passes over the accumulated context, updating fast weights in state-space model (SSM) blocks via a learned local rule, and then clears the KV cache. The approach is evaluated on synthetic tasks (cellular automata, multi-hop graph retrieval) and math reasoning, on which standard transformers and SSM-attention hybrids fail; performance scales with the sleep duration N.
Improving Hugging Face Training Efficiency Through Packing with Flash Attention 2
Hugging Face published a blog post describing a technique for improving training efficiency by packing multiple short sequences into a single batch using Flash Attention 2. The approach reduces padding waste and improves GPU utilization during LLM fine-tuning. This is a practical infrastructure optimization relevant to practitioners training models on datasets with variable-length sequences.
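A minimal sketch of what packing does to a batch, written in plain Python rather than with any library API. The function names are hypothetical. The outputs follow the usual convention for variable-length attention kernels, where per-sequence position ids and cumulative sequence lengths mark the boundaries so that tokens from different examples do not attend to each other.

```python
def pack_sequences(sequences):
    """Concatenate token lists into one row and record the sequence boundaries."""
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for seq in sequences:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))          # positions restart for each sequence
        cu_seqlens.append(cu_seqlens[-1] + len(seq))  # cumulative boundary offsets
    return input_ids, position_ids, cu_seqlens


def padding_waste(sequences):
    """Fraction of slots that would be padding if every sequence were padded to the longest."""
    longest = max(len(s) for s in sequences)
    total_slots = longest * len(sequences)
    return 1 - sum(len(s) for s in sequences) / total_slots


batch = [[11, 12, 13], [21], [31, 32]]   # three toy sequences of different lengths
input_ids, position_ids, cu_seqlens = pack_sequences(batch)
```

For this toy batch, padding to the longest sequence would leave a third of the slots empty, while the packed row contains no padding at all.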
The Reformer - Pushing the limits of language modeling
This Hugging Face blog post covers the Reformer, a memory-efficient transformer architecture that uses locality-sensitive hashing (LSH) attention and reversible residual layers to handle very long sequences. The post explains the technical mechanisms that allow Reformer to process sequences up to 1 million tokens with significantly reduced memory footprint compared to standard transformers. It serves as an educational deep-dive into the architectural innovations introduced in the original Reformer paper by Kitaev et al.
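The reversible residual layers mentioned here can be illustrated with a small sketch. It assumes the standard two-stream formulation, in which a block's inputs can be recomputed exactly from its outputs and so need not be stored for backpropagation. The functions `f` and `g` below are toy integer stand-ins for the attention and feed-forward sublayers.

```python
def reversible_forward(x1, x2, f, g):
    """Forward pass of one reversible block over two activation streams."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2


def reversible_inverse(y1, y2, f, g):
    """Recover the block's inputs from its outputs, without stored activations."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2


def f(v):
    return 3 * v + 1   # toy stand-in for the attention sublayer


def g(v):
    return v * v       # toy stand-in for the feed-forward sublayer


y1, y2 = reversible_forward(2, 5, f, g)
x1, x2 = reversible_inverse(y1, y2, f, g)
```

Because the inverse only re-evaluates `f` and `g`, activation memory no longer grows with the number of layers, at the cost of extra computation during the backward pass.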
Train and Fine-Tune Sentence Transformers Models
This Hugging Face blog post provides a technical guide on training and fine-tuning Sentence Transformers models for producing dense sentence embeddings. It covers dataset preparation, loss function selection, and training configuration using the sentence-transformers library. The post targets practitioners building semantic search, clustering, or similarity systems.
Train a Sentence Embedding Model with 1B Training Pairs
This Hugging Face blog post describes a methodology for training sentence embedding models using approximately 1 billion training pairs. The post covers data curation, model architecture choices, and training strategies for large-scale contrastive learning of sentence representations. It serves as a practical guide for practitioners building semantic search and similarity systems.
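As one common instance of contrastive learning on sentence pairs, the sketch below computes an in-batch-negatives cross-entropy loss in plain Python. The summary does not state which loss the post uses, so the choice of loss, the function names, and the `scale` value are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] matches anchors[i].

    Every other positive in the batch serves as a negative for anchor i.
    `scale` is an illustrative temperature factor, not a value from the post.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


aligned = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With correctly aligned pairs the loss is close to zero. With the pairs swapped it is large, which is the signal that pulls matching sentences together and pushes the rest of the batch apart.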
The Technology Behind BLOOM Training
This Hugging Face blog post details the infrastructure and training methodology used to train BLOOM, a 176-billion parameter open-access multilingual language model. It covers the use of Megatron-DeepSpeed for distributed training across hundreds of GPUs, including tensor parallelism, pipeline parallelism, and data parallelism strategies. The post also discusses hardware setup, memory optimization techniques, and lessons learned during the large-scale training run.
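The way tensor, pipeline, and data parallelism combine can be sketched as a mapping from a flat GPU rank to a coordinate in a three-dimensional grid. The group sizes and the rank ordering below are illustrative assumptions, not details taken from the summary, and real frameworks may order the dimensions differently.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a flat GPU rank to (dp_idx, pp_idx, tp_idx).

    Assumed ordering: tensor-parallel ranks are adjacent, then pipeline
    stages, then data-parallel replicas.
    """
    assert 0 <= rank < tp * pp * dp, "rank out of range for this grid"
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


TP, PP, DP = 4, 12, 8          # illustrative group sizes (assumption)
world_size = TP * PP * DP      # total GPUs is the product of the three degrees

first = rank_to_coords(0, TP, PP, DP)
last = rank_to_coords(world_size - 1, TP, PP, DP)
```

Each model replica spans TP × PP GPUs, and DP such replicas process different data shards in parallel.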
DeepSeek-V4: a million-token context that agents can actually use
A Hugging Face blog post discusses DeepSeek-V4 and presents its million-token context window as a practically usable capability for agentic applications. The post appears to analyze or announce the model's long-context features as they apply to agent workflows. No article body was available for deeper analysis.


