Almanac
5 · Hugging Face Blog · 1mo ago

Ulysses Sequence Parallelism: Training with Million-Token Contexts

Hugging Face published a blog post on Ulysses sequence parallelism, a technique for distributing long-context training across multiple devices by partitioning the sequence dimension. The post covers how Ulysses enables training with million-token context windows by reducing per-device memory requirements. The technique addresses the ongoing challenge of scaling transformer training efficiently to very long sequences.
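The summary above does not spell out the mechanism. As Ulysses-style sequence parallelism is generally described, each device holds a slice of the tokens for all attention heads, and an all-to-all exchange before attention gives each device every token for only a subset of heads. The following is a minimal sketch of that layout swap in plain Python, with no real communication; the function names and the tiny sizes are illustrative assumptions, not taken from the post.

```python
def shard_by_sequence(num_tokens, num_heads, world_size):
    """Each simulated device starts with a contiguous token slice and all heads."""
    per_dev = num_tokens // world_size
    return [
        [(t, h) for t in range(r * per_dev, (r + 1) * per_dev) for h in range(num_heads)]
        for r in range(world_size)
    ]


def all_to_all_seq_to_head(shards, num_heads, world_size):
    """Simulated all-to-all: device r ends up with every token, but only its own heads."""
    heads_per_dev = num_heads // world_size
    out = [[] for _ in range(world_size)]
    for src in shards:  # visit source devices in rank order so tokens stay ordered
        for (t, h) in src:
            out[h // heads_per_dev].append((t, h))
    return out


# Hypothetical sizes: 8 tokens, 4 heads, 2 devices.
seq_shards = shard_by_sequence(num_tokens=8, num_heads=4, world_size=2)
head_shards = all_to_all_seq_to_head(seq_shards, num_heads=4, world_size=2)
```

Each device holds the same number of (token, head) entries before and after the exchange, which is why the per-device activation footprint stays at roughly 1/P of the full sequence for P devices.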

Related events (8)

6 · The Batch · 19d ago

Test-Time Training End-to-End (TTT-E2E) Retrains Model Weights to Handle Long Inputs

Researchers from Astera Institute, Nvidia, Stanford, UC Berkeley, and UC San Diego introduced TTT-E2E, a method that compresses long context into transformer weights by training the model during inference via meta-learning. The approach uses sliding-window attention restricted to 8,000 tokens and updates only the fully connected layers of the last quarter of the network on each 1,000-token chunk at inference time, keeping per-token generation latency roughly constant as context scales to 128,000 tokens. TTT-E2E slightly outperforms vanilla transformers on next-token prediction loss across long contexts and matches efficient architectures like Mamba 2 and Gated DeltaNet on inference speed, but fails dramatically on Needle-in-a-Haystack retrieval beyond 8,000 tokens and incurs substantially higher training latency. The work reframes long-context handling as a training-inference trade-off rather than an architectural design problem.
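The figures in this summary (an 8,000-token attention window, 1,000-token update chunks, contexts up to 128,000 tokens) can be made concrete with a little arithmetic. The sketch below is not the authors' code; the helper names are hypothetical, and it assumes the window counts the current token and that chunks are consecutive and non-overlapping.

```python
def visible_range(pos, window=8_000):
    """Causal sliding window: token `pos` sees at most `window` tokens, itself included."""
    start = max(0, pos - window + 1)
    return start, pos  # inclusive bounds


def chunk_boundaries(context_len, chunk=1_000):
    """Split a context into consecutive chunks; one weight update would run per chunk."""
    return [(s, min(s + chunk, context_len)) for s in range(0, context_len, chunk)]


chunks = chunk_boundaries(128_000)
last_window = visible_range(127_999)
```

Under these assumptions a 128,000-token context triggers 128 weight updates, while the attention span of any single token never exceeds 8,000 positions. That is consistent with the roughly constant per-token latency the summary reports.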

6 · arXiv · cs.CL · 25d ago

Language Models Need Sleep: Periodic Context Consolidation via Fast Weights and SSM Blocks

This paper proposes a sleep-like consolidation mechanism for transformer-based LLMs to address the quadratic scaling of attention with context length. During 'sleep' phases, the model performs N offline recurrent passes over accumulated context, updating fast weights in state-space model (SSM) blocks via a learned local rule, then clears the KV cache. The approach is evaluated on synthetic tasks (cellular automata, multi-hop graph retrieval) and math reasoning, where standard transformers and SSM-attention hybrids fail, with performance scaling with sleep duration N.

4 · Hugging Face Blog · 1mo ago

Improving Hugging Face Training Efficiency Through Packing with Flash Attention 2

Hugging Face published a blog post describing a technique for improving training efficiency by packing multiple short sequences into a single batch using Flash Attention 2. The approach reduces padding waste and improves GPU utilization during LLM fine-tuning. This is a practical infrastructure optimization relevant to practitioners training models on datasets with variable-length sequences.
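A minimal sketch of what padding-free packing can look like, assuming the common layout in which sequences are concatenated into one flat token list, position IDs restart at each sequence boundary, and cumulative sequence lengths mark where each sequence begins and ends (the kind of boundary information variable-length attention kernels typically consume). The function names here are hypothetical, not the library's API.

```python
from itertools import accumulate


def pack(sequences):
    """Concatenate token lists without padding and record the sequence boundaries."""
    input_ids = [tok for seq in sequences for tok in seq]
    position_ids = [i for seq in sequences for i in range(len(seq))]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return input_ids, position_ids, cu_seqlens


def padded_size(sequences):
    """Token slots a conventional padded batch would occupy."""
    return len(sequences) * max(len(seq) for seq in sequences)


batch = [[11, 12, 13], [21], [31, 32]]  # made-up token IDs
input_ids, position_ids, cu_seqlens = pack(batch)
```

In this toy batch the packed form uses 6 token slots where a padded batch would use 9; the gap widens as sequence lengths become more uneven.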

4 · Hugging Face Blog · 1mo ago

The Reformer - Pushing the limits of language modeling

This Hugging Face blog post covers the Reformer, a memory-efficient transformer architecture that uses locality-sensitive hashing (LSH) attention and reversible residual layers to handle very long sequences. The post explains the technical mechanisms that allow Reformer to process sequences of up to 1 million tokens with a significantly reduced memory footprint compared to standard transformers. It serves as an educational deep-dive into the architectural innovations introduced in the original Reformer paper by Kitaev et al.
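To illustrate why reversible residual layers save memory, here is a minimal sketch of the usual two-stream formulation: the layer's inputs can be recomputed exactly from its outputs, so activations need not be stored for the backward pass. The toy integer functions `f` and `g` are stand-ins for the attention and feed-forward sublayers and are purely illustrative.

```python
def forward(x1, x2, f, g):
    """Reversible block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2


def inverse(y1, y2, f, g):
    """Recover the inputs from the outputs by undoing the two additions in reverse order."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2


f = lambda v: 3 * v + 1  # stand-in for the attention sublayer
g = lambda v: v * v - 2  # stand-in for the feed-forward sublayer

y1, y2 = forward(2, 5, f, g)
x1, x2 = inverse(y1, y2, f, g)
```

The inversion works for any `f` and `g`, because each is only ever evaluated on a value that is available from the outputs.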

3 · Hugging Face Blog · 1mo ago

Train and Fine-Tune Sentence Transformers Models

This Hugging Face blog post provides a technical guide on training and fine-tuning Sentence Transformers models for producing dense sentence embeddings. It covers dataset preparation, loss function selection, and training configuration using the sentence-transformers library. The post targets practitioners building semantic search, clustering, or similarity systems.

4 · Hugging Face Blog · 1mo ago

Train a Sentence Embedding Model with 1B Training Pairs

This Hugging Face blog post describes a methodology for training sentence embedding models using approximately 1 billion training pairs. The post covers data curation, model architecture choices, and training strategies for large-scale contrastive learning of sentence representations. It serves as a practical guide for practitioners building semantic search and similarity systems.
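The summary does not say which contrastive objective the post uses. One widely used choice for training on sentence pairs is cross-entropy over in-batch negatives, where each anchor's paired sentence is the positive and every other sentence in the batch serves as a negative. The sketch below assumes that objective; the function name, the `scale` parameter, and the similarity values are illustrative.

```python
import math


def in_batch_contrastive_loss(sim, scale=1.0):
    """sim[i][j] is the similarity of anchor i and candidate j; positives sit on the diagonal."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]  # cross-entropy with the diagonal as the target
    return total / len(sim)


aligned = [[0.9, 0.1], [0.2, 0.8]]     # positives score higher than negatives
misaligned = [[0.1, 0.9], [0.8, 0.2]]  # negatives score higher than positives
loss_aligned = in_batch_contrastive_loss(aligned, scale=20.0)
loss_misaligned = in_batch_contrastive_loss(misaligned, scale=20.0)
```

With this objective, larger batches supply more negatives per anchor at no extra labeling cost, which is one reason it suits very large pair collections.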

6 · Hugging Face Blog · 1mo ago

The Technology Behind BLOOM Training

This Hugging Face blog post details the infrastructure and training methodology used to train BLOOM, a 176-billion-parameter open-access multilingual language model. It covers the use of Megatron-DeepSpeed for distributed training across hundreds of GPUs, including tensor parallelism, pipeline parallelism, and data parallelism strategies. The post also discusses hardware setup, memory optimization techniques, and lessons learned during the large-scale training run.
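When tensor, pipeline, and data parallelism are combined, the GPU count is the product of the three degrees, and every GPU occupies one position along each axis. The sketch below shows one way to map a flat rank to such coordinates. The degrees used are small illustrative values, not BLOOM's actual configuration, and the ordering (tensor-parallel index varying fastest) is an assumption; real frameworks choose their own rank layout.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a flat GPU rank to (dp, pp, tp) coordinates, with the tp index varying fastest."""
    assert 0 <= rank < tp * pp * dp
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


TP, PP, DP = 2, 3, 4  # hypothetical degrees for illustration
world_size = TP * PP * DP
coords = [rank_to_coords(r, TP, PP, DP) for r in range(world_size)]
```

Placing tensor-parallel ranks next to each other is a common choice because tensor parallelism communicates most often and benefits from the fastest links.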

6 · Hugging Face Blog · 1mo ago

DeepSeek-V4: a million-token context that agents can actually use

A Hugging Face blog post discusses DeepSeek-V4, presenting its million-token context window as a capability that is practically usable in agentic applications. The post appears to either announce or analyze the model's long-context features as they apply to agent workflows. No article body was available for deeper analysis.