4 · Hugging Face Blog · 1mo ago

Fit More and Train Faster With ZeRO via DeepSpeed and FairScale

This Hugging Face blog post from January 2021 covers the integration of ZeRO (Zero Redundancy Optimizer) memory optimizations from DeepSpeed and FairScale into Hugging Face Transformers. ZeRO does not replicate the full training state on every GPU. Instead, it partitions optimizer states (stage 1), gradients (stage 2), and model parameters (stage 3) across GPUs, so the same hardware can train much larger models. The post serves as a practical guide for practitioners who want to scale model training without buying more hardware.
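To make the stage-by-stage savings concrete, the sketch below estimates per-GPU memory for model states under mixed-precision Adam. It uses the accounting from the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for optimizer state (fp32 master weights, momentum, variance). It deliberately ignores activations, temporary buffers, and fragmentation, so treat the numbers as a lower bound rather than a measurement. The function name `zero_memory_per_gpu` is hypothetical.

```python
# Rough per-GPU memory model for ZeRO stages (model states only, no activations).
# Assumes mixed-precision Adam: 2 B/param fp16 weights, 2 B/param fp16 grads,
# and k = 12 B/param optimizer state, as in the ZeRO paper's accounting.

def zero_memory_per_gpu(num_params: float, num_gpus: int, stage: int, k: int = 12) -> float:
    """Return approximate bytes of model state per GPU (stage 0 = plain data parallel)."""
    weights = 2 * num_params
    grads = 2 * num_params
    optim = k * num_params
    if stage == 0:  # everything replicated on every GPU
        return weights + grads + optim
    if stage == 1:  # partition optimizer states only
        return weights + grads + optim / num_gpus
    if stage == 2:  # also partition gradients
        return weights + (grads + optim) / num_gpus
    if stage == 3:  # also partition the parameters themselves
        return (weights + grads + optim) / num_gpus
    raise ValueError(f"unknown ZeRO stage: {stage}")


if __name__ == "__main__":
    # 7.5B-parameter model on 64 GPUs, the worked example from the ZeRO paper
    for s in range(4):
        gb = zero_memory_per_gpu(7.5e9, 64, s) / 1e9
        print(f"stage {s}: {gb:.1f} GB per GPU")
```

Running it prints roughly 120, 31.4, 16.6, and 1.9 GB for stages 0 through 3, which shows why stage 3 is the one that unlocks models far larger than a single GPU's memory.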

Related events (8)

4 · Hugging Face Blog · 1mo ago

Accelerate Large Model Training using DeepSpeed

This Hugging Face blog post explains how to use the Accelerate library in conjunction with DeepSpeed to train large language models more efficiently. It covers integration patterns, configuration options, and practical guidance for leveraging DeepSpeed's ZeRO optimization stages through the Accelerate abstraction layer. The post targets practitioners looking to scale model training without deep infrastructure expertise.
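A typical way to drive DeepSpeed's ZeRO stages from Accelerate is to hand it a DeepSpeed JSON config file (the `accelerate config` prompts offer this option). The minimal sketch below builds such a config as a Python dict. The keys (`train_batch_size`, `train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, `fp16`, `zero_optimization`, `offload_optimizer`) are standard DeepSpeed config fields, but the helper `make_deepspeed_config` and all values are hypothetical and for illustration only.

```python
import json


def make_deepspeed_config(micro_batch: int, grad_accum: int, world_size: int,
                          zero_stage: int = 2, offload_optimizer: bool = False) -> dict:
    """Build a minimal DeepSpeed config dict (hypothetical helper, illustrative values)."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    zero = {"stage": zero_stage}
    if offload_optimizer:
        # Keep optimizer states in CPU RAM instead of GPU memory (ZeRO-Offload)
        zero["offload_optimizer"] = {"device": "cpu"}
    return {
        # DeepSpeed expects: train_batch_size == micro_batch * grad_accum * world_size
        "train_batch_size": micro_batch * grad_accum * world_size,
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        "fp16": {"enabled": True},
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    cfg = make_deepspeed_config(micro_batch=4, grad_accum=8, world_size=2,
                                offload_optimizer=True)
    # Save this output as a .json file and point Accelerate's DeepSpeed setup at it
    print(json.dumps(cfg, indent=2))
```

Deriving `train_batch_size` from the other three values keeps the config internally consistent, which avoids a common source of DeepSpeed start-up errors when the GPU count changes.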

5 · Hugging Face Blog · 1mo ago

Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate

This Hugging Face blog post details inference optimization techniques for the 176B-parameter BLOOM model using DeepSpeed (DeepSpeed-Inference and ZeRO-Inference) and Hugging Face Accelerate. The post provides PyTorch scripts and benchmarks showing throughput gains from tensor parallelism and other optimizations. It serves as a practical guide for deploying large open-weight models efficiently across multiple GPUs.
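The core idea of tensor parallelism can be shown on a single matrix multiply: split the weight matrix's columns across devices, let each device compute its slice of the output, then gather the slices. The pure-Python sketch below simulates that with lists standing in for GPUs. It is only the arithmetic idea, not DeepSpeed-Inference's implementation, which relies on fused CUDA kernels and real inter-GPU communication; all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product y = x @ w."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, num_shards: int) -> List[Matrix]:
    """Split w column-wise into num_shards equal slices (one per simulated GPU)."""
    cols = len(w[0])
    if cols % num_shards:
        raise ValueError("number of columns must divide evenly across shards")
    step = cols // num_shards
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_shards)]


def column_parallel_matmul(x: Matrix, w: Matrix, num_shards: int) -> Matrix:
    """Each shard computes x @ w_shard; concatenating the slices mimics an all-gather."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(column_parallel_matmul(x, w, 2) == matmul(x, w))
```

Because each simulated device only stores half of `w`, per-device weight memory halves, while the gathered result is identical to the unsplit product.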

5 · Hugging Face Blog · 1mo ago

How Hugging Face Accelerate Runs Very Large Models Thanks to PyTorch

This Hugging Face blog post explains how the Accelerate library runs models that exceed single-GPU memory. It builds on PyTorch features such as the meta device, which lets a model be instantiated without allocating memory for its weights, and combines them with device maps, CPU/disk offloading, and sharded checkpoints. It describes how a model can be distributed transparently across multiple GPUs, CPU RAM, and disk storage. The post serves as both documentation and a technical explainer for practitioners working with large-scale inference and deployment.
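A device map is simply an assignment of each model submodule to a device. The sketch below shows one plausible greedy way to build one: fill the first GPU in layer order, spill to the next GPU, then to CPU RAM, and send whatever is left to disk. It is a simplified, hypothetical `greedy_device_map`, not Accelerate's own `infer_auto_device_map`, which also accounts for details such as tied weights and modules that must not be split; layer sizes and budgets here are in arbitrary units.

```python
from typing import Dict, List, Tuple


def greedy_device_map(layers: List[Tuple[str, int]],
                      budgets: List[Tuple[str, int]]) -> Dict[str, str]:
    """Assign layers, in order, to the first device with enough room; overflow goes to disk.

    layers:  (layer_name, size) pairs in execution order
    budgets: (device_name, capacity) pairs in preference order, e.g. GPUs then CPU
    """
    remaining = [[name, cap] for name, cap in budgets]
    device_map: Dict[str, str] = {}
    idx = 0
    for layer, size in layers:
        # Move to the next device once the current one can't fit this layer.
        # We never go back, so consecutive layers stay on the same device.
        while idx < len(remaining) and remaining[idx][1] < size:
            idx += 1
        if idx == len(remaining):
            device_map[layer] = "disk"  # no memory budget left anywhere
        else:
            remaining[idx][1] -= size
            device_map[layer] = remaining[idx][0]
    return device_map


if __name__ == "__main__":
    layers = [("embed", 4), ("block.0", 6), ("block.1", 6), ("block.2", 6), ("head", 4)]
    budgets = [("cuda:0", 10), ("cuda:1", 8), ("cpu", 6)]
    print(greedy_device_map(layers, budgets))
```

Keeping layers contiguous matters because activations then cross a device boundary only a few times per forward pass, instead of bouncing back and forth between devices.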

4 · Hugging Face Blog · 1mo ago

Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel

This Hugging Face blog post explains how to use PyTorch's Fully Sharded Data Parallel (FSDP) to train large models that exceed single-GPU memory limits. It covers the integration of FSDP with the Hugging Face Accelerate library, enabling distributed sharding of model parameters, gradients, and optimizer states across multiple GPUs. The post provides practical guidance on configuration and usage for scaling large model training.
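FSDP keeps only a 1/N slice of each wrapped unit's parameters on each of N ranks and reassembles the full parameters (an all-gather) just before they are needed in the forward or backward pass. The toy sketch below simulates that shard-then-gather round trip on a flattened parameter vector in pure Python, padding so the vector divides evenly across ranks. It is an illustration of the idea only: it does not use `torch.distributed`, and the helper names are hypothetical.

```python
from typing import List


def shard_flat_params(flat: List[float], world_size: int) -> List[List[float]]:
    """Pad a flattened parameter vector to a multiple of world_size, then give each rank one chunk."""
    pad = (-len(flat)) % world_size
    padded = flat + [0.0] * pad
    chunk = len(padded) // world_size
    return [padded[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Simulate an all-gather: concatenate every rank's shard and drop the padding."""
    full = [v for shard in shards for v in shard]
    return full[:numel]


if __name__ == "__main__":
    params = [float(i) for i in range(10)]  # pretend these are one layer's weights
    shards = shard_flat_params(params, world_size=4)
    print([len(s) for s in shards])  # each rank stores only its own chunk
    print(all_gather(shards, len(params)) == params)
```

The padding keeps every rank's shard the same size, which simplifies the collective operations, at the cost of a few wasted elements on the last rank.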

4 · Hugging Face Blog · 1mo ago

Accelerating Hugging Face Transformers with AWS Inferentia2

Hugging Face published a blog post detailing how to accelerate Transformer model inference using AWS Inferentia2, Amazon's second-generation ML inference chip. The post covers integration patterns between the Hugging Face ecosystem and the Neuron SDK for deploying models on Inferentia2 hardware. This represents a practical guide for enterprise and cloud-based inference deployment using dedicated AI accelerators.

4 · Hugging Face Blog · 1mo ago

How Hugging Face Sped Up Transformer Inference 100x for API Customers

Hugging Face describes engineering optimizations that achieved up to 100x speedups in transformer inference for customers of its hosted API. The post covers techniques for accelerating model serving at scale. Published in 2021, it documents early optimization work on Hugging Face's Inference API product.

4 · Hugging Face Blog · 1mo ago

Optimum + ONNX Runtime: Faster Training for Hugging Face Models

Hugging Face's Optimum library integrates with Microsoft's ONNX Runtime Training to accelerate fine-tuning of transformer models. The integration aims to reduce training time and memory usage with minimal code changes for practitioners using the Hugging Face ecosystem. This tooling update targets enterprise and research users looking to optimize training efficiency on existing hardware.

5 · Hugging Face Blog · 1mo ago

Introducing Optimum: The Optimization Toolkit for Transformers at Scale

Hugging Face announced Optimum, an optimization toolkit designed to accelerate Transformers models on various hardware backends. The toolkit aims to bridge the gap between Transformers model development and hardware-specific optimizations from partners. It provides a unified interface for quantization, pruning, and hardware-accelerated inference across different accelerators.
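Quantization is the most widely used of the techniques listed. The sketch below shows the basic arithmetic of symmetric per-tensor int8 quantization: pick a scale so the largest absolute value maps to 127, round each value to an integer, and multiply back to recover an approximation. This is a generic stdlib illustration, not Optimum's API or any particular backend's exact scheme; real backends add details such as per-channel scales, zero points, and calibration data.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor int8 quantization: scale = max|x| / 127, q = round(x / scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero for all-zero input
    # Clamp to [-127, 127] so the representation stays symmetric around zero
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantize_int8(weights)
    approx = dequantize_int8(q, scale)
    print(q, scale)
    print(max(abs(a - b) for a, b in zip(weights, approx)))  # at most scale / 2
```

Each weight now needs 1 byte instead of 4 (fp32), and the rounding error per value is bounded by half the scale, which is why int8 often preserves accuracy well for weights with a moderate dynamic range.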