
PyTorch
pytorch-c54a4cdc · 14 events · first seen 1mo ago
Aliases: PyTorch
Recent events (14)
OpenAI standardizes on PyTorch
OpenAI announced in January 2020 that it was standardizing on PyTorch as its deep learning framework. The move consolidates its teams on a single, widely adopted open-source library rather than a mix of frameworks. It also signals organizational alignment on tooling for research and development.
Hugging Face blog: Profiling PyTorch nn.Linear toward a fused MLP implementation
A Hugging Face blog post (Part 2 of a profiling series) walks through optimizing PyTorch's nn.Linear layers toward a fused MLP kernel. The post covers profiling methodology and kernel fusion techniques relevant to inference and training efficiency. This is a practical deep-dive into low-level PyTorch optimization for ML practitioners.
Visualize and Understand GPU Memory in PyTorch
A Hugging Face blog post explains how to visualize and analyze GPU memory usage during PyTorch model training. The post covers tools and techniques for understanding memory allocation patterns, helping practitioners diagnose and reduce memory bottlenecks. This is practical infrastructure knowledge relevant to training large models efficiently.
Accelerating PyTorch Transformers with Intel Sapphire Rapids - Part 2
This Hugging Face blog post covers inference optimization techniques for PyTorch Transformer models on Intel Sapphire Rapids (4th Gen Xeon) CPUs. It likely demonstrates performance gains using hardware-specific features such as AMX (Advanced Matrix Extensions) and BF16 support. The post is part of a series focused on making transformer inference more efficient on Intel server hardware without requiring GPU acceleration.
Accelerating PyTorch Transformers with Intel Sapphire Rapids - Part 1
This Hugging Face blog post covers accelerating PyTorch Transformer training on Intel's Sapphire Rapids Xeon processors, using distributed training across a cluster of CPU servers. It likely details how AVX-512 and the new AMX (Advanced Matrix Extensions) instructions in Sapphire Rapids can speed up transformer workloads without requiring GPU hardware. The post is part one of a series, suggesting a practical, tutorial-oriented treatment of CPU-based optimization.
Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel
This Hugging Face blog post explains how to use PyTorch's Fully Sharded Data Parallel (FSDP) to train large models that exceed single-GPU memory limits. It covers the integration of FSDP with the Hugging Face Accelerate library, enabling distributed sharding of model parameters, gradients, and optimizer states across multiple GPUs. The post provides practical guidance on configuration and usage for scaling large model training.
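To make the memory argument concrete, here is a back-of-the-envelope sketch of why sharding parameters, gradients, and optimizer states helps. It is not FSDP code and does not come from the post. The function name `per_gpu_state_gib` and the byte counts (fp32 parameters and gradients, two fp32 Adam states per parameter) are assumptions for illustration, and activations and temporarily gathered parameters are ignored.

```python
def per_gpu_state_gib(num_params, num_gpus,
                      bytes_per_param=4, bytes_per_grad=4, bytes_optimizer=8):
    """Rough per-GPU memory (GiB) for model state under full sharding.

    Assumes parameters, gradients, and optimizer states are split evenly
    across GPUs. Activations and communication buffers are NOT counted.
    """
    bytes_per_parameter = bytes_per_param + bytes_per_grad + bytes_optimizer
    total_bytes = num_params * bytes_per_parameter
    return total_bytes / num_gpus / 2**30


# Hypothetical 1-billion-parameter model.
unsharded = per_gpu_state_gib(1_000_000_000, num_gpus=1)
sharded = per_gpu_state_gib(1_000_000_000, num_gpus=8)
print(f"{unsharded:.1f} GiB on one GPU, {sharded:.1f} GiB per GPU across 8 GPUs")
```

Under these assumptions the per-GPU model state shrinks linearly with the number of GPUs, which is what lets a model that does not fit on one device be trained on several.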
torchtune: PyTorch Native Post-Training Library for LLMs
Meta's PyTorch team introduces torchtune, a PyTorch-native library for post-training LLMs that emphasizes modularity, hackability, and direct access to underlying PyTorch components. The library supports fine-tuning, experimentation, and deployment-oriented workflows across distributed training settings. Benchmarked against popular frameworks Axolotl and Unsloth, torchtune demonstrates competitive performance and memory efficiency while maintaining flexibility for research iteration. The paper presents design principles, model builders, training recipes, and distributed training stack details.
Safetensors is Joining the PyTorch Foundation
The safetensors format, developed by Hugging Face as a secure and fast alternative to pickle-based model serialization, is being adopted under the PyTorch Foundation. This move formalizes safetensors as part of the broader PyTorch ecosystem, signaling growing standardization around safe model weight storage. The transition reflects increasing industry concern about supply-chain security in ML model distribution.
How Hugging Face Accelerate Runs Very Large Models Thanks to PyTorch
This Hugging Face blog post explains the technical mechanisms behind the Accelerate library for running large models that exceed single-GPU memory, leveraging PyTorch features such as device maps, CPU/disk offloading, and sharded checkpoints. It describes how models can be distributed across multiple GPUs, CPU RAM, and disk storage transparently. The post serves as both documentation and a technical explainer for practitioners working with large-scale inference and deployment.
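The idea of spreading a model across GPUs, CPU RAM, and disk can be sketched as a greedy placement: walk the layers in order and put each one on the fastest device that still has room. This is a simplified illustration, not Accelerate's actual algorithm or API. The function `greedy_device_map`, the device labels, and the sizes are all hypothetical.

```python
def greedy_device_map(layer_sizes, budgets):
    """Assign layers, in model order, to the first device with room left.

    layer_sizes: dict of layer name -> size (insertion order = model order)
    budgets:     dict of device name -> capacity, listed fastest first
    """
    devices = list(budgets)
    remaining = dict(budgets)
    placement = {}
    current = 0
    for name, size in layer_sizes.items():
        # Fall through to slower devices until the layer fits. We never
        # move back to a faster device, so consecutive layers stay together.
        while current < len(devices) and remaining[devices[current]] < size:
            current += 1
        if current == len(devices):
            raise ValueError(f"no device has room for {name}")
        device = devices[current]
        placement[name] = device
        remaining[device] -= size
    return placement


# Hypothetical layer sizes and device budgets, in arbitrary units.
layers = {"embed": 4, "block.0": 6, "block.1": 6, "block.2": 6, "head": 4}
budgets = {"gpu:0": 10, "gpu:1": 8, "cpu": 12, "disk": 100}
device_map = greedy_device_map(layers, budgets)
print(device_map)
```

Here the first two layers land on the first GPU, the next on the second GPU, and the remainder spill over to CPU memory. Disk is used only once everything faster is full.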
Block Sparse Matrices for Smaller and Faster Language Models
This Hugging Face blog post introduces block sparse matrix techniques as a method to reduce the size and improve the inference speed of language models. Block sparsity enforces structured zero patterns in weight matrices, enabling hardware-friendly sparse operations compared to unstructured sparsity. The post likely covers implementation details and benchmarks showing efficiency gains for transformer-based models.
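As a rough illustration of why structured zero patterns are hardware-friendly, the sketch below stores only the nonzero blocks of a matrix and multiplies it by a vector. Each stored block is a small dense computation, and absent blocks are skipped entirely. This is plain Python for clarity, not the kernels from the post. The function name `block_sparse_matvec` and the storage layout are assumptions.

```python
def block_sparse_matvec(blocks, block_size, num_rows, x):
    """Multiply a block-sparse matrix by a vector x.

    blocks: dict mapping (block_row, block_col) -> dense block (list of rows).
    Blocks missing from the dict are all zeros and cost no work.
    """
    y = [0.0] * num_rows
    for (block_row, block_col), block in blocks.items():
        row0 = block_row * block_size
        col0 = block_col * block_size
        for i in range(block_size):
            acc = 0.0
            for j in range(block_size):
                acc += block[i][j] * x[col0 + j]
            y[row0 + i] += acc
    return y


# A 4x4 matrix made of 2x2 blocks; only 2 of the 4 blocks are stored.
blocks = {
    (0, 0): [[1.0, 2.0], [3.0, 4.0]],
    (1, 1): [[5.0, 6.0], [7.0, 8.0]],
}
x = [1.0, 1.0, 1.0, 1.0]
y = block_sparse_matvec(blocks, block_size=2, num_rows=4, x=x)
print(y)
```

With half of the blocks absent, the loop does half the multiply-adds of a dense product, and each remaining block is still a contiguous dense tile.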
Accelerate 1.0.0 Released
Hugging Face has released Accelerate 1.0.0, marking the library's first stable major version. Accelerate is a widely-used PyTorch training library that abstracts distributed training across hardware configurations including multi-GPU, TPU, and mixed-precision setups. The 1.0.0 milestone signals API stability and production readiness for the training infrastructure ecosystem.

nanoVLM: Minimal Pure-PyTorch Repository for Training Vision-Language Models
Hugging Face published nanoVLM, a minimal open-source repository designed to make training vision-language models (VLMs) as simple as possible using pure PyTorch. The project aims to lower the barrier to entry for VLM research and experimentation by providing a clean, readable codebase without heavy abstractions. It follows in the tradition of educational ML repositories like nanoGPT, targeting researchers and practitioners who want to understand or customize VLM training from scratch.
Quanto: a PyTorch quantization backend for Optimum
Hugging Face introduced Quanto, a new PyTorch-based quantization backend integrated into the Optimum library. Quanto supports multiple quantization schemes and data types, targeting efficient inference for large language models and other neural networks. The tool is designed to work across hardware backends and integrates with the Hugging Face ecosystem.
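For readers unfamiliar with what a quantization scheme does, here is a minimal sketch of symmetric per-tensor int8 quantization, one of the simplest schemes. It does not use Quanto's API and is not taken from the post. The helper names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [qi * scale for qi in q]


# Hypothetical weights for illustration.
weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte plus a shared scale, and the reconstruction error per value is bounded by half the scale. Real backends add refinements such as per-channel scales and lower-bit data types.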
Introducing 🤗 Accelerate
Hugging Face introduced Accelerate, a library designed to simplify distributed training of PyTorch models across multiple GPUs and TPUs with minimal code changes. The library abstracts away the complexity of multi-device training setups, allowing researchers to scale training with a few lines of code. This was a notable contribution to the ML training infrastructure ecosystem at the time of release.