Almanac
technique

tensor parallelism

techniqueactivetensor-parallelism-5ff637f0·3 events·first seen 29d ago

Aliases: tensor parallelism

Co-occurring entities

More like this (12)

Recent events (3)

5Hugging Face Blog·28d ago·source ↗

Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training

Hugging Face published a guide on N-dimensional parallelism for multi-GPU training using the Accelerate library. The post covers combining data parallelism, tensor parallelism, pipeline parallelism, and other strategies to efficiently scale model training across GPU clusters. This is a practical technical resource aimed at practitioners working with large-scale distributed training setups.

5Hugging Face Blog·28d ago·source ↗

Introducing the Hugging Face LLM Inference Container for Amazon SageMaker

Hugging Face and Amazon Web Services have launched a dedicated LLM inference container for Amazon SageMaker, enabling optimized deployment of large language models on managed cloud infrastructure. The container is built on Hugging Face's Text Generation Inference (TGI) toolkit, which supports features like continuous batching, tensor parallelism, and quantization. This integration lowers the barrier for enterprise teams to deploy open-weight LLMs at scale on AWS without managing custom serving infrastructure.

6arXiv · cs.LG·29d ago·source ↗

RRFP: A Readiness-Driven Runtime for Pipeline-Parallel Training Under Runtime Variability

The paper introduces Runtime-Readiness-First Pipeline (RRFP), a new runtime for pipeline-parallel large-model training that treats schedules as non-binding hint orders rather than strict execution sequences. By combining message-driven asynchronous communication, lightweight tensor-parallel coordination, and ready-set arbitration, RRFP dynamically dispatches work based on actual task readiness, reducing idle bubbles and stage misalignment. Implemented on a Megatron-based framework and evaluated at up to 128 GPUs, RRFP achieves up to 1.77× speedup on language-only workloads and 2.77× on multimodal workloads versus fixed-order baselines, and outperforms the fastest comparable external system by up to 1.84×.