6 · Google DeepMind Blog · 1mo ago

Decoupled DiLoCo: A new frontier for resilient, distributed AI training

DeepMind has published a blog post introducing Decoupled DiLoCo, a new approach to distributed AI training designed for resilience across heterogeneous or unreliable compute environments. The method appears to extend the original DiLoCo (Distributed Low-Communication) training framework, which enables training across loosely connected compute nodes with infrequent synchronization. The announcement signals continued investment in infrastructure techniques that reduce communication overhead and improve fault tolerance in large-scale model training.
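As a rough illustration of the low-communication pattern described above (many local steps per worker, then one infrequent synchronization), here is a toy Python sketch. It is not the method from the blog post and does not attempt the "decoupled" variant. The scalar model, the plain-SGD inner and outer steps, and every function name are assumptions made for illustration. The original DiLoCo paper uses AdamW for the inner steps and Nesterov momentum for the outer step.

```python
import random

def local_steps(theta, data, inner_lr, num_steps):
    """Run num_steps of plain SGD on the toy loss 0.5 * (theta - x)^2."""
    for _ in range(num_steps):
        x = random.choice(data)      # sample one point from this worker's shard
        grad = theta - x             # gradient of the toy loss
        theta -= inner_lr * grad
    return theta

def diloco_round(global_theta, shards, inner_lr=0.1, num_steps=20, outer_lr=1.0):
    """One outer round: each worker trains locally, then a single sync."""
    local_thetas = [
        local_steps(global_theta, shard, inner_lr, num_steps) for shard in shards
    ]
    # Outer "pseudo-gradient": average displacement of the workers from the
    # shared starting point. This is the only value communicated per round.
    outer_grad = sum(global_theta - t for t in local_thetas) / len(local_thetas)
    # Plain SGD outer step (a simplification); with outer_lr=1.0 this is
    # the same as averaging the workers' parameters.
    return global_theta - outer_lr * outer_grad

def train(shards, rounds=30, seed=0):
    """Run several outer rounds starting from theta = 0."""
    random.seed(seed)
    theta = 0.0
    for _ in range(rounds):
        theta = diloco_round(theta, shards)
    return theta

if __name__ == "__main__":
    # Two hypothetical workers whose shards pull toward 1.0 and 3.0.
    print(train([[1.0, 1.0], [3.0, 3.0]]))
```

With `num_steps=20`, each worker communicates once per 20 gradient steps rather than after every step, which is the trade-off that makes this style of training tolerant of slow or unreliable links.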

Training Infrastructure · Inference Economics · DiLoCo · Decoupled DiLoCo · Google DeepMind

Related guides (3)

Google DeepMind

Google DeepMind: The Lab Behind Gemini, AlphaFold, and Frontier AI

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

4 · Hugging Face Blog · 1mo ago

Deep Learning over the Internet: Training Language Models Collaboratively

This Hugging Face blog post describes a framework for training large language models collaboratively across volunteer compute contributed over the internet. The approach addresses the challenge of enabling distributed participants with heterogeneous hardware to jointly train models without centralized infrastructure. It represents an early exploration of decentralized training as an alternative to large-scale private compute clusters.

Training Infrastructure · Open Weights Progress · collaborative distributed training · Hugging Face · volunteer compute

5 · arXiv cs.CL · 11d ago

AdvGRPO: Stable co-training framework for adaptive red teaming of language models

Researchers introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization in LLM red teaming, addressing previously reported instability. The method uses dense multi-channel rewards and decoupled advantage normalization, with a curriculum progressing from single-turn to multi-turn attacks before bootstrapping co-training. Co-trained defenders outperform baselines on safety benchmarks, and the attacks show transferability across models.
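For readers unfamiliar with GRPO, the sketch below shows the standard group-relative advantage: each sampled response's reward is normalized by the mean and standard deviation of its group. The summary does not spell out what "decoupled advantage normalization" means in AdvGRPO, so `combined_advantages` is only one plausible reading (normalize each reward channel separately, then sum) and should be treated as an assumption, not the paper's method. The function names, channel names, and the use of a population standard deviation are likewise illustrative choices.

```python
from statistics import mean, pstdev

def group_normalize(rewards, eps=1e-8):
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]

def combined_advantages(channel_rewards, eps=1e-8):
    """ASSUMPTION: normalize each reward channel on its own, then sum.

    channel_rewards maps a channel name to a list of rewards, one per
    sampled response in the group. All lists must have the same length.
    """
    normalized = [group_normalize(r, eps) for r in channel_rewards.values()]
    # Sum the per-channel advantages response by response.
    return [sum(vals) for vals in zip(*normalized)]

if __name__ == "__main__":
    # Hypothetical channels on very different scales.
    rewards = {"safety": [1.0, 3.0], "fluency": [10.0, 30.0]}
    print(combined_advantages(rewards))
```

Normalizing per channel keeps a channel with a large numeric range from dominating the combined signal, which is one reason a method might separate the normalization from the reward aggregation.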

AI Safety Research · Alignment and RLHF · AdvGRPO · GRPO · PPO

6 · arXiv cs.AI · 10d ago

Piper: Programmable distributed training system decoupling parallelism strategy from runtime

Researchers present Piper, a distributed training system that separates parallelism strategy specification from low-level runtime execution via an intermediate representation (IR) — a unified global training DAG. Users declare strategies through model annotations and scheduling directives, which Piper compiles into per-device execution plans. The system matches performance on standard strategies like ZeRO while enabling additional gains through joint compute-communication scheduling in composed strategies such as DeepSeek-V3's DualPipe.

Training Infrastructure · Frontier Model Releases · DeepSeek V4 · Piper · DualPipe

6 · OpenAI Blog · 1mo ago

OpenAI Introduces MRC (Multipath Reliable Connection) Networking Protocol for AI Training Clusters

OpenAI has developed and released MRC (Multipath Reliable Connection), a new supercomputer networking protocol designed to improve resilience and performance in large-scale AI training clusters. The protocol is being released through the Open Compute Project (OCP), making it available to the broader industry. MRC addresses reliability and throughput challenges in the high-bandwidth, low-latency interconnects required for frontier model training at scale.

Training Infrastructure · Inference Economics · Open Compute Project · OpenAI · MRC (Multipath Reliable Connection)

5 · arXiv cs.CL · 9d ago

AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning

Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with attention-guided denoising order and token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and critical for reasoning. Experiments on math and coding benchmarks show AGDO outperforms existing post-training methods for dLLMs, advancing the case for attention-aware training in parallel-decoding language models.

Alignment and RLHF · AGDO · Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

4 · arXiv cs.CL · 5d ago

MoDiCoL: A modular continual learning dataset for diagnosing ASR robustness under distribution shift

Researchers introduce MoDiCoL, a benchmark dataset designed to evaluate automatic speech recognition robustness under co-occurring real-world distribution shifts including accents, recording conditions, speech impairments, and noise. Unlike existing benchmarks that isolate these factors, MoDiCoL enables controlled analysis across linguistic, speaker, and acoustic dimensions simultaneously. The paper also proposes a continual learning curriculum simulating incremental updates and evaluates three continual learning strategies for robustness acquisition and forgetting.

Evaluation and Benchmarking · MoDiCoL

4 · OpenAI Blog · 1mo ago

Techniques for Training Large Neural Networks

OpenAI published a technical overview of the engineering and research challenges involved in training large neural networks across GPU clusters. The post covers the distributed computing and synchronization techniques required to orchestrate large-scale training runs, and serves as a reference for the infrastructure and methods underpinning frontier model development.

Training Infrastructure · large neural network training · GPU cluster · OpenAI

3 · Hugging Face Blog · 1mo ago

From PyTorch DDP to Accelerate to Trainer: Mastery of Distributed Training with Ease

This Hugging Face blog post walks through the progression from raw PyTorch DistributedDataParallel (DDP) to the Accelerate library to the Transformers Trainer API for distributed training. It explains the abstractions each layer provides and how they reduce boilerplate while maintaining flexibility. The post serves as a practical guide for ML practitioners scaling training across multiple GPUs or nodes.

Training Infrastructure · PyTorch DDP · Hugging Face Transformers · Hugging Face