Topic

Training Infrastructure

active · training-infrastructure · 282 events · last 45h ago

Compute, chips, training runs, distributed training systems, data center buildouts, and the hardware/systems side of pre-training.

Guides (1)

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Recent events (50)

9 · Anthropic News · 1mo ago

Anthropic and Amazon Expand Collaboration for Up to 5 Gigawatts of New Compute

Anthropic has signed a major expanded agreement with Amazon committing over $100 billion to AWS technologies over ten years, securing up to 5GW of compute capacity for training and deploying Claude across Trainium2 through Trainium4 chips. Amazon is investing an additional $5 billion in Anthropic today, with up to $20 billion more possible in the future, building on $8 billion previously invested. The deal includes nearly 1GW of Trainium2 and Trainium3 capacity coming online by end of 2026, expanded inference in Asia and Europe, and the full Claude Platform becoming available directly within AWS. Anthropic disclosed its run-rate revenue has surpassed $30 billion, up from approximately $9 billion at end of 2025.

Training Infrastructure Frontier Model Releases Dario Amodei Claude Platform Amazon Bedrock +9 more

8 · Anthropic News · 1mo ago

Anthropic Expands Partnership with Google and Broadcom for Multi-Gigawatt TPU Compute Capacity

Anthropic has signed a new agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity expected to come online starting in 2027, representing the company's largest compute commitment to date. The announcement coincides with Anthropic reporting run-rate revenue surpassing $30 billion, up from ~$9 billion at end of 2025, and the number of enterprise customers spending over $1M annually doubling to 1,000+ in under two months. The compute will be predominantly US-sited, extending Anthropic's November 2025 $50B American infrastructure commitment. Anthropic continues to operate across AWS Trainium, Google TPUs, and NVIDIA GPUs, with Amazon remaining its primary cloud and training partner.

Training Infrastructure Frontier Model Releases Google TPU Broadcom Claude +10 more

6 · OpenAI Blog · 1mo ago

Building the compute infrastructure for the Intelligence Age

OpenAI is scaling its Stargate initiative to expand compute infrastructure aimed at supporting AGI development. The announcement describes new data center capacity additions to meet growing AI demand. This represents a continuation of OpenAI's large-scale infrastructure buildout strategy under the Stargate program.

Training Infrastructure Inference Economics Stargate OpenAI +1 more

8 · Anthropic News · 1mo ago

Anthropic Announces SpaceX Colossus Compute Deal and Higher Claude Usage Limits

Anthropic has signed an agreement with SpaceX to access the full compute capacity of the Colossus 1 data center, gaining over 300 megawatts and 220,000+ NVIDIA GPUs within a month. This deal, combined with prior agreements with Amazon, Google/Broadcom, Microsoft/NVIDIA, and Fluidstack, enables Anthropic to double Claude Code rate limits, remove peak-hour restrictions for Pro/Max users, and raise API rate limits for Claude Opus models. The announcement also notes interest in developing orbital AI compute capacity with SpaceX, and outlines international infrastructure expansion for enterprise compliance needs.

Training Infrastructure Frontier Model Releases Google TPU Claude Opus 4.6 Microsoft +12 more

6 · arXiv · cs.LG · 1mo ago

RRFP: A Readiness-Driven Runtime for Pipeline-Parallel Training Under Runtime Variability

The paper introduces Runtime-Readiness-First Pipeline (RRFP), a new runtime for pipeline-parallel large-model training that treats schedules as non-binding hint orders rather than strict execution sequences. By combining message-driven asynchronous communication, lightweight tensor-parallel coordination, and ready-set arbitration, RRFP dynamically dispatches work based on actual task readiness, reducing idle bubbles and stage misalignment. Implemented on a Megatron-based framework and evaluated at up to 128 GPUs, RRFP achieves up to 1.77× speedup on language-only workloads and 2.77× on multimodal workloads versus fixed-order baselines, and outperforms the fastest comparable external system by up to 1.84×.

Training Infrastructure Inference Economics tensor parallelism pipeline parallelism BFW schedule hint +2 more

6 · Hugging Face Blog · 1mo ago

The Technology Behind BLOOM Training

This Hugging Face blog post details the infrastructure and training methodology used to train BLOOM, a 176-billion parameter open-access multilingual language model. It covers the use of Megatron-DeepSpeed for distributed training across hundreds of GPUs, including tensor parallelism, pipeline parallelism, and data parallelism strategies. The post also discusses hardware setup, memory optimization techniques, and lessons learned during the large-scale training run.

Training Infrastructure Open Weights Progress BLOOM DeepSpeed Hugging Face +2 more

8 · OpenAI Blog · 1mo ago

AWS and OpenAI Announce $38B Multi-Year Strategic Partnership

OpenAI and Amazon Web Services have announced a multi-year strategic partnership valued at $38 billion. AWS will supply infrastructure and compute capacity to support OpenAI's next-generation model training and deployment workloads. The deal represents a major cloud infrastructure commitment for OpenAI alongside its existing Microsoft Azure relationship.

Training Infrastructure Frontier Model Releases Microsoft Azure OpenAI Amazon Web Services +2 more

8 · OpenAI Blog · 1mo ago

OpenAI and Broadcom Announce Strategic Collaboration to Deploy 10 GW of OpenAI-Designed AI Accelerators

OpenAI and Broadcom have announced a multi-year strategic partnership targeting deployment of 10 gigawatts of OpenAI-designed AI accelerators by 2029. The collaboration involves co-developing next-generation AI accelerator systems and Ethernet networking solutions aimed at scalable, energy-efficient AI infrastructure. This represents OpenAI's continued push into custom silicon, reducing dependence on third-party chip suppliers like NVIDIA.

Training Infrastructure Inference Economics Broadcom NVIDIA OpenAI +2 more

8 · OpenAI Blog · 1mo ago

AMD and OpenAI Announce Strategic Partnership to Deploy 6 Gigawatts of AMD GPUs

AMD and OpenAI have entered a multi-year strategic partnership to deploy 6 gigawatts of AMD Instinct GPUs for OpenAI's AI infrastructure, with 1 gigawatt planned for 2026. The deal represents a significant diversification of OpenAI's compute supply beyond its existing NVIDIA dependency. This is one of the largest publicly announced GPU deployment commitments in the industry.

Training Infrastructure Frontier Model Releases AMD Instinct NVIDIA OpenAI +2 more

7 · OpenAI Blog · 1mo ago

OpenAI, Oracle, and SoftBank expand Stargate with five new AI datacenter sites

OpenAI, Oracle, and SoftBank have announced five additional datacenter sites under the Stargate initiative, a $500 billion U.S. AI infrastructure program targeting 10 gigawatts of compute capacity. The expansion accelerates the buildout of physical infrastructure intended to support next-generation AI workloads. The announcement emphasizes domestic job creation alongside the technical capacity additions.

Training Infrastructure Inference Economics Stargate Oracle OpenAI +2 more

8 · OpenAI Blog · 1mo ago

OpenAI and NVIDIA Announce Strategic Partnership to Deploy 10 Gigawatts of AI Datacenters

OpenAI and NVIDIA have announced a strategic partnership targeting deployment of 10 gigawatts of AI datacenter capacity powered by NVIDIA systems. The first phase of the buildout is scheduled to launch in 2026. This represents a major infrastructure commitment between two of the most prominent organizations in AI compute and model development.

Training Infrastructure Frontier Model Releases NVIDIA OpenAI +1 more

7 · OpenAI Blog · 1mo ago

OpenAI Launches Stargate Infrastructure Partner Outreach

OpenAI is soliciting partnerships with firms across the data center infrastructure supply chain—covering power, land, construction, and equipment—under the Stargate initiative. This represents OpenAI's formal outreach to the industrial base to build out large-scale AGI infrastructure. The announcement signals the operational phase of the previously announced Stargate project, which involves major capital commitments for AI compute infrastructure.

Training Infrastructure Inference Economics Stargate OpenAI +1 more

9 · OpenAI Blog · 1mo ago

Announcing The Stargate Project

OpenAI has announced the Stargate Project, a major AI infrastructure initiative. The project represents a large-scale investment in AI compute and data center infrastructure in the United States. Based on prior reporting, Stargate involves a joint venture with SoftBank and other partners targeting up to $500 billion in AI infrastructure investment over four years. This is one of the largest announced AI infrastructure commitments in history.

Training Infrastructure Frontier Model Releases Masayoshi Son Microsoft Oracle +6 more

6 · OpenAI Blog · 1mo ago

Scaling Kubernetes to 7,500 Nodes

OpenAI describes scaling Kubernetes clusters to 7,500 nodes to support large-scale AI training workloads including GPT-3, CLIP, and DALL·E. The post details infrastructure challenges and solutions enabling both massive model training and rapid small-scale research iteration. This represents a significant engineering milestone in ML training infrastructure at the time of publication (January 2021).

Training Infrastructure Frontier Model Releases GPT-3 Kubernetes DALL·E 3 +3 more

7 · OpenAI Blog · 19d ago

OpenAI Breaks Ground on 1GW Stargate Data Center in Michigan

OpenAI has broken ground on a 1-gigawatt data center in Michigan as part of its Stargate infrastructure initiative. The project is framed around expanding AI access, job creation, and community support. This represents a major physical infrastructure commitment by OpenAI to domestic AI compute capacity.

Training Infrastructure Inference Economics Stargate Michigan Data Center OpenAI

8 · Anthropic News · 19d ago

Anthropic Commits $50 Billion to U.S. AI Computing Infrastructure with Fluidstack

Anthropic is investing $50 billion in American AI computing infrastructure, partnering with Fluidstack to build custom data centers in Texas and New York, with additional sites planned. The facilities are purpose-built for Anthropic's workloads and are expected to come online throughout 2026, creating roughly 800 permanent and 2,400 construction jobs. The announcement aligns with the Trump administration's AI Action Plan and is framed as supporting domestic AI leadership. Anthropic cites growing enterprise demand—over 300,000 business customers and a sevenfold increase in large accounts over the past year—as driving the scale of investment.

Training Infrastructure Frontier Model Releases Dario Amodei Claude Fluidstack +6 more

8 · Anthropic News · 19d ago

Anthropic Expands Google Cloud TPU Usage to Up to One Million TPUs in Tens-of-Billions Deal

Anthropic announced a major expansion of its Google Cloud infrastructure, planning to use up to one million TPUs in a deal worth tens of billions of dollars, with over a gigawatt of capacity expected online in 2026. The expansion is driven by rapidly growing enterprise demand—Anthropic now serves over 300,000 business customers with large accounts growing nearly 7x year-over-year. Anthropic maintains a diversified compute strategy across Google TPUs, Amazon Trainium, and NVIDIA GPUs, while reaffirming its primary training partnership with Amazon via Project Rainier. The company also notes the expanded compute will support alignment research and responsible deployment at scale.

Training Infrastructure Frontier Model Releases Google Cloud Amazon Trainium2 Claude +11 more

6 · Google DeepMind Blog · 1mo ago

Decoupled DiLoCo: A new frontier for resilient, distributed AI training

DeepMind has published a blog post introducing Decoupled DiLoCo, a new approach to distributed AI training designed for resilience across heterogeneous or unreliable compute environments. The method appears to extend the original DiLoCo (Distributed Low-Communication) training framework, which enables training across loosely connected compute nodes with infrequent synchronization. The announcement signals continued investment in infrastructure techniques that reduce communication overhead and improve fault tolerance in large-scale model training.

Training Infrastructure Inference Economics DiLoCo Decoupled DiLoCo Google DeepMind

6 · OpenAI Blog · 1mo ago

OpenAI Introduces MRC (Multipath Reliable Connection) Networking Protocol for AI Training Clusters

OpenAI has developed and released MRC (Multipath Reliable Connection), a new supercomputer networking protocol designed to improve resilience and performance in large-scale AI training clusters. The protocol is being released through the Open Compute Project (OCP), making it available to the broader industry. MRC addresses reliability and throughput challenges in the high-bandwidth, low-latency interconnects required for frontier model training at scale.

Training Infrastructure Inference Economics Open Compute Project OpenAI MRC (Multipath Reliable Connection)

6 · arXiv · cs.AI · 10d ago

Piper: Programmable distributed training system decoupling parallelism strategy from runtime

Researchers present Piper, a distributed training system that separates parallelism strategy specification from low-level runtime execution via an intermediate representation (IR) — a unified global training DAG. Users declare strategies through model annotations and scheduling directives, which Piper compiles into per-device execution plans. The system matches performance on standard strategies like ZeRO while enabling additional gains through joint compute-communication scheduling in composed strategies such as DeepSeek-V3's DualPipe.

Training Infrastructure Frontier Model Releases DeepSeek V4 Piper DualPipe +1 more

7 · Meta AI Blog · 1mo ago

Meta Announces Four MTIA AI Chip Generations in Two Years: MTIA 300–500 Roadmap

Meta has detailed a rapid four-generation MTIA chip roadmap (300, 400, 450, 500) developed in partnership with Broadcom, spanning ranking/recommendation inference and training through general GenAI workloads. Key advances include a 4.5x HBM bandwidth increase and 25x compute FLOPS improvement from MTIA 300 to 500, with MTIA 450 and 500 targeting GenAI inference with doubled and further-increased HBM bandwidth versus leading commercial products. MTIA 300 is in production for R&R training, MTIA 400 is lab-tested and entering deployment, while MTIA 450 and 500 are scheduled for mass deployment in early 2027 and 2027 respectively. The strategy emphasizes modular chiplet design and short iteration cycles to keep hardware aligned with rapidly evolving AI model requirements.

Training Infrastructure Frontier Model Releases RISC-V Broadcom HBM (High-Bandwidth Memory) +8 more

7 · Latent Space · 1mo ago

Anthropic-SpaceX AI's 300MW/$5B/yr Colossus I Deal; ARR Growth 8000% Annualized

Latent Space AINews reports that Anthropic has struck a major infrastructure deal with SpaceX AI involving 300MW of compute capacity at the Colossus I data center for approximately $5B per year. The report also highlights Anthropic's annualized ARR growth of 8000%, signaling rapid commercial scaling. This represents a significant strategic alignment between Anthropic and xAI/SpaceX infrastructure assets.

Training Infrastructure Frontier Model Releases Colossus 1 xAI SpaceX AI +4 more

6 · Hugging Face Blog · 1mo ago

Hugging Face and NVIDIA Launch Training Cluster as a Service

Hugging Face and NVIDIA are announcing a joint 'Training Cluster as a Service' offering, providing managed GPU cluster access for AI model training. The collaboration aims to lower the barrier for organizations to access large-scale training infrastructure without managing hardware directly. This represents a strategic partnership between a major AI platform and a leading GPU manufacturer to address enterprise training infrastructure needs.

Training Infrastructure Inference Economics NVIDIA Hugging Face Training Cluster as a Service +1 more

7 · OpenAI Blog · 1mo ago

OpenAI and SoftBank Group Partner with SB Energy for Multi-Gigawatt AI Data Center Campuses

OpenAI and SoftBank Group have announced a partnership with SB Energy to develop multi-gigawatt AI data center campuses. The initiative includes a 1.2 GW facility in Texas that will support the Stargate project. This represents a major infrastructure investment aimed at scaling AI compute capacity.

Training Infrastructure Inference Economics Stargate SB Energy OpenAI +2 more

6 · OpenAI Blog · 1mo ago

Expanding Stargate to Michigan

OpenAI is expanding its Stargate AI infrastructure initiative to Michigan with a new one-gigawatt campus. The project is framed as strengthening U.S. AI infrastructure while creating jobs and driving investment in the Midwest. No technical specifications or timeline details are provided in the announcement.

Training Infrastructure Inference Economics Stargate Michigan OpenAI

7 · OpenAI Blog · 1mo ago

Samsung and SK join OpenAI's Stargate initiative to advance global AI infrastructure

Samsung and SK Group have joined OpenAI's Stargate initiative, expanding the program's global footprint. The partnership focuses on scaling advanced memory chip production and constructing next-generation data centers in South Korea. This extends Stargate beyond its initial US-centric scope into a major Asian manufacturing and compute hub.

Training Infrastructure Inference Economics Stargate South Korea OpenAI +3 more

7 · OpenAI Blog · 1mo ago

OpenAI and Oracle Expand Stargate with 4.5 GW U.S. Data Center Partnership

OpenAI and Oracle have signed an agreement to develop 4.5 gigawatts of additional Stargate data center capacity in the United States. The deal represents a major infrastructure expansion for OpenAI's Stargate platform, which is positioned as the long-term backbone for delivering AI at scale. The announcement emphasizes job creation, U.S. reindustrialization, and domestic AI leadership.

Training Infrastructure Inference Economics Stargate Oracle OpenAI +1 more

9 · OpenAI Blog · 1mo ago

Microsoft Invests $1 Billion in OpenAI, Becomes Exclusive Cloud Provider

Microsoft announced a $1 billion investment in OpenAI in July 2019, establishing a strategic partnership aimed at building AGI with broadly distributed economic benefits. As part of the deal, Microsoft became OpenAI's exclusive cloud provider, and the two companies agreed to jointly develop Azure AI supercomputing infrastructure. This partnership laid the foundation for OpenAI's large-scale model training on Azure and subsequent deeper integrations between the two organizations.

Training Infrastructure Frontier Model Releases Microsoft Microsoft Azure AGI +3 more

6 · Hugging Face Blog · 24d ago

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

Hugging Face introduces Delta Weight Sync in TRL, a technique for efficiently synchronizing model weight updates during large-scale training by transmitting only the delta (difference) between checkpoints rather than full parameter snapshots. The approach targets trillion-parameter training regimes where checkpoint bandwidth is a significant bottleneck. The post describes integration with the Hugging Face Hub as a storage and distribution layer for these delta updates.

Training Infrastructure Inference Economics Hugging Face Delta Weight Sync TRL
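
To make the delta idea in the Delta Weight Sync summary above concrete, here is a minimal sketch in plain Python. It is not TRL's implementation: the names `make_delta` and `apply_delta`, the flat-list weight layout, and the choice to record changed entries as index-to-new-value pairs are all assumptions made for illustration, and both checkpoints are assumed to share the same tensor names and sizes.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]      # hypothetical layout: tensor name -> flat values
Delta = Dict[str, Dict[int, float]]   # tensor name -> {index: new value}


def make_delta(old: Weights, new: Weights) -> Delta:
    """Record only the entries whose value changed between two checkpoints."""
    delta: Delta = {}
    for name, new_vals in new.items():
        old_vals = old[name]
        changed = {i: v for i, (o, v) in enumerate(zip(old_vals, new_vals)) if o != v}
        if changed:
            delta[name] = changed
    return delta


def apply_delta(old: Weights, delta: Delta) -> Weights:
    """Rebuild the newer checkpoint from the older one plus the delta."""
    rebuilt = {name: list(vals) for name, vals in old.items()}
    for name, changed in delta.items():
        for i, v in changed.items():
            rebuilt[name][i] = v
    return rebuilt


# Toy checkpoints: only one of six values differs.
old_ckpt = {"layer.weight": [0.5, -1.0, 2.0, 0.0], "layer.bias": [0.1, 0.2]}
new_ckpt = {"layer.weight": [0.5, -1.25, 2.0, 0.0], "layer.bias": [0.1, 0.2]}

delta = make_delta(old_ckpt, new_ckpt)
restored = apply_delta(old_ckpt, delta)
sent = sum(len(c) for c in delta.values())
total = sum(len(v) for v in new_ckpt.values())
print(f"sent {sent} of {total} values")
```

Storing the new value for each changed index, rather than an arithmetic difference, keeps the reconstruction bit-exact in this sketch; a real system would also need to handle tensor shapes, dtypes, and transport.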

9 · Anthropic News · 19d ago

Microsoft, NVIDIA, and Anthropic Announce Major Strategic Partnerships with $15B Investment and $30B Azure Compute Commitment

Anthropic has announced simultaneous strategic partnerships with Microsoft and NVIDIA, committing to purchase $30 billion of Azure compute capacity and up to one gigawatt of compute with NVIDIA Grace Blackwell and Vera Rubin systems. NVIDIA and Microsoft are investing up to $10 billion and $5 billion respectively in Anthropic, while Claude models (Sonnet 4.5, Opus 4.1, Haiku 4.5) will be available on Microsoft Foundry and across the Copilot product family. Anthropic and NVIDIA are also establishing a deep technology partnership to co-optimize model performance and future NVIDIA architectures for Anthropic workloads. Amazon remains Anthropic's primary cloud and training partner.

Training Infrastructure Frontier Model Releases Dario Amodei Microsoft Copilot Claude Opus 4.6 +18 more

6 · arXiv · cs.AI · 1mo ago

Framework for Evaluating Datacenter Power Delivery Hierarchies for AI Workloads

Researchers from Microsoft Azure present a simulation framework for evaluating datacenter power delivery designs under AI-era conditions, where rack power density is projected to approach 1MW per deployment by 2027. The framework combines GPU/compute/storage projection models with production operational data to assess throughput, power, and cost metrics across realistic deployment sequences. Key findings show that multi-resource stranding materially affects deployable capacity and effective capital expenditure, and that the correct planning objective is deployable capacity over time rather than installed megawatts. The work addresses the challenge of designing power hierarchies that remain efficient across multiple hardware generations as AI accelerator density rises.

Training Infrastructure Inference Economics power oversubscription datacenter power delivery hierarchy multi-resource stranding +3 more

6 · Qwen Research · 1mo ago

Global-batch Load Balancing for MoE LLM Training from Qwen

Qwen Research introduces a global-batch load balancing technique for Mixture-of-Experts (MoE) LLM training, claiming it is nearly a 'free lunch' improvement. The method addresses expert load imbalance across training batches, a known efficiency and quality bottleneck in MoE architectures. The approach targets the router and expert activation dynamics in transformer-based MoE layers.

Training Infrastructure Frontier Model Releases Global-batch Load Balancing Alibaba Qwen +1 more

6 · Hugging Face Blog · 1mo ago

Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries

A Hugging Face blog post surveys 16 open-source reinforcement learning libraries for LLM training, analyzing their architectural approaches to async and synchronous token generation pipelines. The piece distills practical lessons about throughput, scalability, and design trade-offs across the ecosystem. It serves as a comparative landscape analysis for practitioners building or choosing RL training infrastructure for language models.

Training Infrastructure Open Weights Progress OpenRLHF Reinforcement Learning from Human Feedback veRL +4 more

5 · Hugging Face Blog · 1mo ago

Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training

Hugging Face published a guide on N-dimensional parallelism for multi-GPU training using the Accelerate library. The post covers combining data parallelism, tensor parallelism, pipeline parallelism, and other strategies to efficiently scale model training across GPU clusters. This is a practical technical resource aimed at practitioners working with large-scale distributed training setups.

Training Infrastructure Agent and Tool Ecosystem N-Dimensional Parallelism tensor parallelism pipeline parallelism +3 more

5 · Hugging Face Blog · 1mo ago

Accelerate 1.0.0 Released

Hugging Face has released Accelerate 1.0.0, marking the library's first stable major version. Accelerate is a widely used PyTorch training library that abstracts distributed training across hardware configurations including multi-GPU, TPU, and mixed-precision setups. The 1.0.0 milestone signals API stability and production readiness for the training infrastructure ecosystem.

Training Infrastructure Open Weights Progress Accelerate Hugging Face PyTorch

4 · Hugging Face Blog · 1mo ago

From DeepSpeed to FSDP and Back Again with Hugging Face Accelerate

This Hugging Face blog post covers the practical migration path between DeepSpeed and PyTorch FSDP distributed training backends using the Accelerate library. It addresses configuration differences, compatibility considerations, and workflow patterns for switching between the two frameworks. The post targets practitioners running large-scale model training who need flexibility across distributed training strategies.

Training Infrastructure PyTorch FSDP DeepSpeed Hugging Face +1 more

6 · Hugging Face Blog · 1mo ago

GaLore: Advancing Large Model Training on Consumer-grade Hardware

GaLore (Gradient Low-Rank Projection) is a memory-efficient training technique that reduces optimizer state memory by projecting gradients into a low-rank subspace during training, enabling large model training on consumer-grade hardware. The Hugging Face blog post covers integration of GaLore into the transformers and peft ecosystems. Unlike LoRA, GaLore applies low-rank projection to the full training process rather than constraining weight updates, allowing full-parameter learning with reduced memory footprint. This makes training models like LLaMA-7B feasible on single consumer GPUs.

Training Infrastructure Open Weights Progress PEFT LoRA LLaMA-7B +3 more
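
The memory saving described in the GaLore summary above can be illustrated with a simple element count. This is a rough sketch under stated assumptions, not the library's accounting: it assumes an Adam-style optimizer with two moment tensors, a projection applied on the row side of a single weight matrix, and illustrative sizes (4096 x 4096, rank 128) that do not come from the post. The function names are hypothetical.

```python
def adam_state_elements(rows: int, cols: int) -> int:
    """Adam keeps two moment tensors, each shaped like the weight matrix."""
    return 2 * rows * cols


def low_rank_state_elements(rows: int, cols: int, rank: int) -> int:
    """Projection matrix (rows x rank) plus two moments of the projected
    gradient (rank x cols). Assumes the projection acts on the row side."""
    return rows * rank + 2 * rank * cols


rows, cols, rank = 4096, 4096, 128   # illustrative sizes, not from the post
full = adam_state_elements(rows, cols)
low = low_rank_state_elements(rows, cols, rank)
print(f"full Adam states: {full} elements")
print(f"low-rank states:  {low} elements")
print(f"reduction:        {full / low:.1f}x")
```

The weights themselves are still stored and updated in full, which is the distinction from LoRA that the summary draws; only the optimizer state lives in the smaller subspace.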

5 · Hugging Face Blog · 1mo ago

Easily Train Models with H100 GPUs on NVIDIA DGX Cloud

Hugging Face announced integration with NVIDIA DGX Cloud, enabling users to train models on H100 GPU clusters directly through the Hugging Face platform. The partnership simplifies access to high-end training infrastructure without requiring users to manage cloud provisioning themselves. This represents a continued push to lower the barrier to large-scale model training for the broader ML community.

Training Infrastructure Inference Economics NVIDIA NVIDIA DGX Cloud H100 +2 more

4 · Hugging Face Blog · 1mo ago

Accelerate Large Model Training using DeepSpeed

This Hugging Face blog post explains how to use the Accelerate library in conjunction with DeepSpeed to train large language models more efficiently. It covers integration patterns, configuration options, and practical guidance for leveraging DeepSpeed's ZeRO optimization stages through the Accelerate abstraction layer. The post targets practitioners looking to scale model training without deep infrastructure expertise.

Training Infrastructure Agent and Tool Ecosystem Microsoft DeepSpeed Hugging Face +2 more

4 · Hugging Face Blog · 1mo ago

Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel

This Hugging Face blog post explains how to use PyTorch's Fully Sharded Data Parallel (FSDP) to train large models that exceed single-GPU memory limits. It covers the integration of FSDP with the Hugging Face Accelerate library, enabling distributed sharding of model parameters, gradients, and optimizer states across multiple GPUs. The post provides practical guidance on configuration and usage for scaling large model training.

Training Infrastructure PyTorch FSDP Hugging Face Hugging Face Accelerate +1 more

4 · Hugging Face Blog · 1mo ago

Fit More and Train Faster With ZeRO via DeepSpeed and FairScale

This Hugging Face blog post from January 2021 covers integration of ZeRO (Zero Redundancy Optimizer) memory optimization techniques via DeepSpeed and FairScale into the Transformers training ecosystem. ZeRO partitions optimizer states, gradients, and model parameters across GPUs to enable training of much larger models on the same hardware. The post serves as a practical guide for practitioners looking to scale model training without additional infrastructure investment.

Training Infrastructure Inference Economics Meta AI Microsoft DeepSpeed +4 more
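
The partitioning described in the ZeRO summary above can be turned into a back-of-the-envelope memory estimate. The sketch below assumes mixed-precision Adam accounting (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignores activations, buffers, and fragmentation. Those byte counts, the function name `zero_model_state_bytes`, and the 7.5B-parameter, 64-GPU example are assumptions for illustration and do not come from the blog post.

```python
def zero_model_state_bytes(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU bytes for model states under a given ZeRO stage.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params    # assumed fp16 weights
    grads = 2.0 * num_params     # assumed fp16 gradients
    optim = 12.0 * num_params    # assumed fp32 weight copy + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


# Illustrative configuration: 7.5B parameters on 64 GPUs.
for stage in range(4):
    gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Each stage trades extra communication for lower per-GPU memory, so the right stage depends on interconnect bandwidth as well as model size.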

7 · OpenAI Blog · 1mo ago

Introducing Stargate UK

OpenAI has announced Stargate UK, an extension of its Stargate infrastructure initiative into the United Kingdom. The announcement signals a major AI infrastructure investment in the UK, likely involving data center buildout and compute capacity. This follows the US Stargate program and represents a significant international expansion of OpenAI's infrastructure strategy.

Training Infrastructure Enterprise Deployment Patterns Stargate United Kingdom OpenAI +1 more

6 · OpenAI Blog · 1mo ago

Introducing Stargate Norway: OpenAI's First European AI Data Center Initiative

OpenAI is launching Stargate Norway, its first AI data center initiative in Europe, under the OpenAI for Countries program. This represents an expansion of the Stargate infrastructure platform beyond the United States into European markets. The announcement positions Stargate as OpenAI's overarching infrastructure platform central to its long-term global deployment strategy.

Training Infrastructure Enterprise Deployment Patterns Stargate Norway Stargate Norway +3 more

7 · OpenAI Blog · 1mo ago

Introducing Stargate UAE

OpenAI is launching Stargate UAE, marking the first international deployment of its Stargate AI infrastructure platform. This expansion takes the large-scale compute infrastructure initiative beyond the United States for the first time. The announcement signals OpenAI's intent to build out global AI infrastructure capacity in partnership with regional stakeholders in the UAE.

Training Infrastructure Enterprise Deployment Patterns Stargate OpenAI United Arab Emirates +2 more

7 · OpenAI Blog · 1mo ago

Introducing Triton: Open-source GPU programming for neural networks

OpenAI released Triton 1.0, an open-source Python-like language for GPU programming targeting neural network workloads. It enables researchers without CUDA expertise to write highly efficient GPU kernels, reportedly matching expert-level performance in most cases. The release lowers the barrier to custom GPU kernel development for ML practitioners.

Training Infrastructure Inference Economics Triton Python OpenAI +2 more

6 · OpenAI Blog · 1mo ago

How AI Training Scales: Gradient Noise Scale Predicts Batch Parallelizability

OpenAI researchers report that the gradient noise scale, a statistical metric measuring gradient variance relative to the mean gradient, reliably predicts the optimal batch size and degree of parallelizability across a wide range of neural network training tasks. The finding suggests that more complex tasks with noisier gradients can benefit from increasingly large batch sizes, removing a potential ceiling on scaling. The work frames training dynamics as a systematic, measurable process rather than an empirical art.

Training Infrastructure Frontier Model Releases large-batch training OpenAI gradient noise scale
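
As a concrete reading of the gradient noise scale summary above, the sketch below computes a simple plug-in estimate: the summed per-component variance of per-example gradients divided by the squared norm of their mean. This is a naive illustration, not the estimator used in the original work; it reuses the sample mean in the denominator and is therefore biased for small samples. The function name `simple_noise_scale` and the toy gradients are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def simple_noise_scale(per_example_grads: Sequence[Sequence[float]]) -> float:
    """Trace of the gradient covariance divided by the squared mean-gradient norm."""
    n = len(per_example_grads)
    if n < 2:
        raise ValueError("need at least two per-example gradients")
    dim = len(per_example_grads[0])
    mean: List[float] = [fmean(g[i] for g in per_example_grads) for i in range(dim)]
    # Trace of the covariance = sum of unbiased per-component variances.
    trace_cov = sum(
        sum((g[i] - mean[i]) ** 2 for g in per_example_grads) / (n - 1)
        for i in range(dim)
    )
    grad_sq_norm = sum(m * m for m in mean)
    if grad_sq_norm == 0.0:
        raise ValueError("mean gradient is zero; noise scale is undefined")
    return trace_cov / grad_sq_norm


# Toy per-example gradients in two dimensions (made up for illustration).
grads = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0], [2.0, -2.0]]
print(f"noise scale estimate: {simple_noise_scale(grads):.4f}")
```

A larger value means individual examples disagree more relative to the average direction, which is the regime where averaging over a bigger batch is expected to help.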

7 · OpenAI Blog · 1mo ago

AI and Compute: OpenAI Analysis of Exponential Growth in Training Compute Since 2012

OpenAI published an analysis in May 2018 showing that compute used in the largest AI training runs has been doubling every 3.4 months since 2012, far outpacing Moore's Law's 2-year doubling period. Over the 2012–2018 period, this metric grew by more than 300,000x. The analysis frames compute scaling as a key driver of AI progress and argues for preparing for systems with capabilities well beyond those of the time.

Training Infrastructure Frontier Model Releases Moore's Law OpenAI AI and Compute +1 more
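
The doubling-time figures in the AI and Compute summary above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted there (a 3.4-month doubling time, a 2-year Moore's Law doubling period, and 300,000x growth) and assumes perfectly exponential growth; the function names are hypothetical.

```python
import math


def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months` at a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


def months_to_reach(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` at a fixed doubling time."""
    return math.log2(factor) * doubling_months


AI_DOUBLING_MONTHS = 3.4       # from the summary
MOORE_DOUBLING_MONTHS = 24.0   # 2-year doubling period, from the summary
REPORTED_GROWTH = 300_000      # from the summary

# How long does 300,000x take at a 3.4-month doubling time?
span = months_to_reach(REPORTED_GROWTH, AI_DOUBLING_MONTHS)
# What would a 2-year doubling time deliver over that same span?
moore_growth = growth_factor(span, MOORE_DOUBLING_MONTHS)

print(f"span implied by 300,000x: {span:.1f} months")
print(f"growth at a 2-year doubling over that span: {moore_growth:.1f}x")
```

Under these assumptions the 300,000x figure corresponds to a bit over five years of sustained doubling, while a 2-year doubling period over the same span yields only single-digit growth.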

6 · arXiv · cs.CL · 1mo ago

ChunkFT: Memory-Efficient Full Fine-Tuning via Byte-Streamed Chunk Optimization

ChunkFT is a fine-tuning framework that reformulates full-parameter optimization around a dynamically activated working set of sub-tensors, enabling gradient computation without dense gradient materialization. It achieves full-parameter fine-tuning of a 7B model in 13.72GB GPU memory on a single RTX 4090, and scales Llama 3-70B fine-tuning to 2×H800 GPUs. Downstream evaluations on language understanding, math reasoning, and MT-Bench show ChunkFT matches or exceeds full-parameter fine-tuning quality while outperforming existing memory-efficient baselines such as LoRA-class methods. A theoretical convergence analysis in the deterministic setting is also provided.

Training Infrastructure Open Weights Progress Llama 3.1 70B MT-Bench Meta AI +5 more

7 · arXiv · cs.LG · 26d ago

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures, addressing limitations of existing tools like μP and SDE that cannot handle simultaneous architecture and token-per-expert changes. It uses a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width μP with normalized router scale, and Bridge II maps Dense MoE to sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a 'tune dense once, transfer to all' recipe that enables near-optimal hyperparameter reuse across MoE configurations without costly re-tuning. Experiments on language model and diffusion model pretraining confirm stable hyperparameter optima across architectures and parameter counts.

Training Infrastructure Frontier Model Releases Transformers Mixture of Experts SDE (Stochastic Differential Equation LR scaling) +3 more

9 · Anthropic News · 23d ago

Anthropic raises $65B in Series H funding at $965B post-money valuation

Anthropic has closed a $65 billion Series H round led by Altimeter Capital, Dragoneer, Greenoaks, and Sequoia Capital, valuing the company at $965 billion post-money. The company reports annualized run-rate revenue crossing $47 billion and highlights major compute expansion agreements with Amazon (up to 5 GW), Google/Broadcom (5 GW of TPU capacity), and SpaceX (Colossus GPU access). Strategic infrastructure partners Micron, Samsung, and SK hynix join the round alongside a broad syndicate of institutional investors. Funding is earmarked for safety and interpretability research, compute scaling, and product expansion including Claude Code and Cowork.

Training Infrastructure Frontier Model Releases Google Cloud Alfred Lin Claude Opus 4.6 +21 more
