Topic guide · In-depth

Training Infrastructure: The Gigawatt Race Reshaping AI's Hardware Foundation


TL;DR: AI training infrastructure has undergone a phase transition, from cloud partnerships measured in billions of dollars to sovereign compute commitments measured in gigawatts, with frontier labs locking in decade-long hardware deals that dwarf any prior technology buildout. The central tension is no longer whether to scale, but how to diversify across chip vendors and cloud providers while the underlying science of scaling is itself being refined and challenged by new theoretical frameworks.

Key takeaways

Anthropic has assembled over 10 GW of committed compute across Amazon Trainium (5 GW), Google/Broadcom TPUs (multi-GW, online 2027), NVIDIA/Microsoft Azure (1 GW), SpaceX Colossus (300 MW / 220,000+ GPUs), and Fluidstack custom data centers in Texas and New York.
OpenAI's Stargate project targets up to $500 billion in U.S. AI infrastructure over four years, with a 1 GW Michigan data center already breaking ground; separate deals with NVIDIA (10 GW), AMD (6 GW), and Broadcom (10 GW of custom accelerators) signal aggressive chip-vendor diversification.
The foundational 2020 OpenAI scaling-laws paper established power-law relationships between compute, data, and parameters — but the 2026 Shannon Scaling Law proposes an SNR-based capacity limit that explains non-monotonic phenomena like catastrophic overtraining that classical laws cannot.
Hardware-software co-development has become a competitive moat: Anthropic engineers write low-level kernels and contribute to the AWS Neuron stack for Trainium, while OpenAI and Broadcom are co-designing next-generation AI accelerator architectures.
Hyperparameter transfer research (μP, Complete-muE, embedding-LR analysis) is maturing into a practical discipline that lets labs tune at small scale and extrapolate to frontier runs — directly reducing the cost of large training experiments.
Iranian drone strikes on AWS data centers in Bahrain and the UAE in March 2026 marked the first known targeting of commercial cloud infrastructure during active conflict, introducing physical-security risk as a new variable in infrastructure planning.

What this area covers

Training infrastructure encompasses the full hardware and systems stack required to pre-train frontier AI models: the chips (GPUs, TPUs, custom accelerators), the data centers that house them, the cloud and co-location agreements that provision them, the distributed training software that coordinates them, and the scaling science that determines how to allocate resources across compute, data, and model size. It is the physical substrate on which every capability advance in AI ultimately depends.

Why it matters

Infrastructure is the binding constraint on what models can be built and when. A lab that cannot secure sustained access to hundreds of thousands of accelerators cannot run the training experiments needed to stay at the frontier. Conversely, a lab that locks in multi-gigawatt capacity years in advance gains a structural advantage that is difficult to replicate quickly. The deals in this bundle are not procurement events — they are strategic bets on which hardware architectures and cloud relationships will define the next generation of models.

The foundational science: scaling laws and their limits

The intellectual foundation of modern training infrastructure strategy is the 2020 OpenAI scaling laws paper, which established empirical power-law relationships between model performance and three variables: compute, data, and parameters. This gave labs a principled framework for deciding how large to train and how much data to use — and implicitly justified the multi-billion-dollar training runs that followed.
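A power law of the form loss = a * compute^(-alpha) becomes a straight line in log-log space, which is why such relationships can be fitted on small runs and extrapolated. The sketch below illustrates only that mechanic. The data points are synthetic, the constants are made up for the example, and `fit_power_law` is a hypothetical helper, not code from the 2020 paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha), done in log-log space.

    log y = log a - alpha * log x is linear, so ordinary linear
    regression on the logs recovers both constants.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, illustrative data only: losses generated from a known power law.
TRUE_A, TRUE_ALPHA = 10.0, 0.05
compute = [1e18, 1e19, 1e20, 1e21]
loss = [TRUE_A * c ** (-TRUE_ALPHA) for c in compute]

a_hat, alpha_hat = fit_power_law(compute, loss)

# Extrapolate two orders of magnitude beyond the largest fitted point.
predicted_loss = a_hat * 1e23 ** (-alpha_hat)
print(f"alpha ~ {alpha_hat:.4f}, predicted loss at 1e23 FLOPs ~ {predicted_loss:.3f}")
```

Real training curves are noisy and may include an irreducible-loss term, so a fit like this is a starting point rather than a forecast.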

That framework is now being refined. The 2026 Shannon Scaling Law proposes modeling LLM training as information transmission over a noisy channel, using the Shannon-Hartley theorem to derive an SNR-based capacity limit. Validated on Pythia and OLMo2 models trained on up to 307B tokens, it explains non-monotonic phenomena — catastrophic overtraining, quantization-induced degradation — that classical power-law scaling cannot capture, and successfully extrapolates from 6.9B to 12B parameter models. Whether this framework supplants or supplements the classical laws remains an open question, but it signals that the science of scaling is not settled.
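For readers unfamiliar with the Shannon-Hartley theorem mentioned above, it states that channel capacity is C = B * log2(1 + S/N). The sketch below shows only the theorem itself. How the paper maps bandwidth and signal-to-noise ratio onto training quantities is not described in this guide, so no such mapping is attempted here, and the function names are illustrative.

```python
import math

def shannon_hartley_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """Channel capacity in bits per second: C = B * log2(1 + S/N)."""
    if bandwidth_hz < 0 or snr_linear < 0:
        raise ValueError("bandwidth and SNR must be non-negative")
    return bandwidth_hz * math.log2(1.0 + snr_linear)

def db_to_linear(snr_db: float) -> float:
    """Convert a signal-to-noise ratio from decibels to a linear ratio."""
    return 10.0 ** (snr_db / 10.0)

# Capacity grows only logarithmically with SNR: each extra 10 dB adds
# roughly the same number of bits rather than multiplying capacity.
for snr_db in (0, 10, 20, 30):
    capacity = shannon_hartley_capacity(1.0, db_to_linear(snr_db))
    print(f"SNR {snr_db:>2} dB -> {capacity:.2f} bits/s per Hz")
```

The logarithmic dependence on SNR is the property that makes a capacity ceiling plausible as a model of saturating or degrading performance.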

The gigawatt buildout: Anthropic's compute stack

Anthropic has assembled the most publicly documented multi-vendor compute portfolio in the industry. Its primary training relationship is with Amazon, anchored by a 10-year, $100B+ commitment securing up to 5 GW of Trainium2 through Trainium4 capacity, with nearly 1 GW of Trainium2 and Trainium3 online by end of 2026. This is not a passive procurement deal: Anthropic engineers write low-level kernels and contribute to the AWS Neuron software stack, making the relationship a hardware-software co-development partnership.

Supplementing Amazon, Anthropic has signed a multi-gigawatt Google/Broadcom TPU deal (capacity online from 2027, described as its largest compute commitment to date), a 1 GW NVIDIA Grace Blackwell/Vera Rubin commitment via a $30B Azure compute purchase from Microsoft, and access to SpaceX's Colossus 1 data center — over 300 MW and 220,000+ NVIDIA GPUs. A $50B commitment to Fluidstack for purpose-built data centers in Texas and New York rounds out the domestic footprint. The company has also expressed interest in orbital compute capacity with SpaceX, though no capacity figures are attached to that aspiration.

The practical effect of this stack is visible in product terms: the SpaceX Colossus deal alone enabled Anthropic to double Claude Code rate limits and remove peak-hour restrictions for Pro and Max users.
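The Colossus figures quoted above (over 300 MW and 220,000+ NVIDIA GPUs) imply a rough all-in power budget per GPU. The sketch below works through that arithmetic. Both inputs are stated as lower bounds in the source, so the result is an order-of-magnitude estimate, and the helper names are illustrative.

```python
def watts_per_gpu(facility_megawatts: float, gpu_count: int) -> float:
    """Average facility power per GPU in watts.

    This is an all-in figure: it folds cooling, networking, host CPUs
    and storage into the per-GPU number, not just the chip itself.
    """
    return facility_megawatts * 1_000_000 / gpu_count

def gpus_for_capacity(gigawatts: float, watts_each: float) -> int:
    """How many GPUs a given power budget supports at a fixed per-GPU draw."""
    return int(gigawatts * 1e9 / watts_each)

colossus_w = watts_per_gpu(300, 220_000)
print(f"Colossus 1: about {colossus_w:.0f} W per GPU, all-in")

# Hypothetical: what 1 GW would support at the same ratio.
same_ratio_1gw = gpus_for_capacity(1.0, colossus_w)
print(f"1 GW at that ratio: about {same_ratio_1gw:,} GPUs")
```

The 1 GW projection is purely illustrative. Trainium, TPU, and newer NVIDIA parts draw different amounts of power, so gigawatt figures across vendors do not convert to chip counts at a single ratio.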

The gigawatt buildout: OpenAI's Stargate and chip diversification

OpenAI's infrastructure strategy centers on the Stargate Project, a joint venture targeting up to $500 billion in U.S. AI infrastructure over four years. A 1 GW data center in Michigan has broken ground. Beyond Stargate, OpenAI has pursued aggressive chip-vendor diversification: a 10 GW datacenter partnership with NVIDIA (Phase 1 in 2026), a 6 GW AMD Instinct GPU deployment (1 GW in 2026), and a multi-year Broadcom collaboration targeting 10 GW of OpenAI-designed custom AI accelerators by 2029. The Broadcom deal is particularly significant — it represents OpenAI's push into custom silicon, reducing structural dependence on NVIDIA.

On the cloud side, OpenAI's exclusive Microsoft Azure relationship has loosened. A $38B multi-year AWS partnership and a subsequent $100B Trainium compute commitment over 8 years (with a $15B Amazon investment) now make AWS the exclusive third-party cloud for OpenAI Frontier's stateful runtime environments — a legal distinction that preserves Microsoft's exclusive rights to stateless API calls while opening a second major cloud relationship.

Hardware-software co-development as competitive moat

A recurring pattern across the bundle is that the most durable infrastructure advantages come not from purchasing capacity but from co-designing the hardware. Anthropic's Trainium kernel work, OpenAI's Broadcom custom accelerator co-development, and NVIDIA's co-optimization of future architectures for Anthropic workloads all reflect the same insight: at frontier scale, the gap between generic hardware performance and workload-optimized performance is large enough to matter competitively.

Mistral's release of Mistral Large 3 — trained on 3,000 NVIDIA H200 GPUs with deep co-optimization for Blackwell/Hopper kernels and NVFP4 format — illustrates that even smaller labs are pursuing hardware-software alignment, though at a different scale.

The systems science layer: distributed training and hyperparameter transfer

Below the infrastructure deals, a quieter but practically important body of research is maturing. Hyperparameter transfer — the ability to find optimal training hyperparameters at small scale and apply them to large runs — directly reduces the cost of frontier experiments.

The Maximal Update Parameterization (μP) framework has become a standard reference point. New work in this bundle shows that much of μP's benefit over standard parameterization with AdamW reduces to a single factor: the embedding layer learning rate. In standard parameterization, the embedding layer acts as a training bottleneck; scaling its learning rate by model width to match μP substantially stabilizes training and improves transfer across scales. Complete-muE extends this logic to Mixture-of-Experts architectures, providing a "tune dense once, transfer to all" recipe that handles simultaneous architecture and token-per-expert changes — a gap that existing tools like μP and SDE could not address.
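The embedding learning-rate adjustment described above can be pictured as a per-parameter learning-rate table. The sketch below is a minimal illustration under stated assumptions: the scaling is taken to be linear in the ratio of model width to a tuned base width, embedding parameters are identified by a hypothetical "embed" name prefix, and `embedding_scaled_lrs` is not a function from any library or from the cited work.

```python
def embedding_scaled_lrs(param_names, base_lr, width, base_width):
    """Return a {parameter name: learning rate} table.

    Assumption: embedding parameters get base_lr * (width / base_width);
    every other parameter keeps base_lr, as in standard parameterization.
    """
    scale = width / base_width
    return {
        name: (base_lr * scale if name.startswith("embed") else base_lr)
        for name in param_names
    }

# Hypothetical parameter names for a small transformer.
names = [
    "embed.weight",
    "blocks.0.attn.weight",
    "blocks.0.mlp.weight",
    "lm_head.weight",
]

# Tune at width 256, then reuse the same base_lr at width 1024.
small = embedding_scaled_lrs(names, base_lr=1e-3, width=256, base_width=256)
large = embedding_scaled_lrs(names, base_lr=1e-3, width=1024, base_width=256)

for name in names:
    print(f"{name:<24} small={small[name]:.4f}  large={large[name]:.4f}")
```

Only the embedding row changes between the two widths, which is the point of the finding: one targeted adjustment rather than a full re-parameterization. The exact exponent and which layers count as embeddings should be taken from the original work, not from this sketch.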

RELEX offers a complementary efficiency gain on the post-training side: by observing that RLVR weight update trajectories are extremely low-rank and near-linearly predictable, it can extrapolate future checkpoints from as few as 15% of training steps, matching or exceeding full RLVR performance on Qwen2.5 and Qwen3 models.
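The idea of extrapolating checkpoints from a low-rank, near-linear update trajectory can be sketched with a truncated SVD followed by a linear fit. This is not the RELEX algorithm, whose details are not given here. It is a toy illustration of the two properties named above, using NumPy, synthetic weights, and a hypothetical `extrapolate_checkpoint` helper.

```python
import numpy as np

def extrapolate_checkpoint(checkpoints, steps, target_step, rank=1):
    """Predict flattened weights at target_step from early checkpoints.

    1. Subtract the first checkpoint to get update trajectories.
    2. Keep the top-`rank` directions of those updates (truncated SVD).
    3. Fit each direction's coefficient as a linear function of the step.
    4. Evaluate the fit at target_step and map back to weight space.
    """
    weights = np.asarray(checkpoints, dtype=float)   # (n_ckpt, n_params)
    t = np.asarray(steps, dtype=float)
    base = weights[0]
    deltas = weights - base
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    basis = vt[:rank]                                 # (rank, n_params)
    coeffs = deltas @ basis.T                         # (n_ckpt, rank)
    design = np.stack([t, np.ones_like(t)], axis=1)   # columns: step, bias
    fit, *_ = np.linalg.lstsq(design, coeffs, rcond=None)
    future = np.array([float(target_step), 1.0]) @ fit
    return base + future @ basis

# Synthetic trajectory: weights drift along one fixed direction.
rng = np.random.default_rng(0)
w0 = rng.normal(size=50)
direction = rng.normal(size=50)

def weights_at(step):
    return w0 + 0.01 * step * direction

# Observe 4 checkpoints covering the first 15% of a 200-step run.
early_steps = [0, 10, 20, 30]
early = [weights_at(s) for s in early_steps]

predicted = extrapolate_checkpoint(early, early_steps, target_step=200)
final_true = weights_at(200)
print("max abs error:", float(np.max(np.abs(predicted - final_true))))
```

The synthetic trajectory is exactly rank one and exactly linear, so the extrapolation is essentially exact. Real RLVR trajectories are only approximately so, which is where the method's actual engineering lies.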

Token-level proxy metrics for forecasting downstream performance (entropy, top-k accuracy, and expert token rank, all computed from a candidate model's next-token distribution) achieve a mean Spearman's rho of 0.81 on model ranking, versus 0.36 for cross-entropy loss, and reduce the compute needed for data selection by roughly 10,000×. These tools collectively make the expensive process of large-scale training more legible and less wasteful.
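Spearman's rho, the ranking metric quoted above, measures how well one score orders a set of models compared with another. The sketch below computes it from scratch for the no-ties case. The proxy and downstream scores are invented for illustration and do not come from the cited work.

```python
def ranks(values):
    """Rank values from 1 to n, smallest first (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).

    This closed form is valid only when neither list contains ties.
    """
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical scores for five candidate models.
proxy_score = [0.2, 0.5, 0.4, 0.9, 0.7]
downstream_accuracy = [0.30, 0.45, 0.50, 0.80, 0.70]

rho = spearman_rho(proxy_score, downstream_accuracy)
print(f"Spearman's rho = {rho:.2f}")
```

Because only the ordering matters, a proxy can score well on this metric even when its absolute values are on a completely different scale from the downstream benchmark.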

Physical security as a new infrastructure variable

The March 2026 Iranian drone strikes on AWS data centers in Bahrain and the UAE introduced a risk dimension that had been largely theoretical: kinetic attacks on commercial cloud infrastructure during active conflict. At least three facilities were damaged, disrupting cloud services across the region. The episode coincided with revelations that Claude, integrated with Palantir's Maven Smart System, had been used in U.S. military targeting operations — compressing a 12-hour targeting process to under one minute. The combination of AI systems being used in active conflict and the physical infrastructure supporting those systems being targeted represents a new category of infrastructure risk that data center siting and redundancy planning will need to account for.

Where the frontier is heading

The trajectory in this bundle points in three directions simultaneously. First, the absolute scale of committed compute will continue to grow: the multi-gigawatt deals announced in 2025–2026 begin delivering capacity in 2026–2027, some run through 2029 or later, and the labs signing them are already planning the next generation. Second, custom silicon will become more central: both OpenAI (via Broadcom) and Anthropic (via Trainium co-development) are moving up the hardware stack, and the efficiency gains from workload-specific architectures will compound over time. Third, the science of scaling is becoming more sophisticated: the Shannon Scaling Law, hyperparameter transfer frameworks, and proxy metrics for downstream performance all point toward a discipline that can extract more signal from each training dollar, even as the absolute number of dollars grows.

The binding constraint is shifting from "can we afford to train at this scale" to "can we build and operate the physical infrastructure fast enough" — a problem that is as much about construction timelines, power grid access, and geopolitical stability as it is about software or algorithms.

Figure: Anthropic's multi-vendor compute stack (as of mid-2026)

Figure: Training infrastructure deal timeline: key inflection points

Major compute commitments by lab (from events bundle)

Lab	Partner	Committed capacity	Timeline	Notable terms
Anthropic	Amazon (Trainium2–4)	Up to 5 GW	~1 GW online by end 2026	$100B+ over 10 years; primary training partner
Anthropic	Google / Broadcom (TPU)	Multi-GW	Online from 2027	Up to 1M TPUs; largest Anthropic commitment to date
Anthropic	Microsoft / NVIDIA (Azure + Grace Blackwell/Vera Rubin)	Up to 1 GW	—	$30B Azure purchase; $10B NVIDIA + $5B Microsoft investment
Anthropic	SpaceX Colossus	300 MW / 220,000+ GPUs	Within one month of deal	Interest in orbital compute; doubled Claude Code rate limits
Anthropic	Fluidstack (custom DCs)	$50B committed	Throughout 2026	Texas and New York sites; ~3,200 jobs
OpenAI	NVIDIA	10 GW datacenter capacity	Phase 1 in 2026	Strategic partnership
OpenAI	AMD (Instinct GPUs)	6 GW	1 GW in 2026	Diversification beyond NVIDIA
OpenAI	Broadcom (custom accelerators)	10 GW	By 2029	OpenAI-designed chips; Ethernet networking co-dev

All figures are from the events bundle; cells with no disclosed value show —. Anthropic's Amazon deal is its primary training relationship; the others are supplementary or inference-focused.
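A quick tally of the table's power figures shows how much of each lab's total is actually disclosed as a number. The sketch below sums only rows with an explicit gigawatt value. Rows given as "Multi-GW" or in dollars are listed separately rather than estimated, and the "up to" figures are treated as ceilings.

```python
# Capacity figures from the table above, in gigawatts.
# None marks a commitment not stated as a single gigawatt number.
commitments = [
    ("Anthropic", "Amazon (Trainium2-4)", 5.0),
    ("Anthropic", "Google / Broadcom (TPU)", None),  # "Multi-GW"
    ("Anthropic", "Microsoft / NVIDIA", 1.0),
    ("Anthropic", "SpaceX Colossus", 0.3),
    ("Anthropic", "Fluidstack", None),               # stated in dollars only
    ("OpenAI", "NVIDIA", 10.0),
    ("OpenAI", "AMD", 6.0),
    ("OpenAI", "Broadcom", 10.0),
]

def tally(rows):
    """Sum disclosed gigawatts per lab and collect undisclosed partners."""
    totals, undisclosed = {}, {}
    for lab, partner, gigawatts in rows:
        if gigawatts is None:
            undisclosed.setdefault(lab, []).append(partner)
        else:
            totals[lab] = totals.get(lab, 0.0) + gigawatts
    return totals, undisclosed

totals, undisclosed = tally(commitments)
for lab, gigawatts in totals.items():
    missing = ", ".join(undisclosed.get(lab, [])) or "none"
    print(f"{lab}: {gigawatts:.1f} GW disclosed; not quantified: {missing}")
```

For Anthropic, the numerically disclosed rows sum to well under the headline total, so the remainder depends on the size of the Google/Broadcom and Fluidstack commitments, which the table does not quantify in gigawatts.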

FAQ

Why are AI labs signing decade-long compute deals instead of buying capacity on-demand?

Frontier training runs require sustained, predictable access to hundreds of thousands of accelerators simultaneously — a demand profile that spot or on-demand cloud markets cannot reliably satisfy. Long-term contracts also give labs leverage to co-design chips (e.g., Anthropic's Trainium kernel work, OpenAI's Broadcom custom accelerators) and lock in pricing before demand drives costs higher.

What is the significance of the Shannon Scaling Law relative to the original OpenAI scaling laws?

The 2020 OpenAI scaling laws describe smooth power-law improvements with compute, data, and parameters, but cannot explain phenomena like catastrophic overtraining or quantization-induced degradation. The 2026 Shannon Scaling Law reframes training as information transmission over a noisy channel, introducing an SNR-based capacity limit that predicts U-shaped performance degradation when signal-to-noise ratio is insufficient — capturing failure modes the classical framework misses.

How does hyperparameter transfer research reduce training costs at scale?

Methods like μP and Complete-muE let labs find optimal learning rates and other hyperparameters on small, cheap proxy models and transfer them to large runs without re-tuning; Complete-muE extends this to Mixture-of-Experts models with its 'tune dense once, transfer to all' recipe. Research in this bundle also shows that much of μP's benefit reduces to a single factor: scaling the embedding layer's learning rate, which stabilizes training across model widths.

Is Anthropic's compute strategy concentrated on one vendor?

No — Anthropic explicitly maintains a diversified stack across Amazon Trainium (primary training), Google TPUs, and NVIDIA GPUs, with SpaceX Colossus and Fluidstack custom data centers as additional capacity. Amazon remains the primary cloud and training partner, but the multi-vendor approach hedges against supply, geopolitical, and technical risk.

What physical-security risk did the March 2026 AWS strikes introduce?

Iranian drone strikes damaged at least three AWS data centers in Bahrain and the UAE, disrupting cloud services across the region. This was the first known kinetic attack on commercial cloud infrastructure during active conflict, adding geographic and geopolitical exposure as a new variable in data center siting decisions — particularly for labs with government and defense customers.


More on Training Infrastructure (6)

Anthropic News · 1mo ago

Anthropic and Amazon Expand Collaboration for Up to 5 Gigawatts of New Compute

Anthropic has signed a major expanded agreement with Amazon committing over $100 billion to AWS technologies over ten years, securing up to 5GW of compute capacity for training and deploying Claude across Trainium2 through Trainium4 chips. Amazon is investing an additional $5 billion in Anthropic today, with up to $20 billion more possible in the future, building on $8 billion previously invested. The deal includes nearly 1GW of Trainium2 and Trainium3 capacity coming online by end of 2026, expanded inference in Asia and Europe, and the full Claude Platform becoming available directly within AWS. Anthropic disclosed its run-rate revenue has surpassed $30 billion, up from approximately $9 billion at end of 2025.

Anthropic News · 1mo ago

Anthropic Expands Partnership with Google and Broadcom for Multi-Gigawatt TPU Compute Capacity

Anthropic has signed a new agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity expected to come online starting in 2027, representing the company's largest compute commitment to date. The announcement coincides with Anthropic reporting run-rate revenue surpassing $30 billion, up from ~$9 billion at end of 2025, and the number of enterprise customers spending over $1M annually doubling to 1,000+ in under two months. The compute will be predominantly US-sited, extending Anthropic's November 2025 $50B American infrastructure commitment. Anthropic continues to operate across AWS Trainium, Google TPUs, and NVIDIA GPUs, with Amazon remaining its primary cloud and training partner.

OpenAI Blog · 1mo ago

Building the compute infrastructure for the Intelligence Age

OpenAI is scaling its Stargate initiative to expand compute infrastructure aimed at supporting AGI development. The announcement describes new data center capacity additions to meet growing AI demand. This represents a continuation of OpenAI's large-scale infrastructure buildout strategy under the Stargate program.

Anthropic News · 1mo ago

Anthropic Announces SpaceX Colossus Compute Deal and Higher Claude Usage Limits

Anthropic has signed an agreement with SpaceX to access the full compute capacity of the Colossus 1 data center, gaining over 300 megawatts and 220,000+ NVIDIA GPUs within a month. This deal, combined with prior agreements with Amazon, Google/Broadcom, Microsoft/NVIDIA, and Fluidstack, enables Anthropic to double Claude Code rate limits, remove peak-hour restrictions for Pro/Max users, and raise API rate limits for Claude Opus models. The announcement also notes interest in developing orbital AI compute capacity with SpaceX, and outlines international infrastructure expansion for enterprise compliance needs.

arXiv · cs.LG · 1mo ago

RRFP: A Readiness-Driven Runtime for Pipeline-Parallel Training Under Runtime Variability

The paper introduces Runtime-Readiness-First Pipeline (RRFP), a new runtime for pipeline-parallel large-model training that treats schedules as non-binding hint orders rather than strict execution sequences. By combining message-driven asynchronous communication, lightweight tensor-parallel coordination, and ready-set arbitration, RRFP dynamically dispatches work based on actual task readiness, reducing idle bubbles and stage misalignment. Implemented on a Megatron-based framework and evaluated at up to 128 GPUs, RRFP achieves up to 1.77× speedup on language-only workloads and 2.77× on multimodal workloads versus fixed-order baselines, and outperforms the fastest comparable external system by up to 1.84×.

Hugging Face Blog · 1mo ago

The Technology Behind BLOOM Training

This Hugging Face blog post details the infrastructure and training methodology used to train BLOOM, a 176-billion parameter open-access multilingual language model. It covers the use of Megatron-DeepSpeed for distributed training across hundreds of GPUs, including tensor parallelism, pipeline parallelism, and data parallelism strategies. The post also discusses hardware setup, memory optimization techniques, and lessons learned during the large-scale training run.

