Almanac
Topic guide · Beginner

Training Infrastructure: The Compute Arms Race Powering Modern AI

TL;DR: Building a frontier AI model today takes more than clever algorithms. It demands staggering amounts of specialized hardware, purpose-built data centers, and multi-billion-dollar partnerships between AI labs and the world's biggest cloud providers. What began as a single cloud deal has grown into a web of gigawatt-scale commitments, custom silicon projects, and physical construction sites, reshaping how AI is funded, built, and delivered.

Key takeaways

  • OpenAI's Stargate initiative targets up to $500 billion in U.S. AI infrastructure over four years, with a 1 GW data center already breaking ground in Michigan.
  • Anthropic has assembled a multi-cloud compute portfolio spanning Amazon Trainium, Google TPUs (up to 1 million units), NVIDIA GPUs, and SpaceX's Colossus cluster (220,000+ GPUs, 300+ MW).
  • OpenAI is diversifying beyond NVIDIA with partnerships targeting 10 GW of OpenAI-designed accelerators via Broadcom and 6 GW of AMD Instinct GPUs.
  • The foundational scaling laws paper (OpenAI, 2020) established that model performance improves predictably with compute, data, and parameters — the intellectual bedrock justifying every gigawatt commitment since.
  • Newer research (Shannon Scaling Law, hyperparameter transfer tools) is refining how labs plan and execute training runs, reducing wasted compute.
  • Physical infrastructure is now a geopolitical asset: Iranian drone strikes on AWS data centers in 2026 marked the first known targeting of commercial cloud infrastructure during active conflict.

What training infrastructure is

When you ask an AI a question, the response feels instant — but behind it lies months of preparation: a "training run" in which a model learns from vast amounts of text and data. That process is extraordinarily hungry for computing power. Training infrastructure is the full stack that makes it possible: the specialized chips (GPUs, TPUs, custom accelerators), the data centers that house them, the software that coordinates thousands of chips working in parallel, and the cloud partnerships and funding deals that pay for all of it.

Why it matters to everyone, not just engineers

The scale of infrastructure a lab can access sets a ceiling on the capability of the models it can build. This is not just a technical detail — it shapes which companies can compete at the frontier, how quickly AI improves, where the physical hardware sits (and who controls it), and even how much it costs to use AI products. The billions of dollars flowing into data centers today are, in a real sense, bets on what AI will be able to do in two or three years.

The intellectual foundation: scaling laws

The modern infrastructure race has a clear intellectual starting point. In 2020, OpenAI published research establishing what are now called scaling laws: the finding that AI model quality improves in a predictable, mathematical way as you add more compute, more data, and more model parameters. This gave labs a principled reason to keep building bigger — if you can afford more compute, you can forecast roughly how much better your model will get. That paper became the intellectual justification for every gigawatt commitment that followed.
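To make "predictable, mathematical" concrete, here is a minimal sketch of a power-law relationship between training compute and loss (lower loss means a better model). The functional form is the general shape scaling-law research describes; the constants `ALPHA` and `C_REF` and both function names are placeholders chosen for this illustration, not fitted values from any paper.

```python
# Illustrative power law: predicted loss falls as a fixed power of compute.
# ALPHA and C_REF are assumed placeholder constants, not published fits.
ALPHA = 0.05   # assumed exponent; a small exponent means slow but steady gains
C_REF = 1.0    # assumed reference compute at which predicted loss equals 1.0


def predicted_loss(compute: float) -> float:
    """Return the loss predicted by L(C) = (C_REF / C) ** ALPHA."""
    if compute <= 0:
        raise ValueError("compute must be positive")
    return (C_REF / compute) ** ALPHA


def compute_multiplier_for(loss_ratio: float) -> float:
    """Return how many times more compute is needed to scale loss by loss_ratio.

    Solves k ** -ALPHA == loss_ratio for k, with 0 < loss_ratio < 1.
    """
    if not 0 < loss_ratio < 1:
        raise ValueError("loss_ratio must be between 0 and 1")
    return loss_ratio ** (-1.0 / ALPHA)


# Each tenfold increase in compute lowers the predicted loss by the same factor.
for compute in (1.0, 10.0, 100.0, 1000.0):
    print(f"compute x{compute:>6.0f} -> predicted loss {predicted_loss(compute):.3f}")

print(f"compute needed for 10% lower loss: x{compute_multiplier_for(0.9):.1f}")
```

Under the assumed exponent of 0.05, cutting the predicted loss by 10% takes roughly eight times more compute. That kind of arithmetic is why improvements at the frontier are forecastable but expensive.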

More recent research is refining the picture. The Shannon Scaling Law (2026) proposes a new framework borrowed from information theory that can explain phenomena the original scaling laws missed — like why a model can be overtrained into worse performance. Other work on hyperparameter transfer (how to tune a small model and reliably apply those settings to a much larger one) is helping labs avoid wasting expensive compute on poorly configured runs.

The infrastructure landscape today

The cloud partnership model

The dominant pattern is a deep partnership between an AI lab and one or more major cloud providers. The lab gets access to chips and data center capacity quickly; the cloud provider gets a major customer and, often, an equity stake.

Anthropic has built the most diversified compute portfolio. Its primary training partner is Amazon Web Services, with a commitment of up to 5 gigawatts of capacity on Amazon's custom Trainium chips over ten years — a deal worth over $100 billion. Alongside that, Anthropic has agreements with Google and Broadcom for multiple gigawatts of TPU (Google's custom AI chip) capacity, a $30 billion Azure compute commitment with Microsoft paired with up to 1 GW of NVIDIA's latest hardware, and access to SpaceX's Colossus data center — over 220,000 NVIDIA GPUs drawing more than 300 megawatts of power. Anthropic also committed $50 billion to build its own purpose-built data centers in Texas and New York with partner Fluidstack.

OpenAI started with Microsoft as its exclusive cloud provider — a relationship that began with a $1 billion investment in 2019 and grew into the backbone of GPT model training on Azure. More recently, OpenAI has diversified aggressively: a $100 billion Trainium commitment with Amazon, a 10 GW partnership with NVIDIA, a 6 GW deal with AMD for Instinct GPUs, and a 10 GW custom silicon project with Broadcom targeting OpenAI-designed accelerators by 2029. The Stargate initiative, announced in January 2025, is the umbrella for OpenAI's U.S. infrastructure ambitions — up to $500 billion over four years, with a 1 GW data center already breaking ground in Michigan.

Custom silicon: reducing dependence on NVIDIA

A notable trend is the push by both labs and cloud providers to develop chips that are purpose-built for AI training, rather than relying entirely on NVIDIA's GPUs. Amazon's Trainium line, Google's TPUs, and OpenAI's Broadcom collaboration all reflect the same logic: at gigawatt scale, even small efficiency gains per chip translate into enormous cost savings and competitive advantages. Mistral AI, a smaller European lab, trained its Mistral Large 3 model on 3,000 NVIDIA H200 GPUs — a reminder that not every player is operating at the same scale, and that efficiency matters as much as raw size.

Efficiency research: doing more with less

Not every advance in training infrastructure is about building more. A parallel track of research focuses on getting more out of existing compute. Mixture-of-Experts (MoE) architectures, used in models like DeepSeek V3.2 and Mistral Large 3, activate only a fraction of a model's parameters for each token they process, which sharply cuts the compute needed per token during both training and serving. DeepSeek's sparse attention architecture delivered efficiency gains large enough to justify an API price cut of more than 50%. Research on hyperparameter transfer (including the Complete-muE framework for MoE models) helps labs avoid expensive trial-and-error when scaling up. And work on RELEX suggests that reinforcement learning training trajectories are predictable enough to extrapolate a model's final performance after seeing only 15% of its training steps, which could save large amounts of compute.
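The Mixture-of-Experts idea can be sketched in a few lines. This toy example uses top-k routing, one common way to choose experts: a gate scores every expert, only the `TOP_K` best are run, and their outputs are blended. The expert count, the value of `TOP_K`, the scalar "experts", and all function names are assumptions made for illustration; real models route high-dimensional vectors through learned networks and may use different routing schemes.

```python
import math
import random

NUM_EXPERTS = 8  # assumed number of experts, for illustration only
TOP_K = 2        # assumed number of experts activated per token


def softmax(scores):
    """Convert raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_scores, k=TOP_K):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, gate_scores, k=TOP_K):
    """Run only the selected experts and return their weighted sum."""
    weights = route(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy experts: each scalar function stands in for a feed-forward block.
experts = [lambda x, scale=scale: scale * x for scale in range(1, NUM_EXPERTS + 1)]

rng = random.Random(0)  # fixed seed so the run is repeatable
gate_scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route(gate_scores)
print("experts used:", sorted(chosen), "of", NUM_EXPERTS)
print("share of experts run for this token:", TOP_K / NUM_EXPERTS)
print("output:", moe_forward(3.0, experts, gate_scores))
```

With 2 of 8 equally sized experts active, only a quarter of the expert parameters do any work for a given token, even though all eight must still be stored in memory.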

Infrastructure as geopolitics

The scale of these investments has made AI infrastructure a matter of national interest — and physical vulnerability. In March 2026, Iranian drone strikes damaged at least three AWS data centers in Bahrain and the UAE, disrupting cloud services across the region. The episode was the first known targeting of commercial cloud infrastructure during active conflict, and it underscored that the data centers powering AI are physical objects in the real world, subject to the same risks as any other critical infrastructure.

Both Anthropic and OpenAI have framed their U.S. data center investments partly in terms of domestic AI leadership, aligning with government priorities around keeping frontier AI infrastructure on American soil.

Where it's heading

The trajectory is clear: more gigawatts, more custom silicon, more geographic diversification, and more vertical integration as labs move from renting compute to building their own facilities. The open research questions are about efficiency — whether new architectures, better scaling laws, and smarter training recipes can let smaller players stay competitive, or whether the infrastructure gap between the largest labs and everyone else will simply keep widening.

[Figure: Anthropic's compute supply chain (as of mid-2026)]

Major AI Infrastructure Commitments (from the events bundle)

Party A   | Party B            | Scale / Commitment                   | Chip / Platform
----------|--------------------|--------------------------------------|----------------------------
Anthropic | Amazon (AWS)       | Up to 5 GW, $100B+ over 10 years     | Trainium2–4
Anthropic | Google / Broadcom  | Multi-GW TPU capacity (online 2027)  | Google TPUs
Anthropic | Microsoft / NVIDIA | $30B Azure compute + up to 1 GW      | Grace Blackwell / Vera Rubin
Anthropic | SpaceX (Colossus)  | 300+ MW, 220,000+ GPUs               | NVIDIA GPUs
Anthropic | Fluidstack         | $50B, custom data centers TX & NY    | —
OpenAI    | Amazon (AWS)       | $100B Trainium over 8 years          | Amazon Trainium
OpenAI    | NVIDIA             | 10 GW datacenter capacity            | NVIDIA systems
OpenAI    | Broadcom           | 10 GW custom AI accelerators by 2029 | OpenAI-designed silicon
OpenAI    | AMD                | 6 GW AMD Instinct GPUs               | AMD Instinct

All figures from the events bundle; unknown cells render —.

Timeline

  1. 2019: Microsoft invests $1B in OpenAI, becomes exclusive cloud provider on Azure

  2. 2020: OpenAI publishes scaling laws: compute, data, and parameters predict model quality

  3. 2023: Amazon invests up to $4B in Anthropic; AWS becomes primary cloud and training partner

  4. January 2025: OpenAI announces Stargate: up to $500B in U.S. AI infrastructure over four years

  5. November 2025: Anthropic commits $50B to U.S. data centers with Fluidstack; Microsoft/NVIDIA partnerships announced

  6. 2026: Anthropic–Amazon deal expands to 5 GW and $100B+; Anthropic–Google multi-GW TPU deal signed

  7. 2026: Anthropic accesses SpaceX Colossus (220,000+ GPUs); OpenAI breaks ground on 1 GW Michigan data center

FAQ

Why do AI models need so much computing power?

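Training a frontier AI model means passing trillions of words of text through a network with billions of adjustable parameters and nudging those parameters slightly after every batch. Each pass involves an enormous number of mathematical operations, and the whole process runs for weeks or months on thousands of chips. The foundational research showing that more compute reliably produces better models has driven labs to keep scaling up.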

What is a 'gigawatt' of compute, and why does that unit matter?

A gigawatt (GW) is a unit of electrical power, the same unit used to rate power plants. AI labs now describe their data center ambitions in gigawatts because the electricity needed to run hundreds of thousands of chips around the clock is the real bottleneck, not just the number of servers.
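For a rough sense of scale, the Colossus figures quoted in this guide (over 220,000 GPUs drawing more than 300 megawatts) can be turned into a back-of-envelope estimate. Both inputs are lower bounds, and treating their ratio as typical for other data centers is an assumption made for this sketch, so read the results as orders of magnitude. The function names are hypothetical.

```python
# Figures quoted in this guide for the Colossus data center (both lower bounds).
COLOSSUS_POWER_MW = 300
COLOSSUS_GPUS = 220_000
HOURS_PER_YEAR = 24 * 365


def kw_per_gpu(power_mw: float, gpus: int) -> float:
    """Facility power divided by GPU count, in kilowatts per GPU."""
    return power_mw * 1000 / gpus


def gpus_per_gigawatt(kw_each: float) -> float:
    """How many GPUs 1 GW could power at the given draw (1 GW = 1,000,000 kW)."""
    return 1_000_000 / kw_each


def annual_energy_twh(power_gw: float) -> float:
    """Energy used in a year of continuous operation, in terawatt-hours."""
    return power_gw * HOURS_PER_YEAR / 1000  # GWh converted to TWh


per_gpu = kw_per_gpu(COLOSSUS_POWER_MW, COLOSSUS_GPUS)
print(f"power per GPU (facility-wide): {per_gpu:.2f} kW")
print(f"GPUs supported by 1 GW:        {gpus_per_gigawatt(per_gpu):,.0f}")
print(f"energy for 1 GW over a year:   {annual_energy_twh(1.0):.2f} TWh")
```

At that ratio each GPU accounts for about 1.36 kW of facility power, so a single gigawatt corresponds to roughly 730,000 GPUs and about 8.76 terawatt-hours of electricity per year if run continuously.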

Why are AI labs signing deals with cloud providers instead of building everything themselves?

Building and operating data centers at this scale takes years and tens of billions of dollars. Cloud partnerships let labs access chips and power quickly while the providers handle construction, cooling, and networking — though the largest labs are now also building their own facilities.

What is Stargate?

Stargate is OpenAI's infrastructure initiative, announced in January 2025, targeting up to $500 billion in U.S. AI compute investment over four years, including a 1 GW data center already under construction in Michigan.

Is all this infrastructure just for training, or does it also run the AI products people use?

Both — the same cloud partnerships cover training new models and serving (running) them for users. Anthropic, for example, used its expanded compute to double Claude Code rate limits and remove peak-hour restrictions for subscribers.

More on Training Infrastructure (6)

Anthropic News · 1mo ago

Anthropic and Amazon Expand Collaboration for Up to 5 Gigawatts of New Compute

Anthropic has signed a major expanded agreement with Amazon committing over $100 billion to AWS technologies over ten years, securing up to 5GW of compute capacity for training and deploying Claude across Trainium2 through Trainium4 chips. Amazon is investing an additional $5 billion in Anthropic today, with up to $20 billion more possible in the future, building on $8 billion previously invested. The deal includes nearly 1GW of Trainium2 and Trainium3 capacity coming online by end of 2026, expanded inference in Asia and Europe, and the full Claude Platform becoming available directly within AWS. Anthropic disclosed its run-rate revenue has surpassed $30 billion, up from approximately $9 billion at end of 2025.

Anthropic News · 1mo ago

Anthropic Expands Partnership with Google and Broadcom for Multi-Gigawatt TPU Compute Capacity

Anthropic has signed a new agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity expected to come online starting in 2027, representing the company's largest compute commitment to date. The announcement coincides with Anthropic reporting run-rate revenue surpassing $30 billion, up from ~$9 billion at end of 2025, and the number of enterprise customers spending over $1M annually doubling to 1,000+ in under two months. The compute will be predominantly US-sited, extending Anthropic's November 2025 $50B American infrastructure commitment. Anthropic continues to operate across AWS Trainium, Google TPUs, and NVIDIA GPUs, with Amazon remaining its primary cloud and training partner.

OpenAI Blog · 1mo ago

Building the compute infrastructure for the Intelligence Age

OpenAI is scaling its Stargate initiative to expand compute infrastructure aimed at supporting AGI development. The announcement describes new data center capacity additions to meet growing AI demand. This represents a continuation of OpenAI's large-scale infrastructure buildout strategy under the Stargate program.

Anthropic News · 1mo ago

Anthropic Announces SpaceX Colossus Compute Deal and Higher Claude Usage Limits

Anthropic has signed an agreement with SpaceX to access the full compute capacity of the Colossus 1 data center, gaining over 300 megawatts and 220,000+ NVIDIA GPUs within a month. This deal, combined with prior agreements with Amazon, Google/Broadcom, Microsoft/NVIDIA, and Fluidstack, enables Anthropic to double Claude Code rate limits, remove peak-hour restrictions for Pro/Max users, and raise API rate limits for Claude Opus models. The announcement also notes interest in developing orbital AI compute capacity with SpaceX, and outlines international infrastructure expansion for enterprise compliance needs.

arXiv · cs.LG · 1mo ago

RRFP: A Readiness-Driven Runtime for Pipeline-Parallel Training Under Runtime Variability

The paper introduces Runtime-Readiness-First Pipeline (RRFP), a new runtime for pipeline-parallel large-model training that treats schedules as non-binding hint orders rather than strict execution sequences. By combining message-driven asynchronous communication, lightweight tensor-parallel coordination, and ready-set arbitration, RRFP dynamically dispatches work based on actual task readiness, reducing idle bubbles and stage misalignment. Implemented on a Megatron-based framework and evaluated at up to 128 GPUs, RRFP achieves up to 1.77× speedup on language-only workloads and 2.77× on multimodal workloads versus fixed-order baselines, and outperforms the fastest comparable external system by up to 1.84×.

Hugging Face Blog · 1mo ago

The Technology Behind BLOOM Training

This Hugging Face blog post details the infrastructure and training methodology used to train BLOOM, a 176-billion parameter open-access multilingual language model. It covers the use of Megatron-DeepSpeed for distributed training across hundreds of GPUs, including tensor parallelism, pipeline parallelism, and data parallelism strategies. The post also discusses hardware setup, memory optimization techniques, and lessons learned during the large-scale training run.