What this area covers
Training infrastructure encompasses the full hardware and systems stack required to pre-train frontier AI models: the chips (GPUs, TPUs, custom accelerators), the data centers that house them, the cloud and co-location agreements that provision them, the distributed training software that coordinates them, and the scaling science that determines how to allocate resources across compute, data, and model size. It is the physical substrate on which every capability advance in AI ultimately depends.
Why it matters
Infrastructure is the binding constraint on what models can be built and when. A lab that cannot secure sustained access to hundreds of thousands of accelerators cannot run the training experiments needed to stay at the frontier. Conversely, a lab that locks in multi-gigawatt capacity years in advance gains a structural advantage that is difficult to replicate quickly. The deals in this bundle are not merely procurement events; they are strategic bets on which hardware architectures and cloud relationships will define the next generation of models.
The foundational science: scaling laws and their limits
The intellectual foundation of modern training infrastructure strategy is the 2020 OpenAI scaling laws paper (Kaplan et al., "Scaling Laws for Neural Language Models"), which established empirical power-law relationships between model performance, measured as test loss, and three variables: compute, dataset size, and parameter count. This gave labs a principled framework for deciding how large a model to train and how much data to use, and it implicitly justified the multi-billion-dollar training runs that followed.
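To make the power-law idea concrete, the following is a minimal sketch of how such a relationship can be fitted and extrapolated. It assumes a single-variable form, loss = a * x^(-alpha), fitted by ordinary least squares in log-log space. The function name fit_power_law and the data points are illustrative assumptions, not figures from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx              # slope of the line in log-log space
    intercept = my - slope * mx
    return math.exp(intercept), -slope   # (a, alpha)

# Synthetic, made-up data generated from loss = 10 * C**(-0.05).
compute = [1e18, 1e19, 1e20, 1e21]
loss = [10.0 * c ** -0.05 for c in compute]

a, alpha = fit_power_law(compute, loss)

# Extrapolate the fitted curve to a larger (hypothetical) compute budget.
predicted_loss = a * 1e23 ** (-alpha)
```

In practice the fit is run on a handful of small training runs and then read off at the target budget, which is what lets a lab size a run before paying for it.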
That framework is now being refined. The 2026 Shannon Scaling Law proposes modeling LLM training as information transmission over a noisy channel, using the Shannon-Hartley theorem to derive an SNR-based capacity limit. Validated on Pythia and OLMo2 models trained on up to 307B tokens, it explains non-monotonic phenomena — catastrophic overtraining, quantization-induced degradation — that classical power-law scaling cannot capture, and successfully extrapolates from 6.9B to 12B parameter models. Whether this framework supplants or supplements the classical laws remains an open question, but it signals that the science of scaling is not settled.
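For reference, the Shannon-Hartley theorem in its standard communications form is given below. How the Shannon Scaling Law maps training quantities onto bandwidth, signal, and noise is not specified in this summary, so only the textbook statement is reproduced here.

```latex
C = B \log_2\!\left(1 + \frac{S}{N}\right)
```

Here C is the channel capacity in bits per second, B is the bandwidth in hertz, and S/N is the signal-to-noise ratio expressed as a linear power ratio. Because capacity grows only logarithmically in S/N, a formulation built on it saturates as the ratio rises, unlike a pure power law.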
The gigawatt buildout: Anthropic's compute stack
Anthropic has assembled the most publicly documented multi-vendor compute portfolio in the industry. Its primary training relationship is with Amazon, anchored by a 10-year, $100B+ commitment securing up to 5 GW of Trainium2 through Trainium4 capacity, with nearly 1 GW of Trainium2 and Trainium3 online by end of 2026. This is not a passive procurement deal: Anthropic engineers write low-level kernels and contribute to the AWS Neuron software stack, making the relationship a hardware-software co-development partnership.
Supplementing Amazon, Anthropic has signed a multi-gigawatt Google/Broadcom TPU deal (capacity online from 2027, described as its largest compute commitment to date), a 1 GW NVIDIA Grace Blackwell/Vera Rubin commitment via a $30B Azure compute purchase from Microsoft, and access to SpaceX's Colossus 1 data center — over 300 MW and 220,000+ NVIDIA GPUs. A $50B commitment to Fluidstack for purpose-built data centers in Texas and New York rounds out the domestic footprint. The company has also expressed interest in orbital compute capacity with SpaceX, though no capacity figures are attached to that aspiration.
The practical effect of this stack is visible in product terms: the SpaceX Colossus deal alone enabled Anthropic to double Claude Code rate limits and remove peak-hour restrictions for Pro and Max users.
The gigawatt buildout: OpenAI's Stargate and chip diversification
OpenAI's infrastructure strategy centers on the Stargate Project, a joint venture targeting up to $500 billion in U.S. AI infrastructure over four years. A 1 GW data center in Michigan has broken ground. Beyond Stargate, OpenAI has pursued aggressive chip-vendor diversification: a 10 GW datacenter partnership with NVIDIA (Phase 1 in 2026), a 6 GW AMD Instinct GPU deployment (1 GW in 2026), and a multi-year Broadcom collaboration targeting 10 GW of OpenAI-designed custom AI accelerators by 2029. The Broadcom deal is particularly significant — it represents OpenAI's push into custom silicon, reducing structural dependence on NVIDIA.
On the cloud side, OpenAI's exclusive Microsoft Azure relationship has loosened. A $38B multi-year AWS partnership and a subsequent $100B Trainium compute commitment over 8 years (with a $15B Amazon investment) now make AWS the exclusive third-party cloud for OpenAI Frontier's stateful runtime environments — a legal distinction that preserves Microsoft's exclusive rights to stateless API calls while opening a second major cloud relationship.
Hardware-software co-development as competitive moat
A recurring pattern across the bundle is that the most durable infrastructure advantages come not from purchasing capacity but from co-designing the hardware. Anthropic's Trainium kernel work, OpenAI's Broadcom custom accelerator co-development, and NVIDIA's co-optimization of future architectures for Anthropic workloads all reflect the same insight: at frontier scale, the gap between generic hardware performance and workload-optimized performance is large enough to matter competitively.
Mistral's release of Mistral Large 3 — trained on 3,000 NVIDIA H200 GPUs with deep co-optimization for Blackwell/Hopper kernels and NVFP4 format — illustrates that even smaller labs are pursuing hardware-software alignment, though at a different scale.
The systems science layer: distributed training and hyperparameter transfer
Below the infrastructure deals, a quieter but practically important body of research is maturing. Hyperparameter transfer — the ability to find optimal training hyperparameters at small scale and apply them to large runs — directly reduces the cost of frontier experiments.
The Maximal Update Parameterization (μP) framework has become a standard reference point. New work in this bundle shows that much of μP's benefit over standard parameterization with AdamW reduces to a single factor: the embedding layer learning rate. In standard parameterization, the embedding layer acts as a training bottleneck; scaling its learning rate by model width to match μP substantially stabilizes training and improves transfer across scales. Complete-muE extends this logic to Mixture-of-Experts architectures, providing a "tune dense once, transfer to all" recipe that handles simultaneous architecture and token-per-expert changes — a gap that existing tools like μP and SDE could not address.
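The embedding learning-rate adjustment described above can be sketched as a per-group learning-rate rule. This is an illustration under stated assumptions rather than the cited work's recipe: the linear width multiplier relative to a tuned base width, the function name sp_learning_rates, and the three group names are all hypothetical.

```python
def sp_learning_rates(base_lr, width, base_width, scale_embedding=True):
    """Per-group AdamW learning rates under standard parameterization.

    Assumption: the embedding learning rate is multiplied by
    width / base_width, where base_width is the small proxy model at
    which base_lr was tuned. All other groups keep base_lr.
    """
    width_mult = width / base_width
    embed_lr = base_lr * width_mult if scale_embedding else base_lr
    return {
        "embedding": embed_lr,  # scaled with width (the assumed fix)
        "hidden": base_lr,      # unchanged from the tuned base rate
        "output": base_lr,      # unchanged from the tuned base rate
    }

# Hypothetical numbers: tuned at width 256, transferred to width 4096.
lrs = sp_learning_rates(base_lr=3e-4, width=4096, base_width=256)
```

The resulting dictionary could be passed to an optimizer as separate parameter groups, so only the embedding group's rate changes as the model is widened.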
RELEX offers a complementary efficiency gain on the post-training side: by observing that RLVR weight update trajectories are extremely low-rank and near-linearly predictable, it can extrapolate future checkpoints from as few as 15% of training steps, matching or exceeding full RLVR performance on Qwen2.5 and Qwen3 models.
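The idea of predicting later checkpoints from early ones can be illustrated with a deliberately simplified sketch: a per-coordinate least-squares line fitted through a few observed checkpoints and evaluated at a later step. This is not RELEX's actual procedure, which this summary does not specify, and it omits the low-rank structure entirely. The function name extrapolate_checkpoint and the toy weights are hypothetical.

```python
def extrapolate_checkpoint(steps, checkpoints, target_step):
    """Fit a line through each weight's trajectory and evaluate it at target_step.

    steps:       training steps at which checkpoints were saved
    checkpoints: one flat list of weights per saved step
    """
    n = len(steps)
    mean_t = sum(steps) / n
    var_t = sum((t - mean_t) ** 2 for t in steps)
    predicted = []
    for j in range(len(checkpoints[0])):
        ws = [ckpt[j] for ckpt in checkpoints]
        mean_w = sum(ws) / n
        cov = sum((t - mean_t) * (w - mean_w) for t, w in zip(steps, ws))
        slope = cov / var_t
        predicted.append(mean_w + slope * (target_step - mean_t))
    return predicted

# Toy trajectory: three checkpoints from the first 15% of a 100-step run.
steps = [5, 10, 15]
checkpoints = [[0.10, 1.00], [0.20, 0.90], [0.30, 0.80]]
final_weights = extrapolate_checkpoint(steps, checkpoints, target_step=100)
```

A real method would need to work in a compressed (low-rank) representation of the updates, since fitting every coordinate of a multi-billion-parameter model independently is neither cheap nor robust.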
Token-level proxy metrics for forecasting downstream performance, which use entropy, top-k accuracy, and expert token rank from a candidate model's next-token distribution, achieve a mean Spearman rank correlation (ρ) of 0.81 on model ranking, versus 0.36 for cross-entropy loss, and reduce compute for data selection by roughly 10,000×. Collectively, the tools in this section make the expensive process of large-scale training more legible and less wasteful.
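Since the comparison above rests on Spearman rank correlation, here is a small self-contained sketch of how that statistic is computed: rank both lists (averaging ranks across ties), then take the Pearson correlation of the ranks. The proxy scores and downstream accuracies are made-up values for illustration only.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank over the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical proxy scores and downstream accuracies for four models.
proxy_scores = [0.2, 0.5, 0.4, 0.9]
downstream_acc = [31.0, 45.0, 47.0, 60.0]
rho = spearman_rho(proxy_scores, downstream_acc)
```

A value near 1 means the proxy orders the candidate models almost exactly as the downstream benchmark does, which is the property that matters when the proxy is used to choose between models or data mixtures.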
Physical security as a new infrastructure variable
The March 2026 Iranian drone strikes on AWS data centers in Bahrain and the UAE introduced a risk dimension that had been largely theoretical: kinetic attacks on commercial cloud infrastructure during active conflict. At least three facilities were damaged, disrupting cloud services across the region. The episode coincided with revelations that Claude, integrated with Palantir's Maven Smart System, had been used in U.S. military targeting operations — compressing a 12-hour targeting process to under one minute. The combination of AI systems being used in active conflict and the physical infrastructure supporting those systems being targeted represents a new category of infrastructure risk that data center siting and redundancy planning will need to account for.
Where the frontier is heading
The trajectory in this bundle points in three directions simultaneously. First, the absolute scale of committed compute will continue to grow: the first tranches of the multi-gigawatt deals announced in 2025–2026 come online in 2026–2027, some commitments run to 2029 and beyond, and the labs signing them are already planning the next generation. Second, custom silicon will become more central: both OpenAI (via Broadcom) and Anthropic (via Trainium co-development) are moving up the hardware stack, and the efficiency gains from workload-specific architectures will compound over time. Third, the science of scaling is becoming more sophisticated: the Shannon Scaling Law, hyperparameter transfer frameworks, and proxy metrics for downstream performance all point toward a discipline that can extract more signal from each training dollar, even as the absolute number of dollars grows.
The binding constraint is shifting from "can we afford to train at this scale" to "can we build and operate the physical infrastructure fast enough" — a problem that is as much about construction timelines, power grid access, and geopolitical stability as it is about software or algorithms.




