Almanac

Learning path

Training Infrastructure: The Stack Behind Modern AI

Who builds the hardware, clouds, tools, and models that make large-scale AI training possible? This path traces the full stack — from the chips and cloud platforms that supply raw compute, to the frameworks that make it usable, to the labs and architectures that push the frontier. Each step adds a layer, so read them in order for the clearest picture.

Mixed level · 8 steps · ~42 min

  1. NVIDIA

    Start at the bottom of the stack: NVIDIA's GPUs are the dominant compute substrate, and almost all large-model training runs on them.

  2. Amazon Web Services

    With the hardware in mind, this step covers how cloud providers like AWS rent out that compute at scale. This is the layer most labs actually use to train.

  3. Microsoft

    Microsoft's deep partnership with OpenAI, built on Azure, shows how a hyperscaler builds dedicated training infrastructure around a frontier lab.

  4. Hugging Face

    Hugging Face is the central tooling and model-hub layer that practitioners use to manage, share, and fine-tune models on top of that infrastructure.

  5. Mixture of Experts

    Mixture of Experts is the architectural choice that lets labs train models with far more parameters without a proportional increase in compute per token, because only a few experts run for each token. It is a key infrastructure-level design decision.

  6. OpenAI

    OpenAI's history illustrates how a lab's training infrastructure ambitions — from GPT-3 to GPT-4 — shaped the entire field's compute expectations.

  7. Anthropic

    Anthropic offers a contrasting case: a safety-focused lab that built its own training practices and infrastructure choices around Constitutional AI and large-scale RLHF.

  8. DeepSeek V4

    DeepSeek V4 is the frontier example of training efficiency — a model whose infrastructure story challenged assumptions about how much compute a top-tier model requires.
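
As a rough illustration of the Mixture of Experts claim in step 5, the sketch below counts feed-forward parameters for a single MoE layer and compares the total stored with the number actually used per token. The function name `moe_ffn_params` and the layer sizes are hypothetical values chosen for the arithmetic, not figures from any model named in this path. It assumes a simple two-matrix feed-forward expert and a linear router, and it ignores biases.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active_per_token) feed-forward parameter counts for one MoE layer.

    Assumes each expert is an up-projection plus a down-projection (biases ignored)
    and that a linear router scores every expert for every token.
    """
    per_expert = 2 * d_model * d_ff        # up- and down-projection weights
    router = d_model * num_experts         # router weights, always used
    total = num_experts * per_expert + router
    active = top_k * per_expert + router   # only top_k experts run per token
    return total, active


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
ratio = total / active
print(f"total={total:,} active={active:,} ratio={ratio:.2f}")
```

With these illustrative sizes the layer stores about four times as many feed-forward parameters as it uses for any one token, which is the sense in which model size can grow faster than per-token compute. Memory and communication costs still scale with the total, so the saving applies to compute rather than to the whole training budget.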