Learning path

Alignment and RLHF: How AI learns to behave

How do you take a raw language model and make it helpful, honest, and safe? This path traces the ideas and techniques behind AI alignment — from the foundational concept of reinforcement learning, through the human-feedback methods that shaped today's assistants, to the newer algorithms pushing the field forward. It ends with a look at the labs and tools doing this work in practice.

Suitable for readers who know roughly what a language model is and want to understand the alignment layer on top of it. Steps build on each other, so read in order.

Mixed level · 9 steps · ~56 min


Large Language Models
Start here: a grounding in what large language models are, so the alignment techniques that follow have something concrete to attach to.
Beginner · In-depth
Reinforcement Learning
Alignment methods borrow heavily from RL — understanding reward signals and policy optimization makes every subsequent step click.
Beginner · In-depth
Direct Preference Optimization (DPO)
A leaner alternative to RLHF that skips the separately trained reward model and optimizes directly on preference pairs. Understanding it sharpens your picture of what RLHF is actually doing.
Beginner · In-depth
GRPO
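Group Relative Policy Optimization: a more recent RL algorithm that drops PPO's separate value model and instead scores each sampled response against the others in its group, with the aim of being more stable and compute-efficient than PPO-based RLHF. Much recent work builds on it.
Beginner · In-depth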
Chain-of-Thought Reasoning
Reasoning traces are increasingly used as a training signal in alignment pipelines, so seeing how chain-of-thought works in context matters here.
Beginner · In-depth
LoRA
Fine-tuning aligned models on new tasks is often done with LoRA — a practical bridge between alignment research and real deployment.
Beginner · In-depth
OpenAI
The lab that introduced RLHF at scale and whose model lineage (InstructGPT → GPT-5.5) is the canonical case study for these techniques.
Beginner · In-depth
Anthropic
Anthropic was founded specifically around alignment research and developed Constitutional AI — a distinct approach worth comparing to OpenAI's.
Beginner · In-depth
Hugging Face
Hugging Face hosts the open-source tooling (TRL, alignment-handbook) that makes these techniques accessible to practitioners outside the big labs.
Beginner · In-depth
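
For a concrete preview of the DPO and GRPO steps, here is a minimal sketch of the two quantities at the heart of those methods: the per-pair DPO loss and the group-relative advantage used by GRPO. It is an illustration under stated assumptions, not any lab's or library's implementation. The function names, the `beta` default, and the `eps` term are hypothetical choices, and the inputs are plain floats standing in for summed token log-probabilities and scalar rewards.

```python
import math
from statistics import mean, pstdev


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    The margin is how much more the policy favors the chosen response
    over the rejected one, measured relative to a frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each reward standardized within its group.

    Assumption: population standard deviation, with a small eps to
    avoid division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Policy prefers the chosen answer more than the reference does,
    # so the loss falls below log(2), its value at zero margin.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
    # Four sampled answers to one prompt, two rewarded and two not.
    print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Note that neither function needs a learned reward model or value model at this point: DPO reads the preference signal straight from log-probability ratios, and GRPO replaces the critic's baseline with the group mean.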
