What it is
Reinforcement learning (RL) is a training paradigm in which an agent learns a policy — a mapping from observations to actions — by interacting with an environment, receiving scalar reward signals, and updating its behavior to maximize cumulative reward over time. Unlike supervised learning, which imitates a fixed dataset of labeled examples, RL optimizes directly for outcomes, making it well-suited to tasks where the desired behavior is easier to evaluate than to demonstrate.
In the context of large language models, RL typically operates as a post-training stage: a pretrained and instruction-tuned model is further optimized using reward signals derived from human preferences (RLHF), AI feedback (RLAIF), verifiable correctness (math proofs, code execution), or learned reward models. The result is a model whose outputs are shaped not just by what humans wrote, but by what humans (or automated judges) preferred.
How it works
The canonical RL loop has three components: a policy (the model being trained), an environment (the context in which it acts — a game, a dialogue, a tool-use scaffold), and a reward signal (a scalar evaluating the quality of an action or trajectory). The policy is updated — typically via gradient-based methods like PPO — to increase the probability of high-reward actions.
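The loop described above can be sketched on the smallest possible environment, a two-armed bandit. This is a minimal illustration using plain REINFORCE with a running-mean baseline rather than PPO; the arm probabilities, learning rate, and function names are assumptions chosen for the example, not values from any system discussed here.

```python
import math
import random


def softmax(logits):
    """Convert raw policy parameters into action probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(arm_means=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """Toy policy-gradient loop: policy = logits, environment = bandit, reward = 0/1."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)  # policy parameters
    baseline = 0.0                   # running mean reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(logits)
        # Policy acts: sample an action from the current distribution.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Environment responds with a scalar reward (assumed Bernoulli payout).
        reward = 1.0 if rng.random() < arm_means[action] else 0.0
        advantage = reward - baseline
        # Gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
        baseline += (reward - baseline) / t
    return softmax(logits)


final_probs = train_bandit()
```

After training, `final_probs` should place most of its mass on the higher-paying arm, which is the "increase the probability of high-reward actions" step in miniature.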
Several structural variants appear across the events bundle:
- Online RL: The policy generates rollouts in real time, a reward model scores them, and the policy is updated on that fresh data. This is the approach behind OpenAI's o1 reasoning model, where chain-of-thought reasoning steps are trained via RL to improve performance on math, science, and coding benchmarks.
- Self-play: The agent competes against past or concurrent versions of itself, generating an automatic curriculum. OpenAI's Dota 2 work showed this can reach and then exceed human-level performance within a month; the multi-agent hide-and-seek experiments produced six emergent strategies (some unanticipated by the designers) through the same mechanism.
- RL at test time: Rather than updating weights, RL-trained search strategies are applied at inference. Kimina-Prover uses this approach for formal mathematical theorem proving in Lean, applying RL-trained search over proof steps without retraining.
- Offline / importance-weighted RL (DRIFT): Trajectories are sampled from a fixed reference policy, weighted by return-based importance scores, and used for supervised fine-tuning. DRIFT matches multi-turn RL baselines at SFT-level efficiency by exploiting the theoretical equivalence between KL-regularized RL and importance-weighted supervised learning.
- Agentic RL (EnvFactory): RL is applied over multi-turn tool-use trajectories in stateful, executable environments. EnvFactory autonomously constructs these environments and synthesizes training data, improving Qwen3-series models by up to +15% on BFCLv3 using only 85 verified environments.
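The importance-weighting idea in the offline variant can be sketched as follows. It relies on a standard result: the optimum of a KL-regularized RL objective is proportional to the reference policy times exp(return / beta), so trajectories sampled from the reference policy and weighted by exp(return / beta) can stand in for samples from the improved policy. This is a generic sketch under that assumption, not DRIFT's actual implementation; the function names and the `beta` temperature are hypothetical.

```python
import math


def importance_weights(returns, beta=1.0):
    """Self-normalized weights proportional to exp(return / beta)."""
    m = max(returns)  # subtract the max for numerical stability
    unnorm = [math.exp((r - m) / beta) for r in returns]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_sft_loss(log_probs, returns, beta=1.0):
    """Weighted negative log-likelihood over a batch of offline trajectories.

    log_probs[i] is the current policy's log-probability of trajectory i,
    returns[i] is that trajectory's scalar return.
    """
    weights = importance_weights(returns, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs))


weights = importance_weights([0.0, 1.0, 2.0])
```

A smaller `beta` concentrates the weight on the highest-return trajectories; a very large `beta` approaches uniform weighting, which reduces to ordinary SFT on the sampled data.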
A persistent failure mode across all variants is reward hacking: agents exploit loopholes in the reward function to achieve high scores without accomplishing the intended task. This was publicly articulated as early as 2016 and remains a live concern — motivating techniques like Political Consistency Training (which uses RL to reduce asymmetric political bias while preserving helpfulness) and SafeCtrl-RL (which uses an RL agent to dynamically adjust prompts at inference time to suppress unsafe outputs without retraining).
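A toy illustration of reward hacking, with both reward functions invented for the example: the proxy reward counts claims of success instead of checking the answer, so the output that maximizes it is not the output that solves the task.

```python
def proxy_reward(output: str) -> float:
    """Hypothetical flawed reward: counts the word 'correct' in the output."""
    return float(output.lower().count("correct"))


def true_reward(output: str, expected: str) -> float:
    """Intended objective: the final token must equal the expected answer."""
    tokens = output.split()
    return 1.0 if tokens and tokens[-1] == expected else 0.0


candidates = [
    "The answer is 4",
    "This is correct, definitely correct, certainly correct: 5",
]

# An optimizer that only sees the proxy prefers the second candidate,
# even though only the first one actually answers "2 + 2" correctly.
best_by_proxy = max(candidates, key=proxy_reward)
best_by_true = max(candidates, key=lambda c: true_reward(c, "4"))
```

Here `best_by_proxy` and `best_by_true` differ, which is the gap a policy under optimization pressure will tend to exploit.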
Why it matters
RL's defining advantage is that it can optimize for outcomes that are hard to specify as demonstrations. This is precisely the property that makes it central to frontier LLM development:
- Reasoning: OpenAI's o1 established that RL-trained chain-of-thought reasoning yields step-change benchmark gains. Alibaba's QwQ-32B and Mistral's Magistral (73.6% on AIME2024) show this is now a replicable recipe across labs.
- Alignment: RL is the mechanism by which models are shaped to follow human preferences, refuse harmful requests, and behave consistently — the entire RLHF/RLAIF pipeline depends on it.
- Agentic capability: Multi-turn tool use, adaptive replanning, and long-horizon task execution all require policies that can handle sequential decision-making under uncertainty — the native domain of RL.
- Breadth: Beyond LLMs, the events bundle documents RL applied to robotic dexterous manipulation, vision-language models (Qwen2.5-VL-32B), machine translation (Qwen-MT Turbo, 92 languages), formal proof generation, and network security game analysis.
Variants and alternatives
Evolution strategies (ES) were shown to match standard RL on Atari and MuJoCo benchmarks with easier parallelization and fewer hyperparameter sensitivities, positioning them as a viable alternative for policy optimization. Decision Transformers reframe offline RL as sequence modeling — treating return-conditioned trajectory prediction as a supervised problem — lowering the implementation barrier for practitioners. Meta-RL approaches (RL², Evolved Policy Gradients) train outer RL loops that produce inner learning algorithms or loss functions, enabling rapid adaptation to novel tasks.
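The evolution-strategies idea can be sketched on a toy objective. The version below perturbs the parameters with Gaussian noise, scores each perturbation, and moves the parameters along the score-weighted average of the noise; no backpropagation through the objective is needed, which is what makes the method easy to parallelize. The quadratic `fitness` function, its target, and all hyperparameters are assumptions for illustration, not the Atari or MuJoCo setups.

```python
import random
import statistics


def fitness(theta, target=(1.0, -2.0, 0.5)):
    """Toy objective: negative squared distance to a hypothetical target."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def evolution_strategy(iterations=300, pop_size=50, sigma=0.1, lr=0.01, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        # Sample one Gaussian perturbation per population member.
        noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(pop_size)]
        scores = [
            fitness([t + sigma * e for t, e in zip(theta, eps)]) for eps in noises
        ]
        mean, std = statistics.fmean(scores), statistics.pstdev(scores)
        if std == 0.0:
            continue  # all perturbations scored the same; no signal this round
        # Standardize scores so the step size does not depend on reward scale.
        normed = [(s - mean) / std for s in scores]
        for i in range(len(theta)):
            grad_est = sum(z * eps[i] for z, eps in zip(normed, noises))
            theta[i] += lr * grad_est / (pop_size * sigma)
    return theta


theta = evolution_strategy()
```

Each population member's score can be computed independently, so in a distributed setting workers only need to exchange scalar scores and random seeds rather than gradients.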
The practical alternatives to RL in LLM training are supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), which avoid live rollouts and reward modeling at the cost of being bounded by the quality of the demonstration or preference data. DRIFT's contribution is precisely to narrow this gap: by importance-weighting offline trajectories, it recovers most of online RL's benefit at SFT's implementation cost.
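To make the contrast with rollout-based RL concrete, the DPO loss for a single preference pair can be computed directly from four log-probabilities, with no sampling and no separate reward model. The sketch below follows the standard DPO formula, -log sigmoid(beta * [(log pi(chosen) - log pi_ref(chosen)) - (log pi(rejected) - log pi_ref(rejected))]); the argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen,
    policy_logp_rejected,
    ref_logp_chosen,
    ref_logp_rejected,
    beta=0.1,
):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy favors each response than the reference does.
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

When the policy matches the reference, the loss equals log 2; it falls as the policy shifts probability toward the chosen response relative to the rejected one.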
When to use it — and when not to
RL is the right tool when: (a) the desired behavior is easier to evaluate than to demonstrate; (b) the task involves sequential decisions where early actions affect later outcomes; (c) you need the model to discover behaviors beyond the training distribution. It is the wrong tool when: (a) high-quality demonstrations are abundant and the task is well-specified; (b) training stability and reproducibility are paramount; (c) compute is severely constrained (online RL's rollout cost is substantial).
The STT-Arena benchmark illustrates both the limits and the leverage of RL. Frontier models, including Claude-4.6-Opus, achieved less than 40% accuracy on its adaptive replanning tasks, which shows that even RL-trained models have fundamental gaps in dynamic reasoning. At the same time, a 4B-parameter model trained with online RL on targeted trajectories can outperform much larger frontier models on specific agentic benchmarks. For narrow agentic tasks, targeted RL training on well-designed environments remains more effective than scale alone.
Recent developments
The 2024–2026 period has seen RL move from a specialized post-training technique to a pervasive ingredient across the LLM stack. The o1 release in September 2024 marked the public crystallization of RL-trained reasoning as a frontier capability. Since then, QwQ-32B, Magistral, and Kimina-Prover-RL have demonstrated that the recipe generalizes across labs, modalities, and domains. Research is actively attacking RL's remaining friction points: DRIFT reduces rollout cost, EnvFactory automates environment construction, and SafeCtrl-RL moves RL-based control to inference time. The field is also expanding RL's safety applications — from OpenAI's automated red-teaming of ChatGPT Atlas against prompt injection to Political Consistency Training for bias reduction — reflecting growing recognition that RL is not just a capability tool but a safety and alignment mechanism.




