Supervised Fine-Tuning: Adapting Pretrained Models to Tasks

TL;DR: Supervised fine-tuning (SFT) is the step that turns a broadly capable pretrained language model into one that reliably performs a specific task. It sits at the center of modern LLM development pipelines, after pretraining and before alignment techniques like RLHF or DPO, and its tradeoffs around forgetting, compute, and data quality have become a primary axis of research as models scale.

Key takeaways

SFT was formalized as a paradigm in the 2018 GPT-1 paper: pretrain on large unlabeled corpora, then fine-tune on labeled task data — a pattern that seeded GPT-2, GPT-3, and the modern LLM stack.
The core tension in SFT is stability vs. plasticity: adapting to a new task risks overwriting general pretrained capabilities, a tradeoff PEFT-Arena (2026) quantifies and frames in terms of a Pareto frontier.
PEFT-Arena finds that final SFT checkpoints frequently overshoot the optimal retention operating point, motivating path-wise rewinding as a post-hoc correction.
Orthogonal fine-tuning achieves the best stability-plasticity Pareto frontier across PEFT methods tested under comparable parameter budgets.
SFT scales non-uniformly: in a clinical provenance categorization study, fine-tuning Llama-3 70B yields +7% Macro F1 over the base model in cross-domain evaluation, while the 8B variant shows only marginal gains from the same SFT procedure.
SFT is composable with downstream alignment: LLUMI uses SFT followed by DPO on community-derived preference pairs to match proprietary GPT-based models on mental health writing tasks.

What it is

Supervised fine-tuning (SFT) is the process of continuing to train a pretrained neural network on a curated set of labeled examples — (input, desired output) pairs — so that it reliably performs a specific task. In the LLM context, the base model's weights are initialized from a large pretraining run and then updated via standard gradient descent on the task dataset. The result is a model that retains the broad linguistic and world knowledge from pretraining while being shaped toward the target behavior.

The paradigm was formalized for language models in the 2018 GPT-1 paper, which demonstrated that combining transformer-based unsupervised pretraining with supervised fine-tuning yielded state-of-the-art results across diverse NLP tasks in a task-agnostic, scalable way. That two-stage recipe — pretrain on unlabeled text, fine-tune on labeled task data — became the template for GPT-2, GPT-3, and the modern LLM stack.

How it works

The mechanism is straightforward: given a pretrained model with weights θ, SFT minimizes a supervised loss (typically cross-entropy over target tokens) on a labeled dataset D = {(xᵢ, yᵢ)}. All or a subset of the model's parameters are updated. The key design decisions are:

Which parameters to update. Full SFT updates everything; parameter-efficient methods (LoRA, orthogonal fine-tuning, prefix tuning) freeze most weights and train a small adapter, reducing compute and forgetting risk.
Dataset quality and size. SFT is sensitive to label quality; noisy or misaligned labels propagate directly into model behavior.
Stopping point. Training too long overshoots the optimal checkpoint — a finding PEFT-Arena (2026) quantifies empirically, showing that final SFT checkpoints frequently sit past the Pareto-optimal retention point on the stability-plasticity curve.
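
The supervised loss described above can be sketched in a few lines. This is a minimal illustration in plain Python, not a training implementation: the function name `sft_loss`, the list-based logits, and the `loss_mask` convention (1 for target tokens, 0 for prompt tokens) are all assumptions made for the example.

```python
import math

def sft_loss(logits, targets, loss_mask):
    """Mean cross-entropy over positions where loss_mask is 1.

    logits:    list of per-position score lists (one score per vocab entry)
    targets:   list of target token ids, one per position
    loss_mask: list of 0/1 flags; 1 marks a target token (assumed convention)
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # masked positions (e.g. prompt tokens) add no loss
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]  # equals -log p(target)
        count += 1
    return total / count if count else 0.0

# Hypothetical toy input: vocabulary of 4 tokens, 3 positions, first is prompt.
toy_logits = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
toy_targets = [1, 0, 2]
toy_mask = [0, 1, 1]
print(sft_loss(toy_logits, toy_targets, toy_mask))
```

In a real pipeline the same quantity would be computed by a framework's cross-entropy routine over batched tensors; the sketch only shows which positions contribute and what each one contributes.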

Why it matters

SFT is the workhorse of LLM specialization. It is how a general-purpose base model becomes a coding assistant, a clinical note summarizer, a mental health writing aid, or a multimodal reasoning agent. Its position in the pipeline — after pretraining, before or alongside alignment techniques like RLHF and DPO — means that almost every deployed LLM has passed through at least one SFT stage.

It is also composable: LLUMI builds preference pairs from Reddit community endorsement signals (upvotes and downvotes), uses them for SFT and then DPO training, and is assessed by human evaluators across readability, empathy, connection, actionability, and safety, achieving performance comparable to proprietary GPT-based models on mental health writing tasks. ATLAS shows that standard SFT pipelines can train functional tokens for agentic operations and latent visual reasoning in multimodal models without any architectural changes.

The stability-plasticity problem

The central tension in SFT is forgetting. Adapting to a new task updates weights that also encode general pretrained capabilities; push too hard and the model becomes narrow. PEFT-Arena frames this as a stability-plasticity dilemma and evaluates PEFT methods jointly on downstream task performance and retention of pretrained capabilities under comparable parameter budgets. Its key findings:

Orthogonal fine-tuning achieves the best Pareto frontier across methods tested.
Final SFT checkpoints overshoot the optimal retention operating point, motivating path-wise rewinding — rolling back to an earlier checkpoint along the training trajectory — as a post-hoc correction.
Geometric analyses in both weight space (spectral/singular-value structure) and activation space (representation distortion) explain why different PEFT methods differ in forgetting behavior.
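
Path-wise rewinding can be pictured as a checkpoint-selection step over the training trajectory. The sketch below is one plausible way to do that and is not PEFT-Arena's actual procedure: the `Checkpoint` fields, the `min_retention` floor, and the selection rule are assumptions, and the scores are made-up illustrative numbers.

```python
from typing import NamedTuple

class Checkpoint(NamedTuple):
    step: int
    task_score: float       # downstream task metric, higher is better
    retention_score: float  # general-capability metric, higher is better

def pareto_front(checkpoints):
    """Return checkpoints not dominated on (task_score, retention_score)."""
    front = []
    for c in checkpoints:
        dominated = any(
            o.task_score >= c.task_score
            and o.retention_score >= c.retention_score
            and (o.task_score > c.task_score or o.retention_score > c.retention_score)
            for o in checkpoints
        )
        if not dominated:
            front.append(c)
    return front

def rewind(checkpoints, min_retention):
    """Pick the best-task checkpoint on the front with retention >= min_retention."""
    eligible = [c for c in pareto_front(checkpoints)
                if c.retention_score >= min_retention]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.task_score)

# Illustrative trajectory: task score rises while retention falls.
trajectory = [
    Checkpoint(100, 0.60, 0.95),
    Checkpoint(200, 0.70, 0.92),
    Checkpoint(300, 0.74, 0.85),
    Checkpoint(400, 0.75, 0.70),  # final checkpoint, past the retention floor
    Checkpoint(500, 0.73, 0.60),  # dominated by step 400
]
print(rewind(trajectory, min_retention=0.90))
```

With these numbers the final checkpoint has the highest task score but fails the retention floor, so the sketch rolls back to step 200.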

Scale dependence

SFT does not scale uniformly across model sizes. A clinical NLP study fine-tuned Llama-3 8B and 70B on sentence-level provenance categorization (MedSecId / MIMIC-III) and reported Macro F1 above 92% in-domain. In cross-domain evaluation, SFT substantially improved the 70B model (+7% Macro F1 over the base) while yielding only marginal gains for the 8B model on the same task, which suggests that larger base models have more latent capacity to absorb task-specific signal. Separately, a quantized fine-tuned 70B model outperformed its full-precision baseline while reducing compute, suggesting quantized SFT is viable for structured clinical tasks.
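
For readers unfamiliar with the metric quoted above, Macro F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The sketch below is a plain-Python illustration; the function name `macro_f1` and the toy labels are assumptions, and the study's own evaluation code is not shown in the source.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (hypothetical): class "A" F1 = 2/3, class "B" F1 = 0.8.
print(macro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"]))  # approximately 0.733
```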

Variants and alternatives

| Approach | Trainable params | Forgetting risk | Typical role |
|---|---|---|---|
| Full SFT | All | High (can overshoot) | Maximum task fit |
| LoRA / orthogonal FT | Tiny fraction | Lower; orthogonal FT best Pareto | Most customization |
| DPO / preference optimization | All or PEFT | Moderate | Post-SFT alignment |
| Prompt / prefix tuning | Smallest | Minimal | Light task steering |
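
To make the "tiny fraction" entry concrete for LoRA specifically: a rank-r adapter on a weight matrix of shape (d_out, d_in) trains two low-rank factors with r × (d_out + d_in) parameters in total, instead of the full d_out × d_in. The helper name and the 4096 × 4096, rank-8 example below are illustrative assumptions, not figures from the sources; other PEFT methods such as orthogonal fine-tuning count parameters differently.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, adapter, fraction) for one LoRA-adapted weight matrix."""
    full = d_out * d_in              # parameters updated by full fine-tuning
    adapter = rank * (d_out + d_in)  # factors of shape (d_out, rank) and (rank, d_in)
    return full, adapter, adapter / full

full, adapter, fraction = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, adapter, fraction)  # 16777216 65536 0.00390625
```

In this example the adapter trains about 0.4% of the parameters of the matrix it modifies.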

Where it's heading

Active research is pushing on three fronts: (1) forgetting mitigation, where path-wise rewinding and orthogonal fine-tuning are the options PEFT-Arena's results favor; (2) data efficiency, where LLUMI's use of community endorsement signals (Reddit upvotes/downvotes) as a substitute for expensive expert labeling points toward scalable SFT data pipelines in sensitive domains; and (3) pipeline integration, where ATLAS's demonstration that SFT can train agentic functional tokens without architectural changes suggests SFT will remain the entry point for new capability types even as RL-based alignment methods mature alongside it.

SFT vs. adjacent adaptation methods

| Method | Trainable params | Forgetting risk | Typical position in pipeline | Key tradeoff |
|---|---|---|---|---|
| Full SFT | All | High (can overshoot) | Post-pretraining | Max task fit; risks erasing general capabilities |
| PEFT (e.g. LoRA, orthogonal FT) | Tiny fraction | Lower | Post-pretraining | Preserves base; orthogonal FT best Pareto per PEFT-Arena |
| DPO / preference optimization | All or PEFT | Moderate | Post-SFT | Aligns outputs to preferences without explicit reward model |
| Prompt / prefix tuning | Smallest | Minimal | Post-pretraining | Lightest intervention; weakest task adaptation |

FAQ

What distinguishes SFT from pretraining?

Pretraining learns general representations from massive unlabeled corpora via self-supervised objectives; SFT continues training on smaller labeled (input, target) pairs to specialize the model for a task. GPT-1 established this two-stage split as the canonical LLM recipe.

Does SFT cause the model to forget what it learned during pretraining?

Yes. This is the stability-plasticity tradeoff. PEFT-Arena (2026) shows that final SFT checkpoints frequently overshoot an optimal retention point, eroding general capabilities; path-wise rewinding and parameter-efficient methods like orthogonal fine-tuning can mitigate this.

When does SFT scale well vs. poorly?

Scale matters: a clinical Llama-3 study found SFT gave +7% Macro F1 to the 70B model but only marginal gains to the 8B model on the same task, suggesting larger base models have more capacity to absorb task-specific signal.

Where does SFT sit relative to DPO and RLHF?

SFT typically precedes alignment steps: it teaches the model the task format and basic behavior, then DPO or RLHF refines outputs toward human preferences. LLUMI, for example, runs SFT first and then applies DPO on community-derived preference pairs.
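
As a rough illustration of what the DPO stage optimizes after SFT, the standard DPO objective for one preference pair compares how much the policy has shifted toward the chosen response versus the rejected one, relative to a frozen reference model (typically the SFT checkpoint). The sketch below assumes the four sequence log-probabilities are already computed; the function name, argument names, and `beta=0.1` default are illustrative choices, not details of LLUMI's setup.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (chosen shift - rejected shift))."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(z)) = softplus(-z), written in a numerically stable form.
    x = -z
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: no preference learned yet, loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has moved toward the chosen response: loss drops below log(2).
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The reference terms are what let DPO work without an explicit reward model: the implicit reward is the log-probability shift away from the reference.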

Can SFT be used in agentic or multimodal pipelines?

Yes — ATLAS demonstrates that functional tokens for agentic operations and latent visual reasoning can be trained with standard SFT pipelines without architectural changes, though sparse reward signals may require stabilization techniques like LA-GRPO.

More on supervised fine-tuning (6)

OpenAI Blog

Improving Language Understanding with Unsupervised Learning (GPT-1)

OpenAI published the GPT-1 paper in June 2018, demonstrating state-of-the-art results across diverse language tasks by combining transformer architectures with unsupervised pre-training followed by supervised fine-tuning. The approach is task-agnostic and scalable, showing that pre-training on large unlabeled text corpora and then fine-tuning on specific tasks yields strong generalization. This work established the foundational paradigm that would evolve into GPT-2, GPT-3, and subsequent large language models.

arXiv · cs.CL

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.

arXiv · cs.CL

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

arXiv · cs.LG

PEFT-Arena: Benchmarking Parameter-Efficient Finetuning via Stability-Plasticity Trade-offs

PEFT-Arena is a new benchmark that evaluates parameter-efficient finetuning methods jointly on downstream task performance and retention of pretrained general capabilities, framing the problem as a stability-plasticity dilemma. Across methods tested under comparable parameter budgets, orthogonal finetuning achieves the best Pareto frontier. The paper provides geometric analyses in both weight space (spectral/singular-value structure) and activation space (representation distortion metrics) to explain why different PEFT methods differ in forgetting behavior. A practical finding is that final SFT checkpoints often overshoot an optimal retention operating point, motivating path-wise rewinding as a post-hoc correction.

arXiv · cs.CL

Cross-Annotator Preference Optimization (CAPO) for Learning Annotator-Specific Explanation Behavior

This paper investigates whether LLMs can learn and reproduce individual annotator-specific reasoning patterns, not just label choices, using two sentence-pair tasks (NLI and paraphrase judgment) with four annotators each. The authors find that annotator-specific patterns are weak at the single-annotation level but detectable after aggregation, and propose CAPO—a preference optimization method that contrasts a target annotator's response against other valid but less target-specific annotations. CAPO outperforms prompting and supervised fine-tuning baselines in capturing annotator-specific label-explanation behavior. The work suggests a path toward scalable annotation pipelines grounded in annotator histories rather than labels alone.

arXiv · cs.CL

Reinforcement Learning Recruits a Pre-Existing 'Functional Welfare' Axis in Language Models

Researchers trained language models in a semantically neutral maze environment and extracted concept vectors for rewarded and punished trajectories, finding that RL recruits a pre-existing representational axis encoding functional welfare—how well or badly the system is doing relative to its goals. The punishment vector promotes failure tokens, aligns with negative emotion concepts, and induces refusal and uncertainty when used for steering; the reward vector is its near-antiparallel mirror. Critically, these vectors are effective in models before maze training and appear in pretrain-only models, suggesting the welfare axis pre-exists post-training rather than being created by it. The findings have implications for interpretability, alignment, and understanding how minimal reward signals can broadly reshape model behavior.
