technique

supervised fine-tuning

technique · active · supervised-fine-tuning-c23beecd · 7 events · first seen 1mo ago

Aliases: supervised fine-tuning, supervised finetuning, Supervised Fine-Tuning (SFT), Supervised Fine Tuning

More like this (12)

reinforcement fine-tuning, fine-tuning, importance-weighted supervised fine-tuning, Parameter-Efficient Fine-Tuning, finetuning, behavioral fine-tuning, instruction tuning, OpenAI Fine-Tuning, adapter fine-tuning, malicious fine-tuning, A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design, Test-Time Finetuning (TTFT)

Guides (1)

supervised fine-tuning · Concept

Supervised Fine-Tuning: Teaching an AI to Do Your Job

Recent events (7)

9 · OpenAI Blog · 1mo ago

Improving Language Understanding with Unsupervised Learning (GPT-1)

OpenAI published the GPT-1 paper in June 2018, demonstrating state-of-the-art results across diverse language tasks by combining transformer architectures with unsupervised pre-training followed by supervised fine-tuning. The approach is task-agnostic and scalable, showing that pre-training on large unlabeled text corpora and then fine-tuning on specific tasks yields strong generalization. This work established the foundational paradigm that would evolve into GPT-2, GPT-3, and subsequent large language models.

Frontier Model Releases · Open Weights Progress · Transformers · GPT-1 · OpenAI · +3 more

5 · arXiv · cs.CL · 22d ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.

Open Weights Progress · AI Safety Research · Reddit · LLUMI · Direct Preference Optimization (DPO) · +3 more
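
To make the idea of community-derived preference pairs concrete, here is a minimal sketch of how replies to the same post could be paired by vote score, with the higher-scored reply treated as "chosen". This is not the LLUMI pipeline: the record fields (`post_id`, `prompt`, `text`, `score`), the `min_gap` threshold, and the function name are all hypothetical, and the summary does not say how the authors filtered or paired replies.

```python
from itertools import combinations


def build_preference_pairs(replies, min_gap=5):
    """Pair replies to the same post; the higher-scored reply is 'chosen'.

    `replies` is a list of dicts with hypothetical keys:
    post_id, prompt, text, score (net upvotes minus downvotes).
    `min_gap` is an assumed threshold that drops near-ties.
    """
    by_post = {}
    for reply in replies:
        by_post.setdefault(reply["post_id"], []).append(reply)

    pairs = []
    for group in by_post.values():
        for a, b in combinations(group, 2):
            if abs(a["score"] - b["score"]) < min_gap:
                continue  # near-tie: too weak a signal to call a preference
            chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
            pairs.append(
                {
                    "prompt": chosen["prompt"],
                    "chosen": chosen["text"],
                    "rejected": rejected["text"],
                }
            )
    return pairs


# Made-up example data, for illustration only.
replies = [
    {"post_id": "p1", "prompt": "I feel stuck.", "text": "A", "score": 12},
    {"post_id": "p1", "prompt": "I feel stuck.", "text": "B", "score": 1},
    {"post_id": "p1", "prompt": "I feel stuck.", "text": "C", "score": 10},
    {"post_id": "p2", "prompt": "Bad week.", "text": "D", "score": 3},
]
pairs = build_preference_pairs(replies)
```

In a setup like this, the `chosen` texts alone could serve as SFT targets, while the full `prompt`/`chosen`/`rejected` triples match the shape that DPO-style training typically expects.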

4 · arXiv · cs.CL · 18d ago

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics · Enterprise Deployment Patterns · MIMIC-III · Llama 3.1 70B · quantization · +4 more
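
For readers unfamiliar with the Macro F1 metric reported above, the sketch below computes it from scratch: F1 is calculated per class and then averaged with equal weight, so rare classes count as much as common ones. The label names and predictions are invented for illustration and are not taken from MedSecId or the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical sentence-level provenance labels.
y_true = ["nursing", "nursing", "physician", "pharmacy"]
y_pred = ["nursing", "physician", "physician", "pharmacy"]
score = macro_f1(y_true, y_pred)
```

Here the per-class F1 values are 2/3, 2/3, and 1, so the macro average is 7/9.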

6 · arXiv · cs.LG · 23d ago

PEFT-Arena: Benchmarking Parameter-Efficient Finetuning via Stability-Plasticity Trade-offs

PEFT-Arena is a new benchmark that evaluates parameter-efficient finetuning methods jointly on downstream task performance and retention of pretrained general capabilities, framing the problem as a stability-plasticity dilemma. Across methods tested under comparable parameter budgets, orthogonal finetuning achieves the best Pareto frontier. The paper provides geometric analyses in both weight space (spectral/singular-value structure) and activation space (representation distortion metrics) to explain why different PEFT methods differ in forgetting behavior. A practical finding is that final SFT checkpoints often overshoot an optimal retention operating point, motivating path-wise rewinding as a post-hoc correction.

Evaluation and Benchmarking · Agent and Tool Ecosystem · stability-plasticity dilemma · orthogonal finetuning · +7 more
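
The Pareto-frontier framing can be illustrated with a small sketch: given a downstream task score (plasticity) and a retention score (stability) per method, a method is on the frontier if no other method is at least as good on both axes and strictly better on one. The method names and numbers below are placeholders, not results from PEFT-Arena, and the two-score representation is an assumption about how such a comparison might be set up.

```python
def pareto_front(points):
    """Return names of non-dominated entries.

    `points` maps a name to (task_score, retention_score);
    higher is better on both axes.
    """
    front = []
    for name, (task, retain) in points.items():
        dominated = any(
            t2 >= task and r2 >= retain and (t2 > task or r2 > retain)
            for other, (t2, r2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Placeholder scores for hypothetical methods.
methods = {
    "method_a": (0.80, 0.60),
    "method_b": (0.75, 0.90),
    "method_c": (0.70, 0.85),  # worse than method_b on both axes
    "method_d": (0.82, 0.55),
}
front = pareto_front(methods)
```

The same function could be applied to checkpoints along a single training run rather than to different methods, which is one way to see whether a final checkpoint has moved past a better trade-off point reached earlier.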

5 · arXiv · cs.CL · 23d ago

Cross-Annotator Preference Optimization (CAPO) for Learning Annotator-Specific Explanation Behavior

This paper investigates whether LLMs can learn and reproduce individual annotator-specific reasoning patterns, not just label choices, using two sentence-pair tasks (NLI and paraphrase judgment) with four annotators each. The authors find that annotator-specific patterns are weak at the single-annotation level but detectable after aggregation, and propose CAPO—a preference optimization method that contrasts a target annotator's response against other valid but less target-specific annotations. CAPO outperforms prompting and supervised fine-tuning baselines in capturing annotator-specific label-explanation behavior. The work suggests a path toward scalable annotation pipelines grounded in annotator histories rather than labels alone.

Evaluation and Benchmarking · Alignment and RLHF · Cross-Annotator Preference Optimization (CAPO) · Human Label Variation (HLV) · Natural Language Inference · +2 more

7 · arXiv · cs.CL · 22d ago

Reinforcement Learning Recruits a Pre-Existing 'Functional Welfare' Axis in Language Models

Researchers trained language models in a semantically neutral maze environment and extracted concept vectors for rewarded and punished trajectories, finding that RL recruits a pre-existing representational axis encoding functional welfare—how well or badly the system is doing relative to its goals. The punishment vector promotes failure tokens, aligns with negative emotion concepts, and induces refusal and uncertainty when used for steering; the reward vector is its near-antiparallel mirror. Critically, these vectors are effective in models before maze training and appear in pretrain-only models, suggesting the welfare axis pre-exists post-training rather than being created by it. The findings have implications for interpretability, alignment, and understanding how minimal reward signals can broadly reshape model behavior.

Evaluation and Benchmarking · AI Safety Research · Reinforcement Learning from Human Feedback · Concept Vector Extraction · LoRA · +3 more

6 · arXiv · cs.CL · 1mo ago

ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens

ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.

Evaluation and Benchmarking · Agent and Tool Ecosystem · functional token · GRPO · Latent-Anchored GRPO · +4 more