Entity · technique

DPO

technique · active · dpo-0d8f2bdd · 5 events · first seen May 18, 2026

Aliases: DPO

Co-occurring entities

GRPO PPO AIMS Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes Can LLMs Reliably Self-Report Adversarial Prefills, and How? SFT ms-swift Llama ModelScope Qwen3 AdvGRPO Hugging Face TRL

More like this (12)

DDPO DPPO DualDPO PPO DDPM SDPO Direct Preference Optimization (DPO) DOPD SPPO TRPO DAPO DoRA

Recent events (5)

5 · arXiv · cs.CL · Jun 26, 2026

AIMS dataset and intent-aware training improve LLM safety classification across multiple regimes

Researchers introduce AIMS, a 1,724-sample human-annotated dataset of difficult safety prompts paired with intent descriptions and harm labels, designed to study intent-aware training for LLM safety classifiers. The paper evaluates intent-aware training across SFT, DPO, reasoning distillation, and GRPO reinforcement learning, finding that directly rewarding intent faithfulness via GRPO yields the strongest average performance across five external safety benchmarks. Intent-conditioned distillation also outperforms reasoning-only distillation in most teacher-student pairs, and intent-aware models lie on the Pareto frontier of inference latency versus F1. The work argues that explicit user intent modeling is a compact, high-quality supervision signal for more robust safety classification.

Evaluation and Benchmarking AI Safety Research AIMS GRPO Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes +1 more

6 · arXiv · cs.CL · Jun 23, 2026

LLMs fail to reliably self-report adversarial prefill attacks, study finds

A new arXiv paper evaluates whether LLMs can recognize that their own prior responses were elicited by adversarial prefill attacks, testing ten open-weight models (3B–70B) across four safety benchmarks. Models claim intent on prefilled responses only 27.3% of the time on average, and introspective signal is largely mediated by refusal-related reasoning. Three LoRA fine-tuning methods (SFT, GRPO, DPO) improve the intention-probe gap but counterintuitively raise attack success rates on most models, suggesting partial and fragile mitigation. The findings raise concerns about the reliability of LLM self-reports in safety-critical contexts.

Evaluation and Benchmarking AI Safety Research GRPO DPO Can LLMs Reliably Self-Report Adversarial Prefills, and How? +1 more

5 · GitHub Trending · Jun 12, 2026

ms-swift: ModelScope framework for fine-tuning 600+ LLMs and 300+ MLLMs

ms-swift is an open-source Python framework from ModelScope supporting PEFT and full-parameter fine-tuning methods (CPT, SFT, DPO, GRPO) across 600+ LLMs and 300+ multimodal LLMs, including Qwen3, DeepSeek, Llama4, and others. The project has accumulated 14,487 GitHub stars and was accepted at AAAI 2025. It serves as a broad-coverage training harness for the current generation of open-weights frontier models.

Open Weights Progress Agent and Tool Ecosystem ms-swift GRPO DPO +3 more

5 · arXiv · cs.CL · Jun 9, 2026

AdvGRPO: Stable co-training framework for adaptive red teaming of language models

Researchers introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization in LLM red teaming, addressing previously reported instability. The method uses dense multi-channel rewards and decoupled advantage normalization, with a curriculum progressing from single-turn to multi-turn attacks before bootstrapping co-training. Co-trained defenders outperform baselines on safety benchmarks, and the attacks show transferability across models.

AI Safety Research Alignment and RLHF AdvGRPO GRPO PPO +1 more

6 · Hugging Face Blog · May 18, 2026

TRL v1.0: Post-Training Library Built to Move with the Field

Hugging Face has released TRL v1.0, a major milestone for its post-training library focused on reinforcement learning from human feedback and related alignment techniques. The release signals a stabilization of the API and feature set after iterative development tracking the rapidly evolving post-training landscape. TRL is widely used in the open-source community for fine-tuning and aligning language models using methods such as PPO, DPO, and GRPO.

Open Weights Progress Agent and Tool Ecosystem GRPO PPO DPO +3 more