Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes
Recent events (1)
AIMS dataset and intent-aware training improve LLM safety classification across multiple regimes
Researchers introduce AIMS, a human-annotated dataset of 1,724 difficult safety prompts, each paired with an intent description and a harm label, designed to study intent-aware training for LLM safety classifiers. The paper evaluates intent-aware training across four regimes: supervised fine-tuning (SFT), direct preference optimization (DPO), reasoning distillation, and GRPO reinforcement learning. Directly rewarding intent faithfulness via GRPO yields the strongest average performance across five external safety benchmarks. Intent-conditioned distillation also outperforms reasoning-only distillation in most teacher-student pairs, and intent-aware models make up the Pareto frontier of inference latency versus F1. The work argues that explicit modeling of user intent is a compact, high-quality supervision signal for more robust safety classification.
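For readers unfamiliar with the term, a model is on the latency-F1 Pareto frontier when no other model is both at least as fast and at least as accurate, with a strict improvement on one of the two. The sketch below shows one way such a frontier could be computed. The `ModelPoint` type, the model names, and all numbers are hypothetical placeholders chosen for illustration; they are not results from the paper, and this is not the authors' evaluation code.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    latency_ms: float  # inference latency, lower is better
    f1: float          # classification F1, higher is better


def pareto_frontier(points):
    """Keep each point that strictly beats the F1 of every faster-or-equal point.

    Exact duplicates (same latency and same F1) are reported only once.
    """
    frontier = []
    best_f1 = float("-inf")
    # Sort by latency ascending; on ties, put the higher-F1 point first.
    for p in sorted(points, key=lambda p: (p.latency_ms, -p.f1)):
        if p.f1 > best_f1:
            frontier.append(p)
            best_f1 = p.f1
    return frontier


# Hypothetical placeholder measurements, NOT taken from the paper.
points = [
    ModelPoint("model_a", 10.0, 0.70),
    ModelPoint("model_b", 20.0, 0.65),  # slower and worse than model_a
    ModelPoint("model_c", 30.0, 0.80),
    ModelPoint("model_d", 30.0, 0.75),  # same latency as model_c, lower F1
]

frontier = pareto_frontier(points)
frontier_names = [p.name for p in frontier]
print(frontier_names)
```

With these placeholder values, only `model_a` and `model_c` remain: `model_b` is dominated by `model_a`, and `model_d` by `model_c`. The summary's claim corresponds to the case where every point surviving this filter is an intent-aware model.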