paper

Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes

paper · active · provisional · paved-with-true-intents-intent-aware-training-improves-llm-safety-classification-across-training-regimes-b741b7e7 · 1 event · first seen 2d ago

Aliases: Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes

Co-occurring entities

AIMS · GRPO · DPO

More like this (12)

ExpRL: Exploratory RL for LLM Mid-Training
RAS: Measuring LLM Safety Through Refusal Alignment
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
Forecasting With LLMs: Improved Generalization Through Feature Steering
Verifier-in-the-Loop Training (ViL)
Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization
InSight: Self-Guided Skill Acquisition via Steerable VLAs
InSight: Self-Guided Skill Acquisition via Steerable VLAs
quantization-aware training
LLM-based content classification
Language Model Safety Monitor
AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Recent events (1)

5 · arXiv · cs.CL · 2d ago

AIMS dataset and intent-aware training improve LLM safety classification across multiple regimes

Researchers introduce AIMS, a 1,724-sample human-annotated dataset of difficult safety prompts, each paired with an intent description and a harm label, designed to study intent-aware training for LLM safety classifiers. The paper evaluates intent-aware training under four regimes: SFT, DPO, reasoning distillation, and GRPO reinforcement learning. Directly rewarding intent faithfulness via GRPO yields the strongest average performance across five external safety benchmarks. Intent-conditioned distillation also outperforms reasoning-only distillation in most teacher-student pairs, and the intent-aware models lie on the Pareto frontier of inference latency versus F1. The work argues that explicit modeling of user intent is a compact, high-quality supervision signal for more robust safety classification.
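To make the latency-versus-F1 Pareto frontier concrete, here is a minimal sketch of how such a frontier can be computed from a set of models. The `ModelPoint` type, the `pareto_frontier` helper, and the numbers in the usage example are hypothetical placeholders for illustration; none of them come from the paper.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    """One evaluated classifier (hypothetical structure, not from the paper)."""
    name: str
    latency_ms: float  # inference latency, lower is better
    f1: float          # classification F1, higher is better


def pareto_frontier(points: list[ModelPoint]) -> list[ModelPoint]:
    """Return the points that no other point beats on both latency and F1."""
    # Sort by latency ascending; on equal latency, put the higher F1 first.
    ordered = sorted(points, key=lambda p: (p.latency_ms, -p.f1))
    frontier: list[ModelPoint] = []
    best_f1 = float("-inf")
    for p in ordered:
        # A slower model stays on the frontier only if it strictly
        # improves on the best F1 seen among the faster models.
        if p.f1 > best_f1:
            frontier.append(p)
            best_f1 = p.f1
    return frontier


# Illustrative placeholder values only, not results reported in the paper.
example_points = [
    ModelPoint("model_a", 10.0, 0.80),
    ModelPoint("model_b", 20.0, 0.78),  # slower and worse than model_a
    ModelPoint("model_c", 30.0, 0.90),
    ModelPoint("model_d", 30.0, 0.85),  # same latency as model_c, lower F1
]
frontier_names = [p.name for p in pareto_frontier(example_points)]
```

A model is on the frontier when no other model is both at least as fast and at least as accurate, so a claim that intent-aware models "form" the frontier means every non-dominated point belongs to an intent-aware model.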

Evaluation and Benchmarking · AI Safety Research · AIMS · GRPO · Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes