technique
TrajSafe
technique · active · provisional
trajsafe-2f354867 · 1 event · first seen 15d ago
Aliases: TrajSafe
Recent events (1)
HarmAmp Benchmark and TrajSafe Monitor for Multi-Turn Harm Amplification in LLMs
This paper introduces HarmAmp, a benchmark covering twelve risk categories designed to evaluate how LLMs compound harm across multi-turn conversations, addressing two threat vectors: democratizing specialized harmful expertise and scaling harmful operations. The authors also propose TrajSafe, a proactive monitoring system that anticipates harmful conversational trajectories and intervenes by probing user intent or steering toward safer outputs. Experiments show TrajSafe reduces multi-turn harmfulness while maintaining low over-refusal rates and preserving general model capabilities. The work highlights a gap in existing safety research that focuses on single-turn evaluations rather than extended interaction dynamics.
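The summary does not describe TrajSafe's internals, so the following is only a toy sketch of the general idea of trajectory-level monitoring, not the authors' method. It assumes a hypothetical per-turn risk score in [0, 1] (which in practice would come from some classifier), and the names `TrajectoryMonitor`, `PROBE_THRESHOLD`, `STEER_THRESHOLD`, and `DECAY` are all invented for illustration. It shows how risk accumulated across turns could trigger a graded intervention (probe intent first, then steer) even when no single turn looks harmful on its own.

```python
from dataclasses import dataclass, field

# All constants below are assumed values chosen for illustration only;
# they do not come from the paper.
PROBE_THRESHOLD = 0.5   # accumulated risk at which the monitor asks about intent
STEER_THRESHOLD = 0.8   # accumulated risk at which the monitor steers the reply
DECAY = 0.7             # weight given to risk carried over from earlier turns


@dataclass
class TrajectoryMonitor:
    """Toy monitor that scores the whole conversation, not single messages."""
    risk: float = 0.0
    history: list = field(default_factory=list)

    def observe(self, turn_risk: float) -> str:
        # Decayed accumulation: several mildly risky turns can add up
        # even though none of them crosses a threshold alone.
        self.risk = min(1.0, DECAY * self.risk + turn_risk)
        self.history.append(self.risk)
        if self.risk >= STEER_THRESHOLD:
            return "steer"   # redirect toward a safer output
        if self.risk >= PROBE_THRESHOLD:
            return "probe"   # ask a clarifying question about user intent
        return "allow"


def run(turn_risks):
    """Feed a sequence of hypothetical per-turn risk scores to a fresh monitor."""
    monitor = TrajectoryMonitor()
    return [monitor.observe(r) for r in turn_risks]
```

Under these assumed settings, a conversation whose turns each score 0.3 is allowed at first, probed from the second turn, and steered by the fifth, while a conversation of turns scoring 0.1 is never interrupted. A single-turn filter with the same thresholds would pass every one of those messages, which is the gap the summary attributes to single-turn evaluation.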