Almanac
technique

TrajSafe

technique · active · provisional · trajsafe-2f354867 · 1 event · first seen 15d ago

Aliases: TrajSafe

Recent events (1)

6 · arXiv · cs.CL · 15d ago

HarmAmp Benchmark and TrajSafe Monitor for Multi-Turn Harm Amplification in LLMs

This paper introduces HarmAmp, a benchmark covering twelve risk categories designed to evaluate how LLMs compound harm across multi-turn conversations, addressing two threat vectors: democratizing specialized harmful expertise and scaling harmful operations. The authors also propose TrajSafe, a proactive monitoring system that anticipates harmful conversational trajectories and intervenes by probing user intent or steering toward safer outputs. Experiments show TrajSafe reduces multi-turn harmfulness while maintaining low over-refusal rates and preserving general model capabilities. The work highlights a gap in existing safety research that focuses on single-turn evaluations rather than extended interaction dynamics.