Entity · technique

TrajSafe

technique · active · trajsafe-2f354867 · 1 event · first seen Jun 2, 2026

Aliases: TrajSafe

Co-occurring entities

large language models · HarmAmp

More like this (12)

SciTraj TrajTok TurnTrout Traycer TRACE-ROUTER TraceLab SaliTrap CoTrace Tracer-Cloud PsychoSafe VeriTrace SimpleTrace

Recent events (1)

6 · arXiv · cs.CL · Jun 2, 2026

HarmAmp Benchmark and TrajSafe Monitor for Multi-Turn Harm Amplification in LLMs

This paper introduces HarmAmp, a benchmark covering twelve risk categories designed to evaluate how LLMs compound harm across multi-turn conversations, addressing two threat vectors: democratizing specialized harmful expertise and scaling harmful operations. The authors also propose TrajSafe, a proactive monitoring system that anticipates harmful conversational trajectories and intervenes by probing user intent or steering toward safer outputs. Experiments show that TrajSafe reduces multi-turn harmfulness while maintaining low over-refusal rates and preserving general model capabilities. The work highlights a gap in existing safety research, which focuses on single-turn evaluations rather than extended interaction dynamics.

Evaluation and Benchmarking · AI Safety Research · large language models · HarmAmp · TrajSafe