Almanac
technique

SafeVec

technique · active · provisional · safevec-1d8b0053 · 1 event · first seen 3h ago

Aliases: SafeVec

Recent events (1)

6 · arXiv · cs.CL · 3h ago

SafeVec and RAS: White-box LLM safety evaluation via internal refusal representations

Researchers introduce SafeVec, a white-box safety evaluation procedure that measures LLM safety from internal hidden-state representations rather than generated outputs. The method extracts layer-wise refusal directions from a safety-aligned reference model, identifies stable layers where safe and unsafe behaviors are separable, and scores target models via a calibrated 0-100 Refusal Alignment Score (RAS). Evaluated across Llama, Gemma, and Qwen model families, RAS distinguishes aligned from uncensored/abliterated variants and correlates with output-level attack success rates while being substantially faster than judge-based evaluation. The approach addresses key limitations of output-level safety evals: cost, judge sensitivity, and dependence on fixed question banks.
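
The pipeline summarized above (layer-wise directions, stable-layer selection, a 0-100 score) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a difference-of-means estimator for the refusal direction, a hypothetical separability threshold (`min_sep`) for choosing stable layers, and a hypothetical clipped-ratio calibration. It also assumes the reference and target models share a hidden dimension, and it uses synthetic Gaussian data in place of real hidden states. All function names are illustrative.

```python
import math
import random

def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def unit(v):
    """Scale v to unit length (returned unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refusal_direction(harmful, harmless):
    """Assumed estimator: normalized difference of class means for one layer."""
    mu_h, mu_s = mean_vec(harmful), mean_vec(harmless)
    return unit([a - b for a, b in zip(mu_h, mu_s)])

def separation(harmful, harmless, direction):
    """Gap between projected class means, in units of pooled standard deviation."""
    ph = [dot(h, direction) for h in harmful]
    ps = [dot(h, direction) for h in harmless]
    mh, ms = sum(ph) / len(ph), sum(ps) / len(ps)
    var = (sum((p - mh) ** 2 for p in ph) + sum((p - ms) ** 2 for p in ps))
    var /= len(ph) + len(ps)
    return (mh - ms) / math.sqrt(var + 1e-12)

def toy_score(ref_layers, tgt_layers, min_sep=2.0):
    """Hypothetical 0-100 score: mean clipped ratio of target to reference
    separation, taken over layers where the reference separates well."""
    ratios = []
    for (ref_h, ref_s), (tgt_h, tgt_s) in zip(ref_layers, tgt_layers):
        d = refusal_direction(ref_h, ref_s)
        ref_sep = separation(ref_h, ref_s, d)
        if ref_sep < min_sep:
            continue  # layer is not "stable" under this toy criterion
        tgt_sep = separation(tgt_h, tgt_s, d)
        ratios.append(max(0.0, min(1.0, tgt_sep / ref_sep)))
    return 100.0 * sum(ratios) / len(ratios) if ratios else 0.0

def make_layer(rng, shift, n=50, dim=8):
    """Synthetic (harmful, harmless) states; harmful is shifted along axis 0."""
    harmless = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    harmful = [[rng.gauss(0, 1) + (shift if j == 0 else 0.0) for j in range(dim)]
               for _ in range(n)]
    return harmful, harmless

# Three toy layers: the first carries no refusal signal, the other two do.
ref_rng, tgt_rng, abl_rng = random.Random(0), random.Random(1), random.Random(2)
reference = [make_layer(ref_rng, s) for s in (0.0, 4.0, 4.0)]
aligned = [make_layer(tgt_rng, s) for s in (0.0, 4.0, 4.0)]
ablated = [make_layer(abl_rng, 0.0) for _ in range(3)]  # refusal signal removed

self_score = toy_score(reference, reference)
aligned_score = toy_score(reference, aligned)
ablated_score = toy_score(reference, ablated)
print(self_score, aligned_score, ablated_score)
```

In this toy setup the ablated target scores far below the aligned one because its harmful and harmless states no longer separate along the reference direction. With real models, the hidden states would come from forward passes over harmful and harmless prompts, and the calibration would follow whatever the paper specifies.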