SafeVec
safevec-1d8b0053·1 events·first seen 3h agoAliases: SafeVec
Co-occurring entities
More like this (12)
Recent events (1)
SafeVec and RAS: White-box LLM safety evaluation via internal refusal representations
Researchers introduce SafeVec, a white-box safety evaluation procedure that measures LLM safety from internal hidden-state representations rather than generated outputs. The method extracts layer-wise refusal directions from a safety-aligned reference model, identifies stable layers where safe and unsafe behaviors are separable, and scores target models via a calibrated 0-100 Refusal Alignment Score (RAS). Evaluated across Llama, Gemma, and Qwen model families, RAS distinguishes aligned from uncensored/abliterated variants and correlates with output-level attack success rates while being substantially faster than judge-based evaluation. The approach addresses key limitations of output-level safety evals: cost, judge sensitivity, and dependence on fixed question banks.