
RAS: Measuring LLM Safety Through Refusal Alignment

paper · active · provisional · ras-measuring-llm-safety-through-refusal-alignment-5f0cd529 · 1 event · first seen 1h ago



Recent events (1)

arXiv · cs.CL · 1h ago

SafeVec and RAS: White-box LLM safety evaluation via internal refusal representations

Researchers introduce SafeVec, a white-box safety evaluation procedure that measures LLM safety from internal hidden-state representations rather than generated outputs. The method extracts layer-wise refusal directions from a safety-aligned reference model, identifies stable layers where safe and unsafe behaviors are separable, and scores target models via a calibrated 0-100 Refusal Alignment Score (RAS). Evaluated across Llama, Gemma, and Qwen model families, RAS distinguishes aligned from uncensored/abliterated variants and correlates with output-level attack success rates while being substantially faster than judge-based evaluation. The approach addresses key limitations of output-level safety evals: cost, judge sensitivity, and dependence on fixed question banks.
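The pipeline summarized above (extract per-layer refusal directions, keep the layers where safe and unsafe prompts separate, score a target model on a 0-100 scale) can be illustrated with a minimal sketch. This is not the paper's implementation: the difference-of-means direction, the separation threshold of 1.5, the ratio-to-reference calibration, and all function names are assumptions chosen for illustration, and synthetic arrays stand in for real hidden states. It also assumes the target model shares the reference model's layer count and hidden size.

```python
import numpy as np

def refusal_directions(h_unsafe, h_safe):
    """Per-layer unit vector: mean(unsafe) - mean(safe).
    Inputs have shape (n_prompts, n_layers, d_model)."""
    diff = h_unsafe.mean(axis=0) - h_safe.mean(axis=0)  # (n_layers, d_model)
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)

def layer_separation(h_unsafe, h_safe, dirs):
    """Per-layer gap between projected class means, in pooled-std units."""
    p_u = np.einsum("nld,ld->nl", h_unsafe, dirs)
    p_s = np.einsum("nld,ld->nl", h_safe, dirs)
    pooled = np.sqrt(0.5 * (p_u.var(axis=0) + p_s.var(axis=0))) + 1e-8
    return (p_u.mean(axis=0) - p_s.mean(axis=0)) / pooled

def ras_score(h_unsafe, h_safe, dirs, layers, ref_sep):
    """Hypothetical calibration: target separation as a fraction of the
    reference separation on the selected layers, clipped to [0, 1], times 100."""
    sep = layer_separation(h_unsafe, h_safe, dirs)[layers]
    ratio = np.clip(sep / ref_sep[layers], 0.0, 1.0)
    return float(100.0 * ratio.mean())

# Synthetic stand-in for hidden states (assumption: no real model is loaded).
rng = np.random.default_rng(0)
n, n_layers, d = 200, 6, 32
true_dir = rng.normal(size=(n_layers, d))
true_dir /= np.linalg.norm(true_dir, axis=1, keepdims=True)
strength = np.array([0.0, 0.0, 3.0, 3.0, 3.0, 0.0])  # only middle layers separate

def fake_states(scale):
    """Draw (unsafe, safe) states; `scale` controls how much refusal signal remains."""
    safe = rng.normal(size=(n, n_layers, d))
    shift = scale * strength[None, :, None] * true_dir[None, :, :]
    unsafe = rng.normal(size=(n, n_layers, d)) + shift
    return unsafe, safe

# 1) Directions and per-layer separation from the reference model.
ref_u, ref_s = fake_states(1.0)
dirs = refusal_directions(ref_u, ref_s)
ref_sep = layer_separation(ref_u, ref_s, dirs)

# 2) Keep layers where the two classes are clearly separable (threshold is arbitrary).
stable = np.where(ref_sep > 1.5)[0]

# 3) Score two hypothetical targets: one aligned, one with the signal removed.
aligned_score = ras_score(*fake_states(1.0), dirs, stable, ref_sep)
ablated_score = ras_score(*fake_states(0.0), dirs, stable, ref_sep)
print("stable layers:", stable.tolist())
print("aligned:", round(aligned_score, 1), "ablated:", round(ablated_score, 1))
```

In a real setting the arrays would come from hidden states captured on matched safe and unsafe prompts. Because this scoring is a handful of projections per prompt rather than full generation plus a judge model, it also suggests why a representation-level score can be much cheaper than output-level evaluation.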