paper

RAS: Measuring LLM Safety Through Refusal Alignment

paper · active · provisional · ras-measuring-llm-safety-through-refusal-alignment-5f0cd529 · 1 event · first seen 1h ago

More like this (12)

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals
MedRLM
LLM Safety Leaderboard
AI Safety Level (ASL)
SafeRL-Lab
Measuring Epistemic Resilience of LLMs Under Misleading Medical Context
ExpRL: Exploratory RL for LLM Mid-Training
Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond
Clinically Grounded Privacy Evaluation of Medical LMs
Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment
RL²

Recent events (1)

6 · arXiv · cs.CL · 1h ago

SafeVec and RAS: White-box LLM safety evaluation via internal refusal representations

Researchers introduce SafeVec, a white-box safety evaluation procedure that measures LLM safety from internal hidden-state representations rather than from generated outputs. The method extracts layer-wise refusal directions from a safety-aligned reference model, identifies stable layers where safe and unsafe behaviors are separable, and scores target models with a calibrated 0-100 Refusal Alignment Score (RAS). Evaluated across the Llama, Gemma, and Qwen model families, RAS distinguishes aligned models from uncensored or abliterated variants and correlates with output-level attack success rates, while running substantially faster than judge-based evaluation. The approach addresses key limitations of output-level safety evaluations: cost, sensitivity to the choice of judge, and dependence on fixed question banks.
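The three steps in the summary (extract per-layer refusal directions, keep the layers where the reference model separates the two prompt classes, score a target model on a 0-100 scale) can be illustrated with a toy sketch. This is not the SafeVec implementation: the summary does not say how directions are extracted, how stable layers are chosen, or how the score is calibrated. The sketch assumes a difference-of-means direction, a simple gap threshold (`min_gap`) for layer selection, and a clipped ratio against the reference model as the calibration. All function names are hypothetical, and the "hidden states" are synthetic Gaussian vectors rather than real model activations.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def refusal_direction(harmful, harmless):
    """Unit difference-of-means direction (assumed extraction method)."""
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def projection_gap(harmful, harmless, direction):
    """Mean projection of harmful states minus that of harmless states."""
    return dot(mean_vector(harmful), direction) - dot(mean_vector(harmless), direction)


def toy_ras(reference_layers, target_layers, min_gap=1.0):
    """Toy 0-100 score; each layer is a (harmful, harmless) pair of state lists.

    Assumptions: layers are kept only if the reference gap reaches min_gap,
    and the score is the mean clipped ratio of target gap to reference gap.
    """
    ratios = []
    for (ref_h, ref_b), (tgt_h, tgt_b) in zip(reference_layers, target_layers):
        direction = refusal_direction(ref_h, ref_b)
        ref_gap = projection_gap(ref_h, ref_b, direction)
        if ref_gap < min_gap:
            continue  # reference model does not separate the classes here
        tgt_gap = projection_gap(tgt_h, tgt_b, direction)
        ratios.append(min(1.0, max(0.0, tgt_gap / ref_gap)))
    if not ratios:
        raise ValueError("no layer passed the separability threshold")
    return 100.0 * sum(ratios) / len(ratios)


def synthetic_layer(rng, shift, n=100, dim=8):
    """Synthetic states: harmful prompts are shifted along axis 0 by `shift`."""
    harmless = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    harmful = [
        [rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(dim)]
        for _ in range(n)
    ]
    return harmful, harmless


rng = random.Random(0)
shifts = [0.0, 4.0, 4.0]  # layer 0 carries no refusal signal in this toy setup
reference = [synthetic_layer(rng, s) for s in shifts]
aligned = [synthetic_layer(rng, s) for s in shifts]
ablated = [synthetic_layer(rng, 0.0) for _ in shifts]

aligned_score = toy_ras(reference, aligned)
ablated_score = toy_ras(reference, ablated)
print(f"aligned target: {aligned_score:.1f}")
print(f"ablated target: {ablated_score:.1f}")
```

In this toy setup the target that preserves the shift along the reference direction scores near the top of the range, and the target with the shift removed scores near the bottom. Because scoring only needs hidden states from a forward pass, no text generation or judge model is involved, which is consistent with the speed advantage the summary reports.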

Evaluation and Benchmarking · AI Safety Research · SafeVec · Gemma · RAS: Measuring LLM Safety Through Refusal Alignment