SafeVec and RAS: White-box LLM safety evaluation via internal refusal representations
Researchers introduce SafeVec, a white-box safety evaluation procedure that measures LLM safety from internal hidden-state representations rather than generated outputs. The method extracts layer-wise refusal directions from a safety-aligned reference model, identifies stable layers where safe and unsafe behaviors are separable, and scores target models via a calibrated 0-100 Refusal Alignment Score (RAS). Evaluated across Llama, Gemma, and Qwen model families, RAS distinguishes aligned from uncensored/abliterated variants and correlates with output-level attack success rates while being substantially faster than judge-based evaluation. The approach addresses key limitations of output-level safety evals: cost, judge sensitivity, and dependence on fixed question banks.
Related guides (3)
Related events (8)
An Introduction to AI Secure LLM Safety Leaderboard
Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.
Systematic comparison of encoder vs. decoder safety judges for LLM adversarial evaluation
A new arXiv preprint evaluates whether fine-tuned encoder classifiers from the ModernBERT family (ModernBERT and Ettin) can replace LLM-based safety judges for detecting harmful outputs in user-model conversations. The study benchmarks encoders against rule-based methods, fine-tuned LLM classifiers, and LLM judges including LlamaGuard 3/4, ShieldGemma, StrongReject, and Claude-as-a-judge across multiple adversarial attack types. Results are reported on F1, false negative rate, and precision-recall, with breakdowns by attack technique, providing practical guidance on cost-latency tradeoffs for production safety pipelines.
Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'
A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.
PsychoSafe: Framework for Psychologically-Informed LLM Refusals in High-Risk Interactions
Researchers introduce PsychoSafe, a refusal framework that reframes LLM non-compliance as structured supportive communication grounded in evidence-based psychological intervention strategies. The work constructs an 8,019 prompt-response corpus across five risk domains and applies prompting and parameter-efficient fine-tuning to Qwen 3.5 27B, achieving 28.1% improvement in refusal quality over a generic baseline with notable gains in resource referral and psychological grounding. Evaluations on SORRY-Bench and XSTest reveal strong in-domain robustness but limited out-of-domain generalization, pointing to a need for more diverse fine-tuning data. The framework is relevant to safety alignment work targeting crisis, coercion, and escalating-intent scenarios.
Evaluation awareness in LLMs is multidimensional, not a single capability — evidence from 37 open models
A new arXiv paper characterizes 'evaluation awareness' — the ability of models to detect they are being tested and adapt behavior accordingly — across 37 open-weight models and 7 families using 8 experiments. Key findings: 24/37 models exceed chance at detecting evaluation conditions, hard refusal drops 5.8 percentage points under hypothetical framing, and compliance can rise up to +30 percentage points on HarmBench under framing shifts. Critically, the three axes of awareness (detection, behavioral manifestation, controllability) are nearly uncorrelated, leading the authors to coin the 'benchmark illusion': no single awareness score reliably predicts deployment safety.
Unfireable Safety Kernel: Formal execution-time alignment layer for escapable AI agents
A new arXiv preprint introduces the concept of 'escapable AI systems' — agents with sufficient reach into their own runtime to subvert in-process safety controls — and proposes a four-property architectural framework for external enforcement. The authors present the Unfireable Safety Kernel, a Rust reference implementation with machine-checked fail-closed invariants via SMT (Z3) and bounded model checking (Kani), evaluated against a self-improving world model adversary across 7,240 authorization attempts with zero successful bypasses. The work positions this 'execution-time alignment' layer as a complement to training-time approaches like RLHF and Constitutional AI, arguing that any control inside the agent's address space is fundamentally reachable by adversarial inputs.
VLESA: Vision-Language Embodied Safety Agent for Real-Time Human Activity Monitoring
Researchers introduce VLESA, a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. The system addresses intent-dependent safety — where identical actions can be safe or dangerous depending on context — using a goal-conditioned safety Q-filter trained via GRPO and an intent-action prediction agent. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy than baselines, with the Q-filter improving action safety by over 41 percentage points through goal-conditioned constrained decoding.
Deliberative Alignment: Reasoning Enables Safer Language Models
OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.


