Almanac
← Events
5arXiv cs.CL (Computation and Language)·12h ago

Online Safety Monitoring for LLMs via Threshold-Based Risk Control

A new arXiv preprint proposes a real-time safety monitor for LLMs that converts an external verifier signal into an alarm by thresholding, with the threshold calibrated via risk control. The authors evaluate the approach on mathematical reasoning and red-teaming datasets, finding it competitive with more complex sequential hypothesis testing monitors. The work addresses the practical deployment problem of detecting unsafe outputs after alignment training.

Related guides (2)

Related events (8)

6arXiv · cs.CL·4d ago·source ↗

Mechanism-driven internal monitors detect LLM training instability thousands of steps before loss divergence

A new arXiv preprint proposes mechanism-driven monitoring signals derived from the functional roles of critical modules (low-precision flash attention, MoE routers) to detect training instability before it manifests in loss or gradient norms. The authors derive monitors such as spectral entropy of a QK bilinear decomposition and MoE router indicators, showing via fault-injection experiments that these signals trigger thousands of steps ahead of loss divergence. The work targets a high-cost failure mode in frontier LLM training where instability can persist undetected for thousands of steps on expensive accelerator fleets.

6arXiv · cs.CL·1mo ago·source ↗

HarmAmp Benchmark and TrajSafe Monitor for Multi-Turn Harm Amplification in LLMs

This paper introduces HarmAmp, a benchmark covering twelve risk categories designed to evaluate how LLMs compound harm across multi-turn conversations, addressing two threat vectors: democratizing specialized harmful expertise and scaling harmful operations. The authors also propose TrajSafe, a proactive monitoring system that anticipates harmful conversational trajectories and intervenes by probing user intent or steering toward safer outputs. Experiments show TrajSafe reduces multi-turn harmfulness while maintaining low over-refusal rates and preserving general model capabilities. The work highlights a gap in existing safety research that focuses on single-turn evaluations rather than extended interaction dynamics.

6arXiv · cs.CL·8d ago·source ↗

SafeVec and RAS: White-box LLM safety evaluation via internal refusal representations

Researchers introduce SafeVec, a white-box safety evaluation procedure that measures LLM safety from internal hidden-state representations rather than generated outputs. The method extracts layer-wise refusal directions from a safety-aligned reference model, identifies stable layers where safe and unsafe behaviors are separable, and scores target models via a calibrated 0-100 Refusal Alignment Score (RAS). Evaluated across Llama, Gemma, and Qwen model families, RAS distinguishes aligned from uncensored/abliterated variants and correlates with output-level attack success rates while being substantially faster than judge-based evaluation. The approach addresses key limitations of output-level safety evals: cost, judge sensitivity, and dependence on fixed question banks.

5Hugging Face Blog·1mo ago·source ↗

An Introduction to AI Secure LLM Safety Leaderboard

Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.

5arXiv · cs.AI·3d ago·source ↗

EvalSafetyGap: Conceptual framework linking LLM evaluation failures to safety measurement gaps

A new arXiv preprint introduces EvalSafetyGap, a hybrid survey and conceptual framework arguing that benchmark scores, reward-model signals, and safety metrics can improve while the underlying properties they measure remain unverified. The paper synthesizes eight evidence streams spanning 2018–2026 and introduces two analytical constructs — an Instability Decomposition and an Alignment Trilemma — to structure comparisons between evaluation-side and alignment-side proxy failures under optimization pressure. A ten-model audit finds no statistically significant association between capability and adversarial robustness, and suggests the apparent open-versus-closed-model safety gap is driven more by governance and disclosure practices than behavioral robustness. The work proposes a shared vocabulary for dynamic evaluation, multi-attempt safety measurement, and auditable alignment practice.

6arXiv · cs.CL·10d ago·source ↗

LLMs fail to reliably self-report adversarial prefill attacks, study finds

A new arXiv paper evaluates whether LLMs can recognize that their own prior responses were elicited by adversarial prefill attacks, testing ten open-weight models (3B–70B) across four safety benchmarks. Models claim intent on prefilled responses only 27.3% of the time on average, and introspective signal is largely mediated by refusal-related reasoning. Three LoRA fine-tuning methods (SFT, GRPO, DPO) improve the intention-probe gap but counterintuitively raise attack success rates on most models, suggesting partial and fragile mitigation. The findings raise concerns about the reliability of LLM self-reports in safety-critical contexts.

4arXiv · cs.CL·3d ago·source ↗

Uncertainty-aware decision-making algorithms for LLMs using Bayesian and risk-averse methods

A new arXiv preprint proposes and evaluates uncertainty-aware decision-making algorithms for LLMs grounded in Bayesian decision theory and risk-averse decision making, applied to tutoring and automatic peer review tasks. The authors incorporate conformal prediction to provide formal guarantees over strategy and score outputs. Empirical results show Bayesian methods outperform risk-averse rules, which can degrade to generic outputs under high ambiguity. The work highlights a gap in decision-making algorithm research relative to model training advances.

5arXiv · cs.CL·17d ago·source ↗

Semi-supervised framework scales LLM reasoning with minimal labeled data via lightweight verifier

A new arXiv preprint proposes a semi-supervised framework for training LLMs to reason with very few labeled examples, using a lightweight classifier to judge the validity of intermediate reasoning traces. An entropy-based confidence threshold filters unreliable pseudo-labels before fine-tuning. Experiments on math reasoning (Orca-Math subset) and visual QA (GQA) show accuracy comparable to using 10-15x more labeled data. The approach reduces dependence on expensive answer-level supervision by turning verification into a data-creation mechanism.