paper

Online Safety Monitoring for LLMs

paper · active · provisional · online-safety-monitoring-for-llms-19fe32b3 · 1 event · first seen 12h ago

Aliases: Online Safety Monitoring for LLMs

More like this (12)

LLM-as-monitor
Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond
LLM Safety Leaderboard
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues
open-source LLMs
Clinically Grounded Privacy Evaluation of Medical LMs
LLMScan
electronic warfare LLM
frontier LLMs
Multi-Agentic System Leveraging Open-Source LLMs to Mitigate Disinformation Threats
Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research

Recent events (1)

5 · arXiv · cs.CL · 12h ago

Online Safety Monitoring for LLMs via Threshold-Based Risk Control

A new arXiv preprint proposes a real-time safety monitor for LLMs that converts an external verifier signal into an alarm by thresholding, with the threshold calibrated via risk control. The authors evaluate the approach on mathematical reasoning and red-teaming datasets and report that it is competitive with more complex monitors based on sequential hypothesis testing. The work addresses a practical deployment problem: detecting unsafe outputs from models that have already undergone alignment training.
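
The summary above does not spell out the calibration procedure, so the following is only a minimal sketch of what threshold-based monitoring with a risk-controlled threshold could look like, not the authors' method. It assumes a verifier that emits a risk score per step (higher means riskier), a calibration set of scores from outputs known to be unsafe, and a finite-sample correction in the style of conformal risk control. The function names `calibrate_threshold` and `first_alarm` and the target miss rate `alpha` are hypothetical.

```python
import math
from typing import Optional, Sequence


def calibrate_threshold(unsafe_scores: Sequence[float], alpha: float) -> float:
    """Pick the largest alarm threshold whose corrected miss rate is <= alpha.

    Assumptions (not taken from the paper):
    - unsafe_scores holds one verifier score per known-unsafe calibration output.
    - The monitor alarms when score >= threshold, so a miss is score < threshold.
    - The corrected risk is (misses + 1) / (n + 1), a conformal-risk-control
      style bound for a loss bounded by 1.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(unsafe_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")

    # Largest number of misses m satisfying (m + 1) / (n + 1) <= alpha.
    max_misses = math.floor(alpha * (n + 1) - 1)
    if max_misses < 0:
        # Too little calibration data for this alpha: always raise the alarm.
        return float("-inf")

    # With the threshold set to the (max_misses)-th smallest score (0-indexed),
    # at most max_misses calibration scores fall strictly below it.
    return sorted(unsafe_scores)[max_misses]


def first_alarm(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first step whose score reaches the threshold."""
    for step, score in enumerate(scores):
        if score >= threshold:
            return step
    return None


if __name__ == "__main__":
    calibration = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.6]
    threshold = calibrate_threshold(calibration, alpha=0.25)
    print("threshold:", threshold)
    print("alarm at step:", first_alarm([0.1, 0.2, 0.35, 0.9], threshold))
```

This sketch controls only the rate of missed unsafe outputs on data resembling the calibration set; it says nothing about false alarms on safe outputs, which a real deployment would need to measure separately.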

Evaluation and Benchmarking · AI Safety Research · Online Safety Monitoring for LLMs