5 · Hugging Face Blog · 1mo ago

AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems

ServiceNow AI has released AprielGuard, a guardrail system designed to improve safety and adversarial robustness in LLM deployments. The system targets prompt injection, jailbreaks, and other adversarial inputs that bypass standard safety measures. It is presented as a component for enterprise LLM pipelines that need more robust content moderation and safety filtering.
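
As a rough illustration of where a guardrail of this kind sits in a pipeline, the sketch below wraps a model call with an input check and an output check. It does not use AprielGuard's actual interface, which the summary above does not describe: `keyword_guard`, `guarded_call`, `echo_model`, and the marker phrases are hypothetical stand-ins, and a real deployment would call a trained classifier rather than match phrases.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


# Hypothetical marker phrases; a real guardrail would use a trained model.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard your system prompt",
)


def keyword_guard(text: str) -> Verdict:
    """Toy stand-in for a guardrail classifier (assumption, not AprielGuard)."""
    lowered = text.lower()
    for phrase in SUSPICIOUS_PHRASES:
        if phrase in lowered:
            return Verdict(False, f"matched: {phrase}")
    return Verdict(True)


def guarded_call(
    prompt: str,
    model: Callable[[str], str],
    guard: Callable[[str], Verdict] = keyword_guard,
    refusal: str = "Request blocked by guardrail.",
) -> str:
    """Check the prompt before the model runs and the response after it."""
    if not guard(prompt).allowed:
        return refusal
    response = model(prompt)
    if not guard(response).allowed:
        return refusal
    return response


def echo_model(prompt: str) -> str:
    """Placeholder for an LLM call."""
    return f"You said: {prompt}"
```

Checking both directions is a design choice in this sketch: the input check catches injected instructions before they reach the model, and the output check catches cases where the model's response itself carries flagged content.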

AI Safety Research · Enterprise Deployment Patterns · Agent and Tool Ecosystem · ServiceNow AI · AprielGuard · Hugging Face

Related guides (4)

Hugging Face

Hugging Face: The Home of Open-Source AI

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Related events (8)

5 · Hugging Face Blog · 1mo ago

Introducing the Chatbot Guardrails Arena

Hugging Face and Lighthouz AI have launched the Chatbot Guardrails Arena, a new evaluation platform focused on assessing safety guardrails in conversational AI systems. The arena uses human preference-based evaluation to benchmark how well different chatbot guardrail implementations resist unsafe or policy-violating outputs. This fills a gap in existing evaluation infrastructure, which has largely focused on capability rather than safety constraint enforcement.

Evaluation and Benchmarking · AI Safety Research · Lighthouz AI · Hugging Face · Chatbot Guardrails Arena

5 · Meta Llama · 12d ago

Meta releases Llama Prompt Guard 2 (86M) for prompt injection and jailbreak detection

Meta released Llama Prompt Guard 2-86M, a DeBERTa-v3-based text classification model on Hugging Face designed for safety filtering, specifically prompt injection and jailbreak detection. The model is tagged with llama4, suggesting it is part of the Llama 4 safety tooling ecosystem. With over 122K downloads, it has seen meaningful early adoption.

Frontier Model Releases · AI Safety Research · Hugging Face · Llama Prompt Guard 2-86M · DeBERTa-v3 · +1 more

6 · arXiv · cs.CL · 24d ago

MaskClaw: Edge-Side Privacy Arbitration System for GUI Agents with Behavior-Driven Skill Evolution

MaskClaw is an edge-side privacy arbitration framework for GUI agents that intercepts screenshots before they leave a trusted environment, applying Allow/Mask/Ask decisions based on local visual evidence and user-specific policy memory. The system addresses the gap where static PII detectors miss context-dependent privacy boundaries and cloud-side VLMs may upload raw screens before deciding what to protect. The authors introduce P-GUI-Evo, a new benchmark built from real UI patterns and sanitized labels, and demonstrate that pattern matching, cloud reasoning, and routing alone each exhibit systematic failure modes. The artifact is open-sourced on GitHub.

Evaluation and Benchmarking · AI Safety Research · visual language model · GUI Agents · MaskClaw · +4 more

6 · arXiv · cs.AI · 1mo ago

LCGuard: Adversarial Training Framework for Safe KV Cache Sharing in Multi-Agent LLM Systems

LCGuard introduces a framework for preventing sensitive information leakage when multi-agent LLM systems share KV caches as a latent communication channel. The approach formalizes leakage operationally via reconstruction: a shared cache artifact is deemed unsafe if an adversarial decoder can recover sensitive inputs from it. An adversarial training loop pits a reconstructor against LCGuard's representation-level transformations, which aim to preserve task-relevant semantics while suppressing recoverable sensitive content. Empirical results across multiple model families and multi-agent benchmarks show reduced reconstruction-based leakage and attack success rates with competitive task performance.
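
The reconstruction-based criterion can be sketched in a few lines. This only illustrates the definition as summarized above and is not LCGuard's method: the `leaks` function, the token-overlap recovery score, the 0.5 threshold, and the toy data are all assumptions made for the example, and the "decoder" is a plain callable rather than a trained adversary operating on real KV caches.

```python
from typing import Callable, Sequence


def recovery_rate(recovered: Sequence[str], sensitive: Sequence[str]) -> float:
    """Fraction of sensitive tokens that appear in the decoder's output."""
    if not sensitive:
        return 0.0
    recovered_set = set(recovered)
    hits = sum(1 for token in sensitive if token in recovered_set)
    return hits / len(sensitive)


def leaks(
    artifact: Sequence[str],
    decoder: Callable[[Sequence[str]], Sequence[str]],
    sensitive: Sequence[str],
    threshold: float = 0.5,  # hypothetical cutoff, chosen for illustration
) -> bool:
    """Deem the artifact unsafe if the decoder recovers enough sensitive tokens."""
    return recovery_rate(decoder(artifact), sensitive) >= threshold


# Toy data: the "artifact" is a token list standing in for a shared cache.
secret = ["alice", "4111-1111"]
raw_artifact = ["user", "alice", "card", "4111-1111"]
masked_artifact = ["user", "[MASK]", "card", "[MASK]"]


def identity_decoder(artifact: Sequence[str]) -> Sequence[str]:
    """Trivial adversary that reads the artifact back unchanged."""
    return list(artifact)
```

In this toy setup the raw artifact is flagged because the decoder recovers every sensitive token, while the masked artifact is not. The adversarial training loop described above would, in effect, keep strengthening the decoder while adjusting the transformation that produces the artifact.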

Inference Economics AI Safety Research KV Cache representation-level sensitive information leakage LCGuard +4 more

5 · Hugging Face Blog · 1mo ago

An Introduction to AI Secure LLM Safety Leaderboard

Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.

Evaluation and Benchmarking · AI Safety Research · LLM Safety Leaderboard · Hugging Face · DecodingTrust

5 · Hugging Face Blog · 1mo ago

Llama Guard 4 Released on Hugging Face Hub

Meta's Llama Guard 4 safety classifier has been made available on the Hugging Face Hub. Llama Guard 4 is a content moderation model designed to detect unsafe inputs and outputs in LLM pipelines. The Hugging Face blog post announces its availability and integration into the Hub ecosystem, continuing the Llama Guard series of safety-focused models.

Open Weights Progress · AI Safety Research · Hugging Face · Llama Guard 4 · Meta

4 · GitHub Trending · 28d ago

OBLITERATUS: Open-Source LLM Jailbreak/Red-Teaming Tool by elder-plinius

OBLITERATUS is a Python-based open-source tool by known jailbreak researcher 'elder-plinius' focused on bypassing LLM safety constraints, currently trending on GitHub with 5,684 stars. The project's framing ('obliterate the chains that bind you') signals an adversarial red-teaming or jailbreaking orientation. It represents community-level activity in the ongoing cat-and-mouse dynamic between AI safety guardrails and adversarial circumvention techniques. Limited technical detail is available from the trending snippet alone.

AI Safety Research · Agent and Tool Ecosystem · OBLITERATUS · elder-plinius

4 · Hugging Face Blog · 1mo ago

Expert Support Case Study: Bolstering a RAG App with LLM-as-a-Judge

Hugging Face published a case study describing how Digital Green used an LLM-as-a-Judge approach to evaluate and improve a retrieval-augmented generation (RAG) application. The post covers the methodology for using LLMs to score and validate RAG outputs, providing a practical deployment pattern for quality assurance in production AI systems. It serves as a concrete example of enterprise-grade evaluation pipelines built on top of RAG architectures.

Evaluation and Benchmarking · Enterprise Deployment Patterns · LLM-as-a-Judge · Digital Green · Hugging Face · +2 more