Introducing the Red-Teaming Resistance Leaderboard
Hugging Face and Haize Labs have launched a Red-Teaming Resistance Leaderboard to systematically benchmark how well AI models resist adversarial prompting and jailbreak attempts. The leaderboard provides a standardized evaluation framework for comparing model robustness against red-teaming attacks. This fills a gap in the evaluation ecosystem where safety and adversarial robustness metrics have been less formalized than capability benchmarks.
Related events (8)
Red-Teaming Large Language Models
This Hugging Face blog post introduces red-teaming as a safety evaluation methodology for large language models, explaining how adversarial testing can surface harmful outputs, biases, and failure modes before deployment. It covers techniques for systematically probing LLMs to elicit problematic behaviors and discusses the role of red-teaming in responsible AI development. The post serves as an educational overview aimed at practitioners working on LLM safety.
Advancing Red Teaming with People and AI
OpenAI published a blog post describing advances in its red teaming methodology, which combines human red teamers with AI-assisted approaches. The post outlines how AI tools are being integrated into the red teaming pipeline to improve the coverage and efficiency of safety evaluations. This represents an evolution in OpenAI's pre-deployment safety testing practices.
OpenAI Red Teaming Network
OpenAI is launching an open call for a Red Teaming Network, inviting domain experts to participate in ongoing safety evaluations of its models. The initiative aims to build a structured community of external red teamers who can help identify risks and failure modes across OpenAI's model releases. This represents a formalization of OpenAI's external adversarial testing program beyond one-off pre-release red teaming exercises.
An Introduction to AI Secure LLM Safety Leaderboard
Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.
Community Evals: Because we're done trusting black-box leaderboards over the community
Hugging Face introduces Community Evals, a framework aimed at replacing or supplementing black-box leaderboards with community-driven model evaluations. The initiative reflects growing skepticism about the reliability and transparency of existing benchmark leaderboards. By crowdsourcing evaluations, Hugging Face seeks to make model assessment more transparent, diverse, and resistant to gaming. This represents a structural shift in how the open-source AI community approaches model comparison and trust.
The Open Agent Leaderboard
IBM Research and Hugging Face have launched the Open Agent Leaderboard, a public benchmark for evaluating AI agents across standardized tasks. The leaderboard aims to provide transparent, reproducible comparisons of open and proprietary agent systems. This initiative addresses the growing need for rigorous evaluation infrastructure as the agent ecosystem matures.
Adding Benchmaxxer Repellant to the Open ASR Leaderboard
Hugging Face describes measures taken to prevent benchmark gaming ('benchmaxxing') on the Open ASR Leaderboard by introducing private or held-out evaluation data. The post addresses the integrity of automatic speech recognition benchmarks, where models may be overfitted or tuned specifically to public test sets. This is part of a broader effort to maintain meaningful leaderboard rankings as ASR model submissions increase.
Anthropic details red teaming methods and calls for standardized AI testing practices
Anthropic published a detailed overview of red teaming approaches used to test Claude and other AI systems, covering domain-specific expert testing, automated red teaming, multilingual/multicultural testing, and multimodal red teaming. The post documents empirical findings about when each method is appropriate, highlights partnerships with organizations such as Thorn, the Institute for Strategic Dialogue, and Singapore's IMDA, and closes with policy recommendations for building a standardized AI testing ecosystem. The piece is notable for its operational specificity and its explicit call for industry-wide standards to enable cross-system safety comparisons.


