4Hugging Face Blog·1mo ago

Evaluating Language Model Bias with 🤗 Evaluate

This Hugging Face blog post introduces tooling and methodology for evaluating bias in language models using the Evaluate library. It covers bias measurement approaches and how practitioners can apply them to assess fairness properties of LLMs. The post is oriented toward applied practitioners working with open-source models.

Evaluation and Benchmarking AI Safety Research Hugging Face Evaluate Hugging Face

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Read asBeginner In-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6Openai Blog·1mo ago·source ↗

Defining and Evaluating Political Bias in LLMs

OpenAI has published a post describing their methodology for evaluating political bias in ChatGPT, introducing new real-world testing approaches aimed at improving objectivity and reducing bias. The piece outlines how OpenAI defines political bias in the context of large language models and the evaluation frameworks they are developing to measure it. This represents OpenAI's public commitment to systematic bias measurement as a component of responsible deployment.

Evaluation and Benchmarking AI Safety Research political bias evaluation ChatGPT OpenAI +1 more

4Hugging Face Blog·1mo ago·source ↗

Very Large Language Models and How to Evaluate Them

This Hugging Face blog post from October 2022 discusses approaches to zero-shot evaluation of large language models hosted on the Hub. It covers methodologies for benchmarking LLMs without task-specific fine-tuning, addressing the practical challenges of evaluating very large models at scale. The post situates evaluation tooling within the broader ecosystem of open model hosting and assessment.

Evaluation and Benchmarking Open Weights Progress zero-shot evaluation Hugging Face

4Hugging Face Blog·1mo ago·source ↗

Ethics and Society Newsletter #4: Bias in Text-to-Image Models

Hugging Face's Ethics and Society team publishes their fourth newsletter focusing on bias in text-to-image generative models. The piece examines how these models encode and reproduce societal biases in visual outputs, likely covering evaluation methods, documented failure modes, and mitigation approaches. As a Tier 2 commentary piece from a major ML platform, it contributes to ongoing discourse around fairness and safety in multimodal AI systems.

Evaluation and Benchmarking AI Safety Research Hugging Face Ethics and Society Team text-to-image models Hugging Face +1 more

4Hugging Face Blog·1mo ago·source ↗

Red-Teaming Large Language Models

This Hugging Face blog post introduces red-teaming as a safety evaluation methodology for large language models, explaining how adversarial testing can surface harmful outputs, biases, and failure modes before deployment. It covers techniques for systematically probing LLMs to elicit problematic behaviors and discusses the role of red-teaming in responsible AI development. The post serves as an educational overview aimed at practitioners working on LLM safety.

Evaluation and Benchmarking AI Safety Research large language models Hugging Face red-teaming

5arXiv · cs.CL·47h ago·source ↗

StylisticBias benchmark reveals a small set of visual cues drives most social bias in MLLMs

Researchers introduce StylisticBias, a controlled benchmark of ~25K photorealistic face images with single-attribute variations designed to isolate how specific visual cues shift social judgments in multimodal LLMs. Evaluating six MLLMs across 25 binary social judgment scenarios, they find that age and body type dominate identity-level effects, while fashion style drives the largest attribute-level shifts, with ~15 attributes accounting for ~80% of total bias variation. The benchmark is released publicly on GitHub and Hugging Face, enabling fine-grained bias auditing of multimodal models.

Evaluation and Benchmarking AI Safety Research StylisticBias StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs +1 more

7arXiv · cs.CL·29d ago·source ↗

AMEL: Accumulated Message Effects Bias LLM Judgments in Multi-Turn Evaluation Pipelines

This paper introduces AMEL (Accumulated Message Effect on LLM Judgments), documenting that prior conversation history with predominantly positive or negative evaluations systematically biases subsequent LLM judgments toward the prevailing polarity. Across 75,898 API calls to 11 models from 4 providers, the effect is statistically robust (d = -0.17, p < 10^-46), concentrates on high-uncertainty items, and shows a negativity asymmetry where negative histories induce 1.62x more bias than positive ones. Critically, the bias does not grow with context length, scaling reduces but does not eliminate it, and the simplest mitigation is using a fresh context per evaluation item.

Evaluation and Benchmarking AI Safety Research Claude Opus 4.6 Google Claude Haiku 4.5 +7 more

5Hugging Face Blog·1mo ago·source ↗

Judge Arena: Benchmarking LLMs as Evaluators

Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, understanding their relative reliability and biases becomes critical infrastructure for the field.

Evaluation and Benchmarking Agent and Tool Ecosystem LLM-as-a-Judge Judge Arena Hugging Face +2 more

8Openai Blog·1mo ago·source ↗

Evaluating Large Language Models Trained on Code

OpenAI published research on evaluating large language models trained on code, introducing the Codex model and the HumanEval benchmark for assessing code generation capabilities. The work established foundational methodology for measuring functional correctness of code produced by LLMs using a pass@k metric. This paper became a landmark reference for code-focused LLM evaluation and influenced subsequent code generation research across the field.

Frontier Model Releases Evaluation and Benchmarking GPT-3 pass@k OpenAI +3 more