5arXiv cs.CL (Computation and Language)·47h ago

StylisticBias benchmark reveals a small set of visual cues drives most social bias in MLLMs

Researchers introduce StylisticBias, a controlled benchmark of ~25K photorealistic face images with single-attribute variations designed to isolate how specific visual cues shift social judgments in multimodal LLMs. Evaluating six MLLMs across 25 binary social judgment scenarios, they find that age and body type dominate identity-level effects, while fashion style drives the largest attribute-level shifts, with ~15 attributes accounting for ~80% of total bias variation. The benchmark is released publicly on GitHub and Hugging Face, enabling fine-grained bias auditing of multimodal models.

Evaluation and Benchmarking AI Safety Research Multimodal Progress StylisticBias StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Related guides (3)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.AI·18d ago·source ↗

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

This paper identifies and analyzes 'Perceptual Judgment Bias' in multimodal LLM judges, where models anchor on response text rather than visual evidence when the two conflict. The authors introduce a Perceptually Perturbed Judgment Dataset using counterfactual responses to isolate perceptual errors, and a training framework combining GRPO-based reward modeling with batch-ranking objectives. Experiments on MLLM-as-a-Judge benchmarks show improved perceptual fidelity, ranking coherence, and alignment with human evaluation.

Evaluation and Benchmarking Alignment and RLHF Perceptually Perturbed Judgment Dataset Multimodal Large Language Models GRPO +3 more

4Hugging Face Blog·1mo ago·source ↗

Ethics and Society Newsletter #4: Bias in Text-to-Image Models

Hugging Face's Ethics and Society team publishes their fourth newsletter focusing on bias in text-to-image generative models. The piece examines how these models encode and reproduce societal biases in visual outputs, likely covering evaluation methods, documented failure modes, and mitigation approaches. As a Tier 2 commentary piece from a major ML platform, it contributes to ongoing discourse around fairness and safety in multimodal AI systems.

Evaluation and Benchmarking AI Safety Research Hugging Face Ethics and Society Team text-to-image models Hugging Face +1 more

6arXiv · cs.CL·19d ago·source ↗

Vision-Language Models Suppress Female Representations Under Ambiguous Input

This paper investigates gender bias in vision-language models (VLMs) when inputs are ambiguous (e.g., workers in full gear or seen from behind), finding that models default to male outputs even for strongly female-stereotyped occupations. The authors introduce LALS (Latent Association Leaning Score), a zero-shot metric that probes internal visual-token activations to measure concept associations across layers. Across 15 occupations, 800+ ambiguous images, and four VLMs, they find a systematic decoupling: models internally encode female associations but suppress them before generation, with male signals amplifying end-to-end while female signals peak mid-network and are filtered out. Cultural visual cues like clothing color further modulate these internal associations.

Evaluation and Benchmarking AI Safety Research gender bias in VLMs Vision-Language Models visual-token activation probing +5 more

4Hugging Face Blog·1mo ago·source ↗

Evaluating Language Model Bias with 🤗 Evaluate

This Hugging Face blog post introduces tooling and methodology for evaluating bias in language models using the Evaluate library. It covers bias measurement approaches and how practitioners can apply them to assess fairness properties of LLMs. The post is oriented toward applied practitioners working with open-source models.

Evaluation and Benchmarking AI Safety Research Hugging Face Evaluate Hugging Face

5arXiv · cs.CL·15d ago·source ↗

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking AI Safety Research From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation Meta Llama-3.1-8B

4arXiv · cs.CL·24d ago·source ↗

C4STYLI Benchmark: Probing Cultural Aesthetic Stylistics Awareness in LLMs

Researchers introduce C4STYLI, a benchmark of stylized translated movie titles and advertising slogans from Hong Kong and mainland China, designed to evaluate LLMs on cross-cultural aesthetic stylistics. Evaluations reveal that LLMs diverge from human stylistic recognition, with recognition ability varying by text domain and not consistently predicting generation performance. Structural ablation using logistic regression probes shows that LLMs in the Hong Kong setting rely on surface-level linguistic cues rather than deeper stylistic structure, indicating limited cultural sensitivity.

Evaluation and Benchmarking C4STYLI large language models logistic regression probes +2 more

5arXiv · cs.CL·9d ago·source ↗

Situated Interaction Auditing: A user-centered framework for LLM bias research

Researchers propose Situated Interaction Auditing (SIA), a new framework for studying LLM bias from the perspective of the user rather than third-party demographic representation. The core insight is that bias can manifest in how a model treats its interlocutor — varying response quality, content, and tone based on implicit sociodemographic signals, writing style, or stated identity — rather than only in how it describes external groups. The paper demonstrates SIA through a case study intersecting gender and socioeconomic status signals across multiple task domains and outlines a research agenda for the approach.

Evaluation and Benchmarking AI Safety Research Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research Situated Interaction Auditing

7arXiv · cs.CL·29d ago·source ↗

AMEL: Accumulated Message Effects Bias LLM Judgments in Multi-Turn Evaluation Pipelines

This paper introduces AMEL (Accumulated Message Effect on LLM Judgments), documenting that prior conversation history with predominantly positive or negative evaluations systematically biases subsequent LLM judgments toward the prevailing polarity. Across 75,898 API calls to 11 models from 4 providers, the effect is statistically robust (d = -0.17, p < 10^-46), concentrates on high-uncertainty items, and shows a negativity asymmetry where negative histories induce 1.62x more bias than positive ones. Critically, the bias does not grow with context length, scaling reduces but does not eliminate it, and the simplest mitigation is using a fresh context per evaluation item.

Evaluation and Benchmarking AI Safety Research Claude Opus 4.6 Google Claude Haiku 4.5 +7 more