It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

paper · active · provisional · it-takes-one-to-bias-them-all-breaking-bad-with-one-shot-grpo-b53170b1 · 1 event · first seen 7d ago

Recent events (1)

7 · arXiv · cs.CL · 7d ago

One-shot GRPO training on a single biased example can break LLM alignment

A new arXiv paper shows that fine-tuning an aligned LLM with Group Relative Policy Optimization (GRPO) on a single biased training example is enough to induce systematic bias, with the resulting stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. The authors find that a model's susceptibility depends on how likely it was to produce biased outputs before training. The result points to a vulnerability in post-training alignment: a minimal fine-tuning intervention can override safety guardrails.
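
For readers unfamiliar with GRPO, the sketch below illustrates its group-relative advantage step: several completions are sampled for one prompt, and each reward is normalized against the mean and standard deviation of its own group. This is a generic illustration, not the paper's code. The function name, the group size of 4, and the 0/1 reward for "matches the biased target" are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    `eps` is a small constant (assumed here) that avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 completions sampled for the single training prompt.
# Reward 1.0 marks a completion that matches the biased target, 0.0 otherwise.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Hypothetical group in which no sampled completion matches the biased target.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed group:  ", mixed)
print("uniform group:", uniform)
```

In the mixed group, the one rewarded completion gets a positive advantage and the rest get negative ones, so the update pushes toward the rewarded behavior. In the uniform group every advantage is zero, so that group contributes no learning signal. One plausible reading, offered here as an interpretation rather than as the authors' analysis, is that this is consistent with susceptibility depending on how often a model samples biased outputs to begin with.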