It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO
Recent events (1)
One-shot GRPO training on a single biased example can break LLM alignment
A new arXiv paper demonstrates that one-shot training with Group Relative Policy Optimization (GRPO) on a single biased example is sufficient to induce systematic bias in aligned LLMs, and that the resulting stereotype-driven reasoning generalizes across attributes, categories, and benchmarks. The authors find that a model's susceptibility varies with its initial likelihood of producing biased outputs. The result exposes a critical vulnerability in post-training alignment: a minimal fine-tuning intervention can override safety guardrails.
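For readers unfamiliar with GRPO, the following is a minimal sketch of its group-relative advantage, assuming the commonly described formulation in which rewards are standardized within a group of completions sampled for one prompt. The function name, the reward values, and the use of the population standard deviation are illustrative assumptions; this is not the paper's code or reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions for the single training prompt:
# one completion earns the reward, three do not.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Hypothetical group in which every completion earns the same reward.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(uniform)
```

In this formulation, the rewarded completion gets a positive advantage and the others a negative one, while a group with identical rewards yields all-zero advantages and therefore no learning signal. That is one plausible reason the initial likelihood of producing biased outputs could matter, though the summary above does not state the authors' own explanation.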