paper

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

paper · active · provisional · it-takes-one-to-bias-them-all-breaking-bad-with-one-shot-grpo-b53170b1 · 1 event · first seen 7d ago

Aliases: It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

Co-occurring entities

GRPO (Group Relative Policy Optimization)

More like this (12)

leave-one-out baseline · GRPO · GRPO (Group Relative Policy Optimization) · OFA (One-For-All) · Benchmark Everything Everywhere All at Once · One-Shot Imitation Learning · Latent-Anchored GRPO · few-shot prompting · Best-of-N Sampling · N-GRPO · GSPO (Group Sequence Policy Optimization) · Fixed-Payload Poisoning

Recent events (1)

7 · arXiv · cs.CL · 7d ago

One-shot GRPO training on a single biased example can break LLM alignment

A new arXiv paper demonstrates that training on a single biased example with Group Relative Policy Optimization (GRPO) is sufficient to induce systematic bias in aligned LLMs, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. The authors find that a model's susceptibility varies with its initial likelihood of producing biased outputs. The result exposes a critical vulnerability in post-training alignment: a minimal fine-tuning intervention can override safety guardrails.
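For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage step as it is commonly described: several completions are sampled for one prompt, and each reward is standardized against the group's mean and standard deviation. This is a generic illustration, not the paper's training setup; the function name, the reward values, the `eps` constant, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style advantages).

    `rewards` holds one scalar reward per completion sampled for the same
    prompt. `eps` (an assumed constant) avoids division by zero when every
    completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of the single training prompt.
# Only the first completion matches the rewarded (biased) answer.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# If no sampled completion matches, all rewards are equal and every
# advantage is zero, so this group contributes no learning signal.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In this sketch the rewarded completion gets a positive advantage and the rest get negative ones, while a group with identical rewards yields all-zero advantages. That may be one reason a model's initial likelihood of producing the rewarded output could matter, though this reading is an inference from the summary rather than a claim taken from the paper.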

AI Safety Research · Alignment and RLHF · It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO · GRPO (Group Relative Policy Optimization)