AdvGRPO
technique · active · provisional
advgrpo-84697b44 · 1 event · first seen 8d ago
Aliases: AdvGRPO
Recent events (1)
AdvGRPO: Stable co-training framework for adaptive red teaming of language models
Researchers introduce AdvGRPO, a co-training framework that makes Group Relative Policy Optimization (GRPO) viable for joint attacker-defender optimization in LLM red teaming, addressing previously reported instability. The method combines dense multi-channel rewards with decoupled advantage normalization, and follows a curriculum that progresses from single-turn to multi-turn attacks before bootstrapping co-training. Co-trained defenders outperform baselines on safety benchmarks, and the attacks transfer across models.
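The summary does not say how the decoupled advantage normalization is computed. As a rough illustration only, the sketch below takes one plausible reading: each reward channel is z-scored within its sampling group on its own, and the per-channel advantages are then summed with weights, so a channel with a large numeric range cannot drown out the others. The function names, the channel names (`attack_success`, `fluency`), the weights, and the use of population standard deviation are all assumptions made for this example, not details taken from the AdvGRPO work.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(channel_rewards, weights=None, eps=1e-8):
    """Hypothetical decoupled normalization over several reward channels.

    channel_rewards maps a channel name to a list of rewards, one per
    sampled response in the group. Each channel is normalized separately,
    then the normalized values are combined as a weighted sum.
    """
    names = list(channel_rewards)
    if weights is None:
        weights = {name: 1.0 for name in names}
    group_size = len(channel_rewards[names[0]])
    combined = [0.0] * group_size
    for name in names:
        per_channel = group_relative_advantages(channel_rewards[name], eps)
        for i, advantage in enumerate(per_channel):
            combined[i] += weights[name] * advantage
    return combined


# Toy group of four sampled attacks with two made-up reward channels
# that live on very different numeric scales.
example_rewards = {
    "attack_success": [1.0, 0.0, 0.0, 1.0],
    "fluency": [10.0, 10.0, 20.0, 20.0],
}
example_advantages = decoupled_advantages(example_rewards)
```

Summing the raw rewards first and normalizing afterwards would let the `fluency` channel dominate simply because its values are larger. Normalizing per channel puts both signals on the same scale before they are combined.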