
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?





arXiv · cs.AI · 2d ago

Study characterizes how mixed compliance demonstrations drive jailbreaking in safety-aligned LLMs

Researchers investigate how language models interpret mixed in-context demonstrations that contain both benign and harmful compliance examples, testing three hypotheses about what drives harmful compliance. Across four models, they find that benign and harmful demonstrations are not interchangeable, that preference optimization is the critical training stage preventing benign demonstrations from increasing harmful compliance, and that demonstration ordering exhibits a strong recency bias. Rather than only showing that demonstration-based jailbreaking succeeds, the work characterizes mechanistically how models extract signals from demonstrations as a function of content, ordering, and training methodology.