What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

Recent events (1)

arXiv · cs.AI · 2d ago

Study characterizes how mixed compliance demonstrations drive jailbreaking in safety-aligned LLMs

Researchers investigate how language models interpret mixed in-context demonstrations containing both benign and harmful compliance examples, testing three hypotheses about what drives harmful compliance. Across four models, they find benign and harmful demonstrations are not interchangeable, that preference optimization is the critical training stage preventing benign demonstrations from increasing harmful compliance, and that demonstration ordering exhibits strong recency bias. The work moves beyond showing that demonstration-based jailbreaking works to mechanistically characterizing how models extract signals from demonstration content, ordering, and training methodology.
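The ordering finding can be made concrete with a small sketch. This is not the authors' code or protocol; it only illustrates, under stated assumptions, how one might enumerate orderings of benign and harmful demonstrations and score how late the harmful ones appear. The function names `orderings` and `mean_harmful_position` and the labels `"B"` and `"H"` are hypothetical. A recency-bias hypothesis would predict that harmful compliance rises with the position score.

```python
from itertools import permutations


def orderings(n_benign: int, n_harmful: int) -> list[tuple[str, ...]]:
    """Return every distinct ordering of 'B' (benign) and 'H' (harmful) labels.

    Labels stand in for demonstration slots only; no prompt text is built here.
    """
    labels = "B" * n_benign + "H" * n_harmful
    # permutations() repeats identical arrangements, so deduplicate with a set.
    return sorted(set(permutations(labels)))


def mean_harmful_position(order: tuple[str, ...]) -> float:
    """Average normalized position of harmful demos (0.0 = first, 1.0 = last)."""
    harmful_idx = [i for i, label in enumerate(order) if label == "H"]
    if not harmful_idx or len(order) < 2:
        raise ValueError("need at least one 'H' label and two demonstrations")
    return sum(harmful_idx) / (len(harmful_idx) * (len(order) - 1))


if __name__ == "__main__":
    # List each ordering of two benign and one harmful demonstration,
    # from harmful-first to harmful-last.
    for order in sorted(orderings(2, 1), key=mean_harmful_position):
        print("".join(order), mean_harmful_position(order))
```

Pairing each ordering's score with a measured compliance rate (not provided here) would give a simple way to check whether later harmful demonstrations carry more weight.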
