paper

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

paper · active · provisional · can-llms-reliably-self-report-adversarial-prefills-and-how--afaa79bd · 1 event · first seen 36h ago


Co-occurring entities

GRPO DPO SFT

More like this (12)

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond
Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback
Latent Adversarial Robustification (LAR)
LLM Pretraining
Measuring Epistemic Resilience of LLMs Under Misleading Medical Context
Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
Learning from the Self-future: On-policy Self-distillation for dLLMs
How reliable are LLMs when it comes to playing dice?
How reliable are LLMs when it comes to playing dice?
Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?
Flaws in the LLM Automation Narrative

Recent events (1)

6 · arXiv · cs.CL · 36h ago

LLMs fail to reliably self-report adversarial prefill attacks, study finds

A new arXiv paper evaluates whether LLMs can recognize that their own prior responses were elicited by adversarial prefill attacks, testing ten open-weight models (3B–70B) across four safety benchmarks. On average, models claim intent on prefilled responses only 27.3% of the time, and the introspective signal is largely mediated by refusal-related reasoning. Three LoRA fine-tuning methods (SFT, GRPO, DPO) improve the intention-probe gap but counterintuitively raise attack success rates on most models, which suggests the mitigation is partial and fragile. The findings raise concerns about the reliability of LLM self-reports in safety-critical contexts.

Evaluation and Benchmarking · AI Safety Research · GRPO · DPO · Can LLMs Reliably Self-Report Adversarial Prefills, and How?