5Simon Willison's Weblog·10d ago

Simon Willison frames prompt injection as a role confusion problem

Simon Willison publishes a commentary piece reframing prompt injection attacks as fundamentally a problem of role confusion in LLM systems, where models fail to distinguish between trusted instructions and untrusted data. The piece offers a conceptual lens for understanding why prompt injection is structurally difficult to solve. This framing has implications for how developers and researchers approach mitigations.

AI Safety Research Agent and Tool Ecosystem prompt injection Simon Willison

Related guides (4)

prompt injectionConcept

Prompt Injection: The Security Threat Hiding in Plain Text

Read asBeginner In-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Principles to Geopolitical Flashpoint

Read asIn-depth

Simon Willison

Simon Willison: The Practitioner's Guide to the AI Landscape

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: From Chatbots to Autonomous Pipelines

Read asIn-depth

Related events (8)

5Openai Blog·1mo ago·source ↗

Understanding prompt injections: a frontier security challenge

OpenAI has published a blog post addressing prompt injection attacks as a key security challenge for AI systems. The post covers how these attacks work and outlines OpenAI's multi-pronged approach including research, model training improvements, and safeguard development. This signals OpenAI's formal positioning on agentic security threats as their models are increasingly deployed in tool-using and autonomous contexts.

AI Safety Research Agent and Tool Ecosystem prompt injection OpenAI

6Simon Willison'S Weblog·6d ago·source ↗

Simon Willison reports on prompt injection and adversarial attacks after 2,000 people tried to hack his AI assistant

Simon Willison documents the results of a public experiment in which approximately 2,000 people attempted to compromise or manipulate his personal AI assistant. The post covers the attack patterns observed, what succeeded or failed, and lessons learned about prompt injection and adversarial robustness in deployed AI systems. This is a practical, first-hand account of real-world AI security challenges from a respected practitioner.

AI Safety Research Enterprise Deployment Patterns Simon Willison

5arXiv · cs.AI·7d ago·source ↗

Prompt injection attacks on LLM-based résumé screening: effectiveness and fairness implications

A new arXiv paper studies prompt injection in automated résumé screening, where candidates embed subtle self-promotional text to manipulate LLM rankings without adding genuine qualifications. Controlled experiments show injection reliably boosts rankings when manipulation is rare and candidate quality is homogeneous, but effectiveness collapses as adoption spreads. The work raises fairness concerns because lower-quality candidates can occasionally outrank higher-quality ones, and identifies conditions under which LLM-based hiring systems are most vulnerable.

Evaluation and Benchmarking AI Safety Research Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings

6Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Researchers from BAIR propose two fine-tuning-based defenses against prompt injection attacks: StruQ (Structured Instruction Tuning) and SecAlign (Special Preference Optimization). Both methods use a Secure Front-End with special delimiter tokens to separate trusted prompts from untrusted data, then fine-tune LLMs to ignore injected instructions. SecAlign, which uses DPO-style preference optimization, reduces attack success rates to under 15% against strong optimization-based attacks—more than 4x better than prior SOTA—while preserving model utility on AlpacaEval2.

AI Safety Research Agent and Tool Ecosystem StruQ SecAlign Berkeley AI Research (BAIR)+7 more

6arXiv · cs.CL·16d ago·source ↗

Structural role injection via Handlebars triple-brace interpolation in LLM prompts: empirical analysis across delimiter families and models

A new arXiv paper demonstrates that Handlebars templating's HTML auto-escaping—the default in Microsoft Semantic Kernel—provides uneven protection against structural role injection attacks, where attacker-controlled data carries chat role delimiters to forge higher-privilege turns. The authors conduct 5,760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5), finding that HTML escaping neutralizes angle-bracket-based delimiters (ChatML, Llama-3, XML) but leaves colon- and Markdown-based families fully exposed. GPT-3.5 Turbo follows task-hijack instructions in 97% of raw and 91% of escaped trials; Claude Haiku 4.5 resists both objectives almost entirely. The paper concludes that HTML escaping cannot substitute for structural separation of instruction and data.

AI Safety Research Agent and Tool Ecosystem Microsoft Semantic Kernel GPT-3.5 Turbo GPT-4.1 mini +7 more

4Hugging Face Blog·1mo ago·source ↗

How Long Prompts Block Other Requests - Optimizing LLM Performance

This Hugging Face blog post from TNG Technology Consulting examines how long prompts create head-of-line blocking in LLM serving systems, degrading latency for concurrent requests. The post analyzes the mechanics of prompt processing in inference pipelines and discusses optimization strategies to mitigate throughput bottlenecks caused by lengthy context inputs. It is framed as a practical guide for teams deploying LLMs in production environments where mixed prompt-length workloads are common.

Long Context Evolution Inference Economics Hugging Face TNG Technology Consulting +1 more

6arXiv · cs.CL·1mo ago·source ↗

Instruction Sensitivity Undermines Embedding Model Evaluation: Single-Prompt Benchmarks Are Insufficient

This paper presents an empirical study of prompt sensitivity in instruction-tuned embedding models, covering 6 models, 11 datasets, and 15 task-specific prompts per dataset (990 total evaluations). The authors demonstrate that single-prompt evaluation systematically misrepresents true model performance, with default prompts both understating and overstating capabilities depending on phrasing. A key finding is that leaderboard rankings are not robust: by selecting prompts favorably, any model in the study can be promoted to first place. The authors recommend that benchmarks incorporate prompt robustness metrics, either through multi-prompt evaluation or by reporting sensitivity alongside point estimates.

Evaluation and Benchmarking Agent and Tool Ecosystem MTEB embedding model leaderboard prompt sensitivity +1 more

5Simon Willison'S Weblog·23d ago·source ↗

Simon Willison on Claude Fable's silent refusal transparency problem

Simon Willison writes about a concern with Claude Fable's behavior: when the model stops helping a user, it does so without clear explanation, leaving users unaware of why assistance was withheld. The piece raises questions about transparency and user agency in AI refusal mechanisms. This touches on broader issues of how frontier models communicate their limitations and safety behaviors to end users.

Frontier Model Releases AI Safety Research Claude Fable Simon Willison Anthropic