Structural role injection via Handlebars triple-brace interpolation in LLM prompts: empirical analysis across delimiter families and models
A new arXiv paper demonstrates that Handlebars templating's HTML auto-escaping—the default in Microsoft Semantic Kernel—provides uneven protection against structural role injection attacks, where attacker-controlled data carries chat role delimiters to forge higher-privilege turns. The authors conduct 5,760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5), finding that HTML escaping neutralizes angle-bracket-based delimiters (ChatML, Llama-3, XML) but leaves colon- and Markdown-based families fully exposed. GPT-3.5 Turbo follows task-hijack instructions in 97% of raw and 91% of escaped trials; Claude Haiku 4.5 resists both objectives almost entirely. The paper concludes that HTML escaping cannot substitute for structural separation of instruction and data.
Related guides (4)
Related events (8)
Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)
Researchers from BAIR propose two fine-tuning-based defenses against prompt injection attacks: StruQ (Structured Instruction Tuning) and SecAlign (Special Preference Optimization). Both methods use a Secure Front-End with special delimiter tokens to separate trusted prompts from untrusted data, then fine-tune LLMs to ignore injected instructions. SecAlign, which uses DPO-style preference optimization, reduces attack success rates to under 15% against strong optimization-based attacks—more than 4x better than prior SOTA—while preserving model utility on AlpacaEval2.
The Shibboleth Effect: Cross-lingual behavioral skew in frontier LLMs under adversarial geopolitical simulation
Researchers introduce the 'Shibboleth Effect' — systematic behavioral differences in LLMs when operating in different languages — and audit six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, DeepSeek-R1) using a synthetic maritime territorial dispute wargame played in English versus Turkish. Results are heterogeneous: Llama-4 becomes significantly more coercive in Turkish while Gemini-3.1-Pro and DeepSeek-R1 become less so, and GPT-4o shows no detectable shift. The study identifies two candidate buffering mechanisms — chain-of-thought institutional anchoring and multilingual RLHF alignment — with direct implications for deploying LLMs in diplomatic or crisis-management contexts.
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
LASH is a black-box jailbreak framework that adaptively composes outputs from multiple existing attack families into hybrid prompts using a genetic optimizer with a two-stage fitness function. Evaluated on JailbreakBench across six target models, LASH achieves 84.5% attack success rate (keyword-based) and 74.5% (LLM-judge) with only 30 mean target queries, outperforming five state-of-the-art baselines. The work demonstrates that no single jailbreak family dominates across models and harm categories, and that adaptive cross-strategy composition is a promising red-teaming direction. Results hold under three defense mechanisms.
Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models
This paper proposes Semantic Triplet Restoration (STR), a table serialization protocol that rewrites each cell as an atomic fact <item path, feature path, value> to make header-cell alignments explicit for LLMs, replacing HTML/Markdown representations. The authors also introduce TripletQL, a query-aware router that selects relevant triplets per question. Evaluated on four Chinese and English table-QA benchmarks, STR matches or outperforms HTML-based baselines while reducing input token count. Benefits are most pronounced for smaller models and longer tables, suggesting value under constrained inference budgets.
Chat Templates: An End to the Silent Performance Killer
This Hugging Face blog post addresses the problem of inconsistent chat formatting across language models, where mismatched prompt templates silently degrade model performance. It introduces a standardized chat template system in the transformers library that encodes each model's expected conversation format directly into its tokenizer. The post argues that using the wrong chat format can cause significant but hard-to-detect performance drops, making standardization critical for reliable deployment.
Continuously hardening ChatGPT Atlas against prompt injection
OpenAI is applying automated red teaming trained with reinforcement learning to harden ChatGPT Atlas, its browser agent, against prompt injection attacks. The approach creates a proactive discover-and-patch loop to identify novel exploits before they can be weaponized. This work is framed as part of broader efforts to secure increasingly agentic AI systems against adversarial manipulation of external content.
Anthropic Details Collaboration with US CAISI and UK AISI on Constitutional Classifier Red-Teaming
Anthropic has published an account of its ongoing voluntary partnership with the US Center for AI Standards and Innovation (CAISI) and UK AI Security Institute (AISI), in which government red-teamers were given deep access to pre-deployment versions of Constitutional Classifiers used on Claude Opus 4 and 4.1. The collaboration uncovered multiple vulnerability classes including prompt injection bypasses, cipher-based obfuscation attacks, universal jailbreaks via automated attack refinement, and input/output fragmentation exploits, each of which drove architectural improvements to Anthropic's safeguard systems. Key lessons shared include the value of providing unprotected model variants, real-time classifier score access, and detailed internal documentation to enable targeted red-teaming. The announcement frames government partnership as a core component of Anthropic's Safeguards approach rather than a one-off audit.
Google Study Shows LLM-Generated Malware Is Getting Harder to Track and Stop
A Google security report catalogs emerging LLM-enabled cyberattack techniques including morphing malware with mutation engines, logical-flaw discovery in code, and AI-directed obfuscation networks. The report was prompted in part by a real incident where hackers used an LLM to find a zero-day in a widely used web administration tool. Separately, the UK AI Security Institute found that Claude Mythos Preview and GPT-5.5 can reliably execute attacks expected to take humans 3 hours, up from earlier 1-hour benchmarks, with performance scaling further when token limits are relaxed. The findings suggest an accelerating gap between LLM offensive capability and conventional defensive tooling.



