6Berkeley AI Research (BAIR) Blog·1mo ago

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Researchers from BAIR propose two fine-tuning-based defenses against prompt injection attacks: StruQ (Structured Instruction Tuning) and SecAlign (Special Preference Optimization). Both methods use a Secure Front-End with special delimiter tokens to separate trusted prompts from untrusted data, then fine-tune LLMs to ignore injected instructions. SecAlign, which uses DPO-style preference optimization, reduces attack success rates to under 15% against strong optimization-based attacks—more than 4x better than prior SOTA—while preserving model utility on AlpacaEval2.

AI Safety Research Agent and Tool Ecosystem Alignment and RLHF StruQ SecAlign Berkeley AI Research (BAIR)Instruction Hierarchy Llama3-8B-Instruct Direct Preference Optimization (DPO)AlpacaEval 2 OpenAI Sizhe Chen

Related guides (4)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Read asBeginner In-depth

Direct Preference Optimization (DPO)Concept

Direct Preference Optimization (DPO): Reward-Free Alignment for LLMs

Read asIn-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Read asIn-depth

Related events (8)

5Openai Blog·1mo ago·source ↗

Understanding prompt injections: a frontier security challenge

OpenAI has published a blog post addressing prompt injection attacks as a key security challenge for AI systems. The post covers how these attacks work and outlines OpenAI's multi-pronged approach including research, model training improvements, and safeguard development. This signals OpenAI's formal positioning on agentic security threats as their models are increasingly deployed in tool-using and autonomous contexts.

AI Safety Research Agent and Tool Ecosystem prompt injection OpenAI

6arXiv · cs.CL·3d ago·source ↗

Structural role injection via Handlebars triple-brace interpolation in LLM prompts: empirical analysis across delimiter families and models

A new arXiv paper demonstrates that Handlebars templating's HTML auto-escaping—the default in Microsoft Semantic Kernel—provides uneven protection against structural role injection attacks, where attacker-controlled data carries chat role delimiters to forge higher-privilege turns. The authors conduct 5,760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5), finding that HTML escaping neutralizes angle-bracket-based delimiters (ChatML, Llama-3, XML) but leaves colon- and Markdown-based families fully exposed. GPT-3.5 Turbo follows task-hijack instructions in 97% of raw and 91% of escaped trials; Claude Haiku 4.5 resists both objectives almost entirely. The paper concludes that HTML escaping cannot substitute for structural separation of instruction and data.

AI Safety Research Agent and Tool Ecosystem Microsoft Semantic Kernel GPT-3.5 Turbo GPT-4.1 mini +7 more

6Openai Blog·1mo ago·source ↗

Designing AI agents to resist prompt injection

OpenAI published a blog post describing how ChatGPT's agent workflows are designed to resist prompt injection and social engineering attacks. The approach focuses on constraining risky actions and protecting sensitive data within agentic pipelines. This represents OpenAI's public articulation of defensive design principles for deployed AI agents.

AI Safety Research Enterprise Deployment Patterns prompt injection ChatGPT social engineering +2 more

5Hugging Face Blog·1mo ago·source ↗

Preference Tuning LLMs with Direct Preference Optimization Methods

A Hugging Face blog post surveys Direct Preference Optimization (DPO) and related preference tuning methods for aligning large language models. The post covers the landscape of DPO variants and their practical application via the TRL library. It serves as a technical reference for practitioners implementing RLHF alternatives.

Agent and Tool Ecosystem Alignment and RLHF Reinforcement Learning from Human Feedback Direct Preference Optimization (DPO)Hugging Face +1 more

5arXiv · cs.LG·46h ago·source ↗

Probe-and-Refine Tuning improves coding agent performance via iterative repository guidance refinement

A new arXiv paper introduces probe-and-refine tuning, a procedure that uses synthetic bug-fix probes to iteratively improve AGENTS.md repository guidance files for LLM-based coding agents without requiring an agent loop during tuning. Evaluated on SWE-bench Verified with Qwen3.5-35B-A3B, the method achieves 33.0% mean resolve rate versus 28.3% for a static knowledge base baseline and 25.5% for an unguided baseline. The improvement is attributed to coverage gains—refined guidance helps agents locate the correct files rather than improving patch quality—and a step-budget experiment shows guidance is necessary for agents to productively use larger compute budgets.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen3.5-35B-A3B SWE-Bench Verified NVIDIA Nemotron-3-Nano-30B-A3B +2 more

6arXiv · cs.AI·18d ago·source ↗

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer proposes a safety alignment method that targets only 'safety tokens' in the output distribution rather than applying global fine-tuning, arguing that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to selected safety tokens during training. The approach achieves strong safety performance across seven benchmarks with minimal capability degradation, requiring only 100 harmful samples—less than 1% of data used by prior baselines.

Evaluation and Benchmarking AI Safety Research on-policy distillation SafeSteer alignment tax +4 more

4Hugging Face Blog·1mo ago·source ↗

Improving Prompt Consistency with Structured Generations

This Hugging Face blog post examines how structured generation outputs can improve consistency in LLM evaluation pipelines. It explores techniques for constraining model outputs to specific formats, reducing variability in prompt-based assessments. The post addresses a practical challenge in evaluation workflows where inconsistent response formats degrade measurement reliability.

Evaluation and Benchmarking Agent and Tool Ecosystem LLM evaluation structured output generation Hugging Face

7Openai Blog·1mo ago·source ↗

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

OpenAI published research on the 'instruction hierarchy,' a training approach that teaches LLMs to prioritize instructions based on their source privilege level (system prompt > user > third-party). The method aims to make models more robust against prompt injection, jailbreaks, and adversarial instruction overrides. By training models to recognize and respect a hierarchy of instruction authority, OpenAI seeks to reduce the attack surface for multi-agent and deployed LLM systems.

AI Safety Research Enterprise Deployment Patterns prompt injection Instruction Hierarchy OpenAI +3 more