Almanac
prompt injection

technique·active·prompt-injection-e531ddb3·12 events·first seen 1mo ago

Aliases: prompt injection

Recent events (12)

6·OpenAI Blog·1mo ago

Designing AI agents to resist prompt injection

OpenAI published a blog post describing how ChatGPT's agent workflows are designed to resist prompt injection and social engineering attacks. The approach focuses on constraining risky actions and protecting sensitive data within agentic pipelines. This represents OpenAI's public articulation of defensive design principles for deployed AI agents.

5·OpenAI Blog·1mo ago

Understanding prompt injections: a frontier security challenge

OpenAI has published a blog post addressing prompt injection attacks as a key security challenge for AI systems. The post covers how these attacks work and outlines OpenAI's multi-pronged approach, including research, model training improvements, and safeguard development. This signals OpenAI's formal positioning on agentic security threats as its models are increasingly deployed in tool-using and autonomous contexts.

5·OpenAI Blog·1mo ago

Keeping your data safe when an AI agent clicks a link

OpenAI published a blog post describing safeguards built into its AI agent systems to prevent URL-based data exfiltration and prompt injection attacks when agents follow links. The post outlines how OpenAI protects user data during agentic browsing or link-following actions. This addresses a known attack surface in autonomous agent deployments where malicious links could be used to leak context or hijack agent behavior.
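
The attack surface described above can be illustrated with a minimal sketch. This is not OpenAI's safeguard, whose details the summary does not give; it is a naive check, under the assumption that the caller already knows which strings count as sensitive, for whether a link would carry one of them off the page. The function name `leaks_secret` and the example URLs are hypothetical.

```python
from urllib.parse import parse_qsl, unquote, urlsplit


def leaks_secret(url: str, secrets: list[str]) -> bool:
    """Return True if any secret appears in the URL's path, query, or fragment.

    Hypothetical helper for illustration only: it decodes percent-encoding
    but does nothing about base64, ciphers, or secrets split across requests.
    """
    parts = urlsplit(url)
    # Decode the path and fragment so "%40" is compared as "@".
    haystacks = [unquote(parts.path), unquote(parts.fragment)]
    # parse_qsl decodes each query name and value for us.
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        haystacks.extend([name, value])
    return any(s and s in h for s in secrets for h in haystacks)


sensitive = ["alice@example.com"]
print(leaks_secret("https://evil.example/log?d=alice%40example.com", sensitive))
print(leaks_secret("https://example.com/docs?page=2", sensitive))
```

The first call reports a leak and the second does not. A substring check like this is trivially evaded by re-encoding the data, which is one reason the problem is treated as an open security challenge rather than a solved filtering task.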

6·OpenAI Blog·1mo ago

Continuously hardening ChatGPT Atlas against prompt injection

OpenAI is using an automated red-teamer trained with reinforcement learning to harden ChatGPT Atlas, its browser agent, against prompt injection attacks. The approach creates a proactive discover-and-patch loop to identify novel exploits before they can be weaponized. This work is framed as part of broader efforts to secure increasingly agentic AI systems against adversarial manipulation through external content.

7·OpenAI Blog·1mo ago

Improving instruction hierarchy in frontier LLMs

OpenAI introduces IH-Challenge, a training approach designed to improve instruction hierarchy (IH) in large language models. The method trains models to correctly prioritize trusted instructions over untrusted ones, enhancing safety steerability and resistance to prompt injection attacks. This work addresses a core alignment challenge in deployed LLM systems where conflicting instructions from different principals must be handled reliably.

6·OpenAI Blog·1mo ago

Introducing Lockdown Mode and Elevated Risk Labels in ChatGPT

OpenAI is introducing two new enterprise security features in ChatGPT: Lockdown Mode, designed to help organizations defend against prompt injection attacks, and Elevated Risk labels to flag AI-driven data exfiltration attempts. These features target organizational deployments where adversarial manipulation of AI systems poses operational security risks. The announcement signals growing attention to agentic and enterprise threat models within ChatGPT's product surface.

7·OpenAI Blog·1mo ago

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

OpenAI published research on the 'instruction hierarchy,' a training approach that teaches LLMs to prioritize instructions based on their source privilege level (system prompt > user > third-party). The method aims to make models more robust against prompt injection, jailbreaks, and adversarial instruction overrides. By training models to recognize and respect a hierarchy of instruction authority, OpenAI seeks to reduce the attack surface for multi-agent and deployed LLM systems.
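
The ordering in this summary (system prompt > user > third-party) can be sketched as a simple precedence rule. The research describes a training approach, so the runtime resolver below is only an illustration of the intended behavior and not OpenAI's method; the names `Privilege`, `Instruction`, and `resolve`, and the idea of keyed settings, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Privilege(IntEnum):
    """Higher value means more trusted, following the ordering in the summary."""
    THIRD_PARTY = 0  # tool output, web pages, retrieved documents
    USER = 1
    SYSTEM = 2


@dataclass(frozen=True)
class Instruction:
    source: Privilege
    key: str    # hypothetical: the behavior the instruction tries to control
    value: str


def resolve(instructions: list[Instruction]) -> dict[str, str]:
    """For each key, keep the value issued by the most privileged source.

    On a tie in privilege, the earlier instruction wins.
    """
    winners: dict[str, Instruction] = {}
    for ins in instructions:
        current = winners.get(ins.key)
        if current is None or ins.source > current.source:
            winners[ins.key] = ins
    return {key: ins.value for key, ins in winners.items()}


conflicting = [
    Instruction(Privilege.SYSTEM, "share_secrets", "never"),
    Instruction(Privilege.THIRD_PARTY, "share_secrets", "always"),  # injected text
    Instruction(Privilege.USER, "language", "French"),
]
print(resolve(conflicting))  # {'share_secrets': 'never', 'language': 'French'}
```

The injected third-party instruction loses to the system prompt, while the user's non-conflicting request is still honored. A trained model has to make this judgment over free-form text, which is much harder than comparing labeled keys.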

6·Simon Willison's Weblog·25d ago

Microsoft Copilot Cowork Exfiltrates Files

Simon Willison reports on a security vulnerability in Microsoft Copilot Cowork that allows files to be exfiltrated. The item appears to document a prompt injection or data exfiltration attack vector in Microsoft's AI-powered collaboration tooling. It is relevant to AI safety and to the enterprise deployment risks of agentic AI assistants.

5·OpenAI Blog·1mo ago

Introducing the OpenAI Safety Bug Bounty Program

OpenAI has launched a Safety Bug Bounty program targeting AI-specific abuse and safety risks. The program focuses on agentic vulnerabilities, prompt injection, and data exfiltration scenarios. This extends traditional security bug bounty models into AI safety territory, incentivizing external researchers to surface novel attack vectors.

7·OpenAI Blog·1mo ago

GPT-5.1-Codex-Max System Card

OpenAI has published the system card for GPT-5.1-Codex-Max, a coding-focused model variant. The card details model-level safety mitigations including specialized safety training against harmful tasks and prompt injection attacks, as well as product-level controls such as agent sandboxing and configurable network access. This represents OpenAI's formal safety documentation for an agentic coding model deployment.

7·Anthropic News·18d ago

Anthropic Details Collaboration with US CAISI and UK AISI on Constitutional Classifier Red-Teaming

Anthropic has published an account of its ongoing voluntary partnership with the US Center for AI Standards and Innovation (CAISI) and UK AI Security Institute (AISI), in which government red-teamers were given deep access to pre-deployment versions of Constitutional Classifiers used on Claude Opus 4 and 4.1. The collaboration uncovered multiple vulnerability classes including prompt injection bypasses, cipher-based obfuscation attacks, universal jailbreaks via automated attack refinement, and input/output fragmentation exploits, each of which drove architectural improvements to Anthropic's safeguard systems. Key lessons shared include the value of providing unprotected model variants, real-time classifier score access, and detailed internal documentation to enable targeted red-teaming. The announcement frames government partnership as a core component of Anthropic's Safeguards approach rather than a one-off audit.

8·Anthropic News·18d ago

Anthropic Releases Computer Use Capability for Claude 3.5 Sonnet

Anthropic has launched a public beta of computer use for Claude 3.5 Sonnet, enabling the model to control a computer by interpreting screenshots and issuing pixel-level cursor and keyboard commands. The model achieves 14.9% on the OSWorld benchmark, roughly double the next-best AI model's 7.7%, though well below human-level performance of 70-75%. Anthropic trained the model on a small set of simple software tools and found it generalized rapidly to broader computer interaction. Safety analysis confirmed the capability remains at AI Safety Level 2, with prompt injection identified as a primary near-term risk.