5Latent Space (swyx)·3h ago

Gray Swan founders Zico Kolter and Matt Fredrikson on AI red-teaming and security

OpenAI board member Zico Kolter and Gray Swan CEO Matt Fredrikson appear on the Latent Space podcast to discuss AI security and red-teaming, arguing the field is distinct from conventional cybersecurity. Gray Swan is a company focused on AI safety and adversarial robustness. The conversation follows Gray Swan's 'Mythos' work and addresses how AI-specific threat models differ from traditional security paradigms.

Evaluation and Benchmarking AI Safety Research Gray Swan Matt Fredrikson OpenAI Zico Kolter Latent Space

Related guides (3)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

OpenAI

OpenAI: The Lab That Made AI a Household Word

Read asBeginner In-depth

Related events (8)

6Openai Blog·1mo ago·source ↗

Advancing Red Teaming with People and AI

OpenAI published a blog post describing advances in their red teaming methodology, combining human red teamers with AI-assisted approaches. The post outlines how AI tools are being integrated into the red teaming pipeline to improve coverage and efficiency of safety evaluations. This represents an evolution in OpenAI's pre-deployment safety testing practices.

Evaluation and Benchmarking AI Safety Research AI-assisted red teaming OpenAI human red teaming +1 more

6Anthropic News·18d ago·source ↗

Anthropic details red teaming methods and calls for standardized AI testing practices

Anthropic published a detailed overview of red teaming approaches used to test Claude and other AI systems, covering domain-specific expert testing, automated red teaming, multilingual/multicultural testing, and multimodal red teaming. The post documents empirical findings about when each method is appropriate, highlights partnerships with organizations like Thorn, Institute for Strategic Dialogue, and Singapore's IMDA, and closes with policy recommendations for building a standardized AI testing ecosystem. The piece is notable for its operational specificity and its explicit call for industry-wide standards to enable cross-system safety comparisons.

Evaluation and Benchmarking AI Safety Research Thorn Claude AI Verify Foundation +6 more

4Don'T Worry About The Vase·1mo ago·source ↗

Cyber Lack of Security and AI Governance

Zvi Mowshowitz's commentary addresses the intersection of AI capabilities and cybersecurity, framing recent developments around GPT-5.5 and a 'Mythos Moment' as catalysts for both internet security patching efforts and emerging AI regulatory frameworks. The piece situates cybersecurity as the underreported background story of current AI progress. It appears to analyze governance and safety implications of frontier model releases in the context of cyber vulnerabilities.

Frontier Model Releases AI Safety Research Mythos Moment OpenAI Zvi Mowshowitz +2 more

4Openai Blog·1mo ago·source ↗

Security on the path to AGI

OpenAI published a post outlining its approach to security as the organization advances toward AGI. The piece describes how security measures are being built directly into infrastructure and models proactively. The content is high-level and framing-oriented, with limited technical specifics visible in the excerpt.

Training Infrastructure AI Safety Research AGI OpenAI

7Anthropic News·19d ago·source ↗

Anthropic publishes frontier threats red teaming methodology and biosecurity findings

Anthropic describes its 'frontier threats red teaming' program, sharing methodology and high-level findings from a 150+ hour biosecurity red-teaming project conducted with domain experts. The team found that current frontier models can sometimes produce expert-level biological information, that risks are likely to grow as models scale and gain tool access, and that unmitigated LLMs could accelerate bioweapon-related misuse within two to three years. Mitigations including training-process changes and classifier-based filters have been deployed, and Anthropic is sharing findings with governments and other labs while calling for more independent red-teaming efforts.

Frontier Model Releases AI Safety Research Dario Amodei Constitutional AI Anthropic

4Openai Blog·1mo ago·source ↗

Strengthening cyber resilience as AI capabilities advance

OpenAI published a post outlining its approach to cybersecurity risk as its models grow more capable, covering risk assessment frameworks, misuse mitigation, and collaboration with the security community. The piece addresses both offensive risk (AI-enabled attacks) and defensive applications. It represents OpenAI's public positioning on responsible deployment in a high-stakes domain.

AI Safety Research Enterprise Deployment Patterns OpenAI

7Anthropic News·20d ago·source ↗

Anthropic Details Collaboration with US CAISI and UK AISI on Constitutional Classifier Red-Teaming

Anthropic has published an account of its ongoing voluntary partnership with the US Center for AI Standards and Innovation (CAISI) and UK AI Security Institute (AISI), in which government red-teamers were given deep access to pre-deployment versions of Constitutional Classifiers used on Claude Opus 4 and 4.1. The collaboration uncovered multiple vulnerability classes including prompt injection bypasses, cipher-based obfuscation attacks, universal jailbreaks via automated attack refinement, and input/output fragmentation exploits, each of which drove architectural improvements to Anthropic's safeguard systems. Key lessons shared include the value of providing unprotected model variants, real-time classifier score access, and detailed internal documentation to enable targeted red-teaming. The announcement frames government partnership as a core component of Anthropic's Safeguards approach rather than a one-off audit.

Frontier Model Releases Evaluation and Benchmarking Constitutional Classifiers prompt injection Claude Opus 4.6 +6 more

8Anthropic News·19d ago·source ↗

Anthropic Frontier Red Team reports early-warning signs of rapid AI progress in cybersecurity and biosecurity capabilities

Anthropic's Frontier Red Team published findings from a year of safety evaluations across four model releases, documenting rapid capability gains in dual-use domains. In cybersecurity, Claude 3.7 Sonnet now solves roughly a third of Cybench CTF challenges (up from ~5% a year ago), and with the Incalmo toolset was able to replicate a large-scale network attack in realistic cyber range environments. In biosecurity, Claude has moved from underperforming virology experts to exceeding them on the VCT benchmark within one year, and exceeds human expert baselines on cloning workflows. Anthropic assesses current models as showing 'early warning' signs but not yet crossing thresholds of substantially elevated national security risk.

Frontier Model Releases Evaluation and Benchmarking Intercode CTF Carnegie Mellon University LabBench +7 more