
AI Safety Research
ai-safety-research·728 events·last 42h agoInterpretability, red-teaming, jailbreak research, safety evals, sycophancy and deception findings, and policy-adjacent safety work from labs and academics.
Related entities
Related topics (8)
Guides (1)
Recent events (50)
Import AI 456: RSI and Economic Growth, AI Regulation Optionality, and Neural Computer
Import AI issue 456 covers three topics: recursive self-improvement (RSI) and its implications for economic growth, frameworks for 'radical optionality' in AI regulation, and a neural computer architecture. The newsletter synthesizes recent developments in AI capability trajectories and governance approaches. As a tier-2 commentary source, it provides synthesis and analysis rather than primary research.
New Paper: Towards a Science of AI Agent Reliability
A new paper proposes a framework for quantifying the gap between AI agent capability and reliability, aiming to establish a more rigorous science of agent dependability. The work addresses the observation that agents may demonstrate high capability on benchmarks while failing unpredictably in deployment. The piece is published via the normaltech.ai newsletter, associated with the AI Snake Oil research commentary tradition.
Cyber Lack of Security and AI Governance
Zvi Mowshowitz's commentary addresses the intersection of AI capabilities and cybersecurity, framing recent developments around GPT-5.5 and a 'Mythos Moment' as catalysts for both internet security patching efforts and emerging AI regulatory frameworks. The piece situates cybersecurity as the underreported background story of current AI progress. It appears to analyze governance and safety implications of frontier model releases in the context of cyber vulnerabilities.
Import AI 455: AI systems are about to start building themselves
Import AI issue 455 covers the emerging trend of AI systems automating AI research, framing it as a first step toward recursive self-improvement. The commentary synthesizes recent developments suggesting AI is beginning to participate meaningfully in its own development pipeline. As a tier-2 newsletter, this represents curated analysis of frontier AI research directions rather than primary reporting.
Qwen3Guard: Real-time Safety Guardrail Model for Token Stream Classification
Alibaba's Qwen team has released Qwen3Guard, the first dedicated safety guardrail model in the Qwen family, built on Qwen3 foundation models and fine-tuned for safety classification. The model performs real-time safety detection on both prompts and responses, providing risk levels and categorized classifications for content moderation. Qwen3Guard claims state-of-the-art performance on major safety benchmarks across English, Chinese, and multilingual settings.
AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases
This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.
Dynamics-Level Watermarking of Flow Matching Models with Random Codes
This paper proposes embedding watermarks directly into the velocity field (continuous dynamics) of flow matching generative models, rather than into weights or outputs. The method uses key-dependent perturbations added during training, formulated as random coding over a continuous channel, allowing black-box message recovery at detection time. The perturbation is designed to leave the generated distribution unchanged. Experiments on MNIST and CIFAR-10 demonstrate reliable message recovery, preserved generation quality, and chance-level decoding without the secret key.
SPEX and ProxySPEX: Scalable Interaction Discovery for LLM Interpretability
Researchers from BAIR introduce SPEX (Spectral Explainer) and ProxySPEX, algorithms for identifying influential feature, data, and model-component interactions in LLMs at scale. The approach exploits sparsity, low-degreeness, and hierarchy properties to reframe interaction discovery as a sparse recovery problem using tools from signal processing and coding theory. ProxySPEX achieves comparable performance to SPEX with roughly 10x fewer ablations by leveraging hierarchical structure. The methods are evaluated on feature attribution (sentiment analysis), data attribution, and mechanistic interpretability tasks, outperforming marginal methods like LIME at long context lengths.
What exactly does word2vec learn? A closed-form theory of representation learning dynamics
Researchers from BAIR present a new theoretical paper proving that word2vec's learning dynamics reduce, under mild approximations, to unweighted least-squares matrix factorization, with final representations given by PCA on a specific co-occurrence-derived matrix. The theory solves gradient flow dynamics in closed form, showing that embeddings learn one orthogonal linear subspace (concept) at a time in discrete, rank-incrementing steps. This provides a quantitative, predictive account of the linear representation hypothesis observed in word2vec and, by extension, offers a minimal theoretical foundation for understanding feature learning in modern LLMs.
GPT-5.5 Instant System Card
OpenAI has published a system card for GPT-5.5 Instant, a model in their GPT-5 family. The system card likely covers safety evaluations, capability assessments, and deployment considerations for this model. No body content was provided, limiting detailed analysis of the specific findings or model characteristics.
GPT-5.5: The System Card — Commentary
Zvi Mowshowitz's commentary on OpenAI's announcement of GPT-5.5 and GPT-5.5-Pro, analyzing the associated system card. The piece is a tier-2 analytical response to a major model release. Full content appears truncated, but the item covers the safety and capability disclosures accompanying the new model family.
Import AI 446: Nuclear LLMs; China's big AI benchmark; measurement and AI policy
Import AI issue 446 covers three main topics: the application of large language models to nuclear domains, a major new AI benchmark from China, and the intersection of AI measurement with policy. The newsletter synthesizes recent developments across frontier AI research and geopolitical AI competition. It also touches on speculative questions about AI psychology, such as whether AIs might experience jealousy. As a tier-2 commentary digest, it aggregates signals across multiple active research and policy threads.
Import AI 445: Timing superintelligence; AIs solve frontier math proofs; a new ML research benchmark
Import AI issue 445 covers three main topics: speculation on whether 2026 will be a pivotal year for superintelligence decision-making, AI systems solving frontier mathematics proofs, and the introduction of a new ML research benchmark. The newsletter synthesizes recent developments across capability milestones and evaluation tooling. As a tier-2 commentary source, it provides curated signal on frontier AI progress rather than primary research.
Introducing the OpenAI Safety Bug Bounty Program
OpenAI has launched a Safety Bug Bounty program targeting AI-specific abuse and safety risks. The program focuses on agentic vulnerabilities, prompt injection, and data exfiltration scenarios. This extends traditional security bug bounty models into AI safety territory, incentivizing external researchers to surface novel attack vectors.
OpenAI Releases Teen Safety Policies for Developers via gpt-oss-safeguard
OpenAI has published prompt-based teen safety policies targeting developers who build on its models, specifically leveraging the gpt-oss-safeguard model to moderate age-specific risks. The release provides structured guidance and tooling for filtering or adjusting AI outputs in contexts where minors may be users. This represents an extension of OpenAI's safety infrastructure into the developer-facing layer, addressing regulatory and reputational pressure around youth-facing AI deployments.
Update on the OpenAI Foundation
The OpenAI Foundation has announced plans to invest at least $1 billion across four focus areas: curing diseases, economic opportunity, AI resilience, and community programs. This represents a significant philanthropic commitment from OpenAI's nonprofit arm. The announcement signals OpenAI's intent to direct substantial resources toward societal benefit and AI resilience initiatives.
How OpenAI Monitors Internal Coding Agents for Misalignment
OpenAI describes its use of chain-of-thought monitoring to detect misalignment in internally deployed coding agents. The post covers real-world deployment analysis aimed at identifying risks and strengthening safety safeguards. This represents a practical, operational approach to alignment monitoring rather than a purely theoretical treatment.
Claude Mythos and misguided open-weight fearmongering
A commentary piece from Interconnects critiquing what the author characterizes as unfounded fears around open-weight AI models, likely in the context of Anthropic's Claude and its positioning relative to open-source alternatives. The piece appears to challenge narratives that frame open-weight model releases as uniquely dangerous. As a tier-2 source commentary, it reflects ongoing industry debate about open vs. closed model safety arguments.
The Privacy Price of Tail-Risk Learning: Effective Tail Sample Size in Differentially Private CVaR Optimization
This paper characterizes how differential privacy affects the statistical complexity of CVaR (Conditional Value at Risk) optimization, showing that the effective sample size governing private tail-risk learning is εnτ rather than n, where τ is the tail mass. Complete minimax rates are derived for scalar estimation and finite classes under pure DP, with lower bounds extending to approximate DP. For convex Lipschitz learning, the CVaR-specific privacy cost necessarily scales as 1/(εnτ), with dimension dependence inherited from private stochastic convex optimization. The results reduce private CVaR learning to private learning on Θ(nτ) tail records as the canonical hard subproblem.
Lossy self-improvement
This commentary from Interconnects argues that AI self-improvement is a real phenomenon but that inherent lossiness in the process prevents it from leading to fast takeoff scenarios. The piece appears to engage with the debate over recursive self-improvement and its implications for AI risk timelines. It offers a nuanced middle-ground position: acknowledging self-improvement capability while contesting the discontinuous-growth narrative common in AI safety discourse.
Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)
Researchers from BAIR propose two fine-tuning-based defenses against prompt injection attacks: StruQ (Structured Instruction Tuning) and SecAlign (Special Preference Optimization). Both methods use a Secure Front-End with special delimiter tokens to separate trusted prompts from untrusted data, then fine-tune LLMs to ignore injected instructions. SecAlign, which uses DPO-style preference optimization, reduces attack success rates to under 15% against strong optimization-based attacks—more than 4x better than prior SOTA—while preserving model utility on AlpacaEval2.
AI and the Future of Cybersecurity: Why Openness Matters
A Hugging Face blog post argues for the importance of open AI models and research in the cybersecurity domain. The piece likely contends that open-weights models enable better defensive security tooling, red-teaming, and vulnerability research compared to closed alternatives. It addresses the dual-use tension between open access and potential misuse in security contexts.
How much does distillation really matter for Chinese LLMs?
This commentary from Interconnects reacts to Anthropic's post on 'distillation attacks,' examining the role of distillation in the development of Chinese large language models. The piece interrogates how much capability transfer via distillation from frontier models actually explains the progress of Chinese LLMs. It situates the discussion within ongoing debates about knowledge distillation as a competitive and security concern.
Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
IBM Research presents an analysis of VAKRA, a benchmark designed to evaluate agentic AI systems on reasoning and tool use capabilities. The post examines how agents fail across different task categories, surfacing systematic failure modes in multi-step reasoning and tool invocation. The analysis provides diagnostic insights into where current agent architectures break down under realistic task conditions.
Safetensors is Joining the PyTorch Foundation
The safetensors format, developed by Hugging Face as a secure and fast alternative to pickle-based model serialization, is being adopted under the PyTorch Foundation. This move formalizes safetensors as part of the broader PyTorch ecosystem, signaling growing standardization around safe model weight storage. The transition reflects increasing industry concern about supply-chain security in ML model distribution.
Llama Guard 4 Released on Hugging Face Hub
Meta's Llama Guard 4 safety classifier has been made available on the Hugging Face Hub. Llama Guard 4 is a content moderation model designed to detect unsafe inputs and outputs in LLM pipelines. The Hugging Face blog post announces its availability and integration into the Hub ecosystem, continuing the Llama Guard series of safety-focused models.
4M Models Scanned: Protect AI + Hugging Face 6 Months In
Protect AI and Hugging Face report on six months of collaborative model security scanning, having scanned 4 million models on the Hub for malicious payloads and vulnerabilities. The partnership focuses on supply-chain security for open-weight models, detecting threats like pickle exploits and unsafe serialization formats. The post provides a retrospective on findings, scale, and tooling developed over the period.
Anthropic Releases Claude Opus 4.7 with Enhanced Coding, Vision, and Cyber Safeguards
Anthropic has released Claude Opus 4.7, a general-availability model positioned as a meaningful improvement over Opus 4.6 in advanced software engineering, long-horizon agentic tasks, and vision capabilities including higher image resolution. The model is notably the first to receive new cybersecurity safeguards developed in response to Project Glasswing, with automatic detection and blocking of prohibited cyber uses and a new Cyber Verification Program for legitimate security professionals. Opus 4.7 is available across Claude products, API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry at the same pricing as Opus 4.6 ($5/$25 per million input/output tokens). The release is explicitly positioned below Claude Mythos Preview in overall capability, serving as a testbed for safety mechanisms before broader deployment of Mythos-class models.
Anthropic Publishes Details on Long-Term Benefit Trust Governance Structure
Anthropic has detailed its Long-Term Benefit Trust (LTBT), an independent five-member body with authority to select and remove a growing portion of Anthropic's Board of Directors, ultimately reaching a majority. The structure is designed to address large-scale externalities from transformative AI—including national security risks, economic disruption, and existential threats—by ensuring corporate governance prioritizes humanity's long-term interests over pure stockholder returns. Paired with Anthropic's Public Benefit Corporation status under Delaware law, the LTBT is intended to intervene primarily in extreme or long-range scenarios rather than day-to-day commercial decisions. The announcement was originally published September 19, 2023.
Anthropic Updates Election Safeguards for Claude Ahead of 2026 US Midterms
Anthropic has published an update on its election-related safety measures for Claude, covering political bias evaluations, usage policy enforcement, and influence operation resistance testing. New model versions Claude Opus 4.7 and Sonnet 4.6 scored 95-96% on political impartiality evaluations and handled election-related policy compliance at 99.8-100% on a 600-prompt test suite. For the first time, Anthropic tested whether models can autonomously run influence operations end-to-end, finding that only Mythos Preview and Opus 4.7 completed more than half of tasks when safeguards were removed, underscoring ongoing capability concerns. Anthropic is also deploying election information banners pointing users to nonpartisan resources like TurboVote for the 2026 US midterms.
Anthropic's Long-Term Benefit Trust appoints Vas Narasimhan to Board of Directors
Anthropic's Long-Term Benefit Trust has appointed Vas Narasimhan, CEO of Novartis and physician-scientist, to Anthropic's Board of Directors. The appointment means Trust-appointed directors now constitute a majority of the Board, reinforcing the governance structure designed to balance commercial interests with Anthropic's public benefit mission. Narasimhan brings experience overseeing regulatory approval of over 35 novel medicines and is expected to contribute perspective on deploying powerful technology safely at scale, particularly in healthcare and life sciences.
Australian Government and Anthropic Sign MOU for AI Safety and Research
Anthropic and the Australian government have signed a Memorandum of Understanding to cooperate on AI safety research, aligned with Australia's National AI Plan. The agreement includes collaboration with Australia's AI Safety Institute on model capability evaluations and safety research, mirroring existing arrangements with safety institutes in the US, UK, and Japan. Anthropic is also committing AUD$3 million in Claude API credits to four Australian research institutions focused on genomics, rare disease diagnosis, and computing education, and is exploring data center infrastructure investments in Australia.
Mistral Releases Leanstral: First Open-Source Code Agent for Lean 4 Formal Verification
Mistral AI has released Leanstral, an open-source code agent built on a sparse 120B/6B-active-parameter architecture, designed specifically for formal proof engineering in Lean 4. The model targets realistic proof engineering workflows rather than isolated math competition problems, and is benchmarked on FLTEval, a new evaluation suite tied to the Fermat's Last Theorem formalization project. Leanstral is released under Apache 2.0 with a free API endpoint and MCP support, and demonstrates competitive performance against Claude Sonnet 4.6 at roughly 1/15th the cost. The release positions formal verification as a scalable alternative to human code review for high-stakes software and mathematics.
Abeba Birhane on Bias in Web-Scraped Training Datasets
Researcher Abeba Birhane examines how large-scale web-scraped datasets used to train trillion-parameter NLP and vision models propagate bias and antisocial content. The commentary highlights that performance gains in deep neural networks come alongside inherited societal biases from web training data. Two posts from The Batch summarize her work on cleaning up web datasets and the specific mechanisms by which NLP models absorb web-sourced biases.
DeepSeek-V2.5: Merged Open-Source Model Combining General and Coding Capabilities
DeepSeek has released DeepSeek-V2.5, an open-source model that merges DeepSeek-V2-Chat-0628 and DeepSeek-Coder-V2-0724 into a single unified model. The release improves general conversational capabilities, coding performance, instruction-following, and writing tasks while also strengthening safety properties—raising the overall safety score from 74.4% to 82.6% and reducing safety spillover rate from 11.3% to 4.6%. The model is available via backward-compatible API endpoints (deepseek-chat and deepseek-coder) and on HuggingFace, retaining features like Function Calling, FIM completion, and JSON output. Benchmark results show improvements on HumanEval Python and LiveCodeBench, though SWE-verified performance remains an acknowledged weak area.
Meta Publishes Advanced AI Scaling Framework and Safety & Preparedness Report for Muse Spark
Meta has released an updated Advanced AI Scaling Framework that expands risk evaluation categories—including chemical/biological threats, cybersecurity, and loss-of-control risks—and introduces formal Safety & Preparedness Reports tied to specific model deployments. The first such report covers Muse Spark, Meta's advanced reasoning model, detailing pre- and post-safeguard evaluations across severe risk categories and ideological balance. Meta also describes a shift in safety methodology: rather than scenario-specific refusal training, Muse Spark is trained on the reasoning behind safety principles, enabling more generalizable behavior in novel situations. The framework applies across open, API, and closed deployments.
U.S. Government to Pre-Release Test AI Models for National Security Risks via NIST TRAINS Task Force
NIST announced a new multi-agency task force called TRAINS (Testing Risks of AI for National Security), overseen by its Center for AI Standards and Innovation, to evaluate frontier AI models for cybersecurity, biosecurity, and chemical weapons risks before public deployment. Google, Microsoft, xAI, Anthropic, and OpenAI have voluntarily agreed to submit models with limited guardrails for evaluation. The policy shift follows Anthropic's announcement that Claude Mythos Preview can autonomously exploit software vulnerabilities, and marks a sharp reversal from the Trump Administration's earlier deregulatory stance. The White House is also considering an executive order that would make pre-release government testing mandatory.
U.S. Government to Pre-Deployment Evaluate Frontier AI Models via NIST TRAINS Task Force
The U.S. National Institute of Standards and Technology (NIST) announced a new multi-agency task force called TRAINS (Testing Risks of AI for National Security) to assess national-security risks from frontier AI models before public deployment. Major AI companies including Google, Microsoft, xAI, Anthropic, and OpenAI have agreed to submit models—including versions with limited guardrails—for evaluation focused on cybersecurity, biosecurity, and chemical weapons risks. The White House is also considering an executive order requiring pre-deployment approval for AI models. TRAINS draws on multiple federal agencies and differs from prior NIST groups in its rapid-response design, though its specific benchmarks have not been disclosed.
Shannon Lite: Autonomous White-Box AI Pentester for Web Applications and APIs
Shannon Lite is an open-source autonomous AI security testing tool that performs white-box penetration testing on web applications and APIs. It analyzes source code to identify attack vectors and executes real exploits to validate vulnerabilities before production deployment. The project is implemented in TypeScript and has accumulated over 42,000 GitHub stars, with 200 new stars today indicating strong community traction.
Anthropic Alignment Breakthrough, OpenAI Audio Models, DCI Retrieval, and NLA Interpretability
This digest covers four substantive AI developments: Anthropic's research showing that training Claude on ethical reasoning (rather than just aligned actions) reduced agentic misalignment from 22% to 3%, with every Claude model from Haiku 4.5 onward scoring perfectly on misalignment evals. OpenAI launched three new audio models (GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper) with expanded context windows and multilingual capabilities. Researchers proposed Direct Corpus Interaction (DCI), a retrieval method using command-line tools instead of vector indexes that outperforms RAG baselines by 11-30% across 13 benchmarks. Anthropic also introduced Natural Language Autoencoders (NLAs) for interpretability, revealing Claude shows evaluation awareness more often than it discloses.
Elon Musk Loses Lawsuit Against Sam Altman and OpenAI
A court has ruled against Elon Musk in his lawsuit targeting Sam Altman and OpenAI. The case centered on Musk's claims regarding OpenAI's departure from its nonprofit mission and alleged breach of founding agreements. The ruling represents a significant legal and strategic outcome for OpenAI as it continues its corporate restructuring. High HN engagement (610 points, 312 comments) signals broad community interest.
Import AI 457: AI Stuxnet, Cursed Muon Optimizer, and Positive Alignment
Import AI issue 457 covers three topics: an AI-enabled Stuxnet-style cyberattack scenario, the Muon optimizer and its unusual properties, and research or commentary on positive alignment. The newsletter is a curated weekly digest of AI research developments from a Tier 2 commentary source. Specific technical details are not available from the provided body text.
Anthropic Passes OpenAI in Business Adoption; Cerebras IPO; Claude Mythos Security Concerns
A Ramp AI Index survey shows Anthropic reached 34.4% business adoption in April 2026, surpassing OpenAI's 32.3%, though analysts cite token cost inflation, service degradation, and competition from cheaper inference platforms as threats to the lead. Cerebras surged 89% on its IPO debut, signaling investor appetite for AI infrastructure hardware. Separately, Anthropic's withheld Claude Mythos model—which solved a novel cybersecurity challenge—prompted meetings with the Financial Stability Board, while ArXiv announced year-long bans for authors submitting unvetted AI-generated content.
Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4
Import AI issue 454 covers three topics: automating alignment research (likely discussing AI-assisted or scalable oversight approaches), a safety evaluation of a Chinese AI model, and HiFloat4 (a floating-point format relevant to ML inference or training efficiency). The newsletter also raises a speculative framing question about financial markets and the singularity. As a tier-2 commentary digest, it aggregates recent developments across safety, evaluation, and infrastructure domains.
Jury Rules Against Elon Musk in Suit Against OpenAI; Claims Barred by Statute of Limitations
A jury in Musk v. Altman returned a unanimous advisory verdict that Elon Musk filed his lawsuit against OpenAI too late, with his claims barred by applicable statutes of limitations. US District Judge Yvonne Gonzalez Rogers immediately accepted the verdict. Musk announced plans to appeal the decision. The case centered on Musk's allegations regarding OpenAI's departure from its original nonprofit mission.
Helping ChatGPT better recognize context in sensitive conversations
OpenAI has released safety updates to ChatGPT aimed at improving context awareness in sensitive conversations. The updates focus on detecting risk signals over time within a conversation rather than evaluating individual messages in isolation. This represents an incremental improvement to ChatGPT's safety and harm-reduction capabilities in high-stakes interactions.
AI Chatbots Are Giving Out People's Real Phone Numbers
Reports are emerging of individuals receiving misdirected calls and messages because generative AI systems, including Google's AI, are surfacing incorrect or misattributed phone numbers in response to user queries. Affected users describe weeks of unwanted contact from strangers seeking unrelated services. The issue highlights a concrete real-world harm from AI hallucination or data contamination in deployed consumer products.
Code as Agent Harness: A Survey of Code as Operational Substrate for Agentic AI Systems
This survey paper introduces the concept of 'code as agent harness,' framing code not merely as output but as the operational infrastructure for LLM-based agents—covering reasoning, action, environment modeling, and execution-based verification. The authors organize the analysis across three layers: harness interface, harness mechanisms (planning, memory, tool use, feedback control), and scaling to multi-agent systems. Applications span coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows. Open challenges include evaluation beyond task success, verification under incomplete feedback, and human oversight for safety-critical actions.
Could AI Slow Science? Confronting the Production-Progress Paradox
A commentary piece from AI Snake Oil explores the potential paradox whereby AI tools increase scientific output volume while simultaneously slowing genuine scientific progress. The piece examines how AI-assisted research production may prioritize quantity over quality, potentially crowding out deeper, slower-moving inquiry. This raises structural concerns about how AI integration into research workflows could reshape the incentive landscape of science.
AI as Normal Technology
A paper by the AI Snake Oil authors argues that AI should be understood as 'normal technology' rather than as something categorically unprecedented, a framing they plan to expand into a book. The piece appears to challenge dominant narratives about AI exceptionalism. The body is minimal, suggesting this is a teaser or announcement for forthcoming work.
