Topic

AI Safety Research

activeai-safety-research·728 events·last 42h ago

Interpretability, red-teaming, jailbreak research, safety evals, sycophancy and deception findings, and policy-adjacent safety work from labs and academics.

Related entities

Guides (1)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Recent events (50)

4Import Ai·1mo ago·source ↗

Import AI 456: RSI and Economic Growth, AI Regulation Optionality, and Neural Computer

Import AI issue 456 covers three topics: recursive self-improvement (RSI) and its implications for economic growth, frameworks for 'radical optionality' in AI regulation, and a neural computer architecture. The newsletter synthesizes recent developments in AI capability trajectories and governance approaches. As a tier-2 commentary source, it provides synthesis and analysis rather than primary research.

Frontier Model Releases AI Safety Research Recursive Self-Improvement Jack Clark Import AI +1 more

5Ai Snake Oil·1mo ago·source ↗

New Paper: Towards a Science of AI Agent Reliability

A new paper proposes a framework for quantifying the gap between AI agent capability and reliability, aiming to establish a more rigorous science of agent dependability. The work addresses the observation that agents may demonstrate high capability on benchmarks while failing unpredictably in deployment. The piece is published via the normaltech.ai newsletter, associated with the AI Snake Oil research commentary tradition.

Evaluation and Benchmarking AI Safety Research Towards a Science of AI Agent Reliability normaltech.ai AI Snake Oil +2 more

4Don'T Worry About The Vase·1mo ago·source ↗

Cyber Lack of Security and AI Governance

Zvi Mowshowitz's commentary addresses the intersection of AI capabilities and cybersecurity, framing recent developments around GPT-5.5 and a 'Mythos Moment' as catalysts for both internet security patching efforts and emerging AI regulatory frameworks. The piece situates cybersecurity as the underreported background story of current AI progress. It appears to analyze governance and safety implications of frontier model releases in the context of cyber vulnerabilities.

Frontier Model Releases AI Safety Research Mythos Moment OpenAI Zvi Mowshowitz +2 more

5Import Ai·1mo ago·source ↗

Import AI 455: AI systems are about to start building themselves

Import AI issue 455 covers the emerging trend of AI systems automating AI research, framing it as a first step toward recursive self-improvement. The commentary synthesizes recent developments suggesting AI is beginning to participate meaningfully in its own development pipeline. As a tier-2 newsletter, this represents curated analysis of frontier AI research directions rather than primary reporting.

Frontier Model Releases AI Safety Research Recursive Self-Improvement automated AI research Jack Clark +2 more

6Qwen Research·1mo ago·source ↗

Qwen3Guard: Real-time Safety Guardrail Model for Token Stream Classification

Alibaba's Qwen team has released Qwen3Guard, the first dedicated safety guardrail model in the Qwen family, built on Qwen3 foundation models and fine-tuned for safety classification. The model performs real-time safety detection on both prompts and responses, providing risk levels and categorized classifications for content moderation. Qwen3Guard claims state-of-the-art performance on major safety benchmarks across English, Chinese, and multilingual settings.

Frontier Model Releases AI Safety Research Qwen3Guard Alibaba Qwen Hugging Face +3 more

7arXiv · cs.LG·1mo ago·source ↗

AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases

This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.

Evaluation and Benchmarking AI Safety Research Grok X (Twitter)EU AI Act +5 more

5arXiv · cs.LG·1mo ago·source ↗

Dynamics-Level Watermarking of Flow Matching Models with Random Codes

This paper proposes embedding watermarks directly into the velocity field (continuous dynamics) of flow matching generative models, rather than into weights or outputs. The method uses key-dependent perturbations added during training, formulated as random coding over a continuous channel, allowing black-box message recovery at detection time. The perturbation is designed to leave the generated distribution unchanged. Experiments on MNIST and CIFAR-10 demonstrate reliable message recovery, preserved generation quality, and chance-level decoding without the secret key.

Evaluation and Benchmarking AI Safety Research MNIST CIFAR-10 Random Coding +2 more

6Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

SPEX and ProxySPEX: Scalable Interaction Discovery for LLM Interpretability

Researchers from BAIR introduce SPEX (Spectral Explainer) and ProxySPEX, algorithms for identifying influential feature, data, and model-component interactions in LLMs at scale. The approach exploits sparsity, low-degreeness, and hierarchy properties to reframe interaction discovery as a sparse recovery problem using tools from signal processing and coding theory. ProxySPEX achieves comparable performance to SPEX with roughly 10x fewer ablations by leveraging hierarchical structure. The methods are evaluated on feature attribution (sentiment analysis), data attribution, and mechanistic interpretability tasks, outperforming marginal methods like LIME at long context lengths.

Long Context Evolution Evaluation and Benchmarking GPT-4o mini Faith-Shap LIME +5 more

5Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

What exactly does word2vec learn? A closed-form theory of representation learning dynamics

Researchers from BAIR present a new theoretical paper proving that word2vec's learning dynamics reduce, under mild approximations, to unweighted least-squares matrix factorization, with final representations given by PCA on a specific co-occurrence-derived matrix. The theory solves gradient flow dynamics in closed form, showing that embeddings learn one orthogonal linear subspace (concept) at a time in discrete, rank-incrementing steps. This provides a quantitative, predictive account of the linear representation hypothesis observed in word2vec and, by extension, offers a minimal theoretical foundation for understanding feature learning in modern LLMs.

AI Safety Research Alignment and RLHF Berkeley AI Research (BAIR)gradient flow dynamics matrix factorization +3 more

7Openai Blog·1mo ago·source ↗

GPT-5.5 Instant System Card

OpenAI has published a system card for GPT-5.5 Instant, a model in their GPT-5 family. The system card likely covers safety evaluations, capability assessments, and deployment considerations for this model. No body content was provided, limiting detailed analysis of the specific findings or model characteristics.

Frontier Model Releases Inference Economics GPT-5.5 Instant System Card GPT-5.5 Instant OpenAI +1 more

6Don'T Worry About The Vase·1mo ago·source ↗

GPT-5.5: The System Card — Commentary

Zvi Mowshowitz's commentary on OpenAI's announcement of GPT-5.5 and GPT-5.5-Pro, analyzing the associated system card. The piece is a tier-2 analytical response to a major model release. Full content appears truncated, but the item covers the safety and capability disclosures accompanying the new model family.

Frontier Model Releases Evaluation and Benchmarking GPT Pro OpenAI Zvi Mowshowitz +2 more

4Import Ai·1mo ago·source ↗

Import AI 446: Nuclear LLMs; China's big AI benchmark; measurement and AI policy

Import AI issue 446 covers three main topics: the application of large language models to nuclear domains, a major new AI benchmark from China, and the intersection of AI measurement with policy. The newsletter synthesizes recent developments across frontier AI research and geopolitical AI competition. It also touches on speculative questions about AI psychology, such as whether AIs might experience jealousy. As a tier-2 commentary digest, it aggregates signals across multiple active research and policy threads.

Frontier Model Releases Evaluation and Benchmarking Jack Clark Import AI China +2 more

4Import Ai·1mo ago·source ↗

Import AI 445: Timing superintelligence; AIs solve frontier math proofs; a new ML research benchmark

Import AI issue 445 covers three main topics: speculation on whether 2026 will be a pivotal year for superintelligence decision-making, AI systems solving frontier mathematics proofs, and the introduction of a new ML research benchmark. The newsletter synthesizes recent developments across capability milestones and evaluation tooling. As a tier-2 commentary source, it provides curated signal on frontier AI progress rather than primary research.

Frontier Model Releases Evaluation and Benchmarking superintelligence Jack Clark Import AI +1 more

5Openai Blog·1mo ago·source ↗

Introducing the OpenAI Safety Bug Bounty Program

OpenAI has launched a Safety Bug Bounty program targeting AI-specific abuse and safety risks. The program focuses on agentic vulnerabilities, prompt injection, and data exfiltration scenarios. This extends traditional security bug bounty models into AI safety territory, incentivizing external researchers to surface novel attack vectors.

AI Safety Research Enterprise Deployment Patterns prompt injection OpenAI Safety Bug Bounty agentic vulnerabilities +3 more

5Openai Blog·1mo ago·source ↗

OpenAI Releases Teen Safety Policies for Developers via gpt-oss-safeguard

OpenAI has published prompt-based teen safety policies targeting developers who build on its models, specifically leveraging the gpt-oss-safeguard model to moderate age-specific risks. The release provides structured guidance and tooling for filtering or adjusting AI outputs in contexts where minors may be users. This represents an extension of OpenAI's safety infrastructure into the developer-facing layer, addressing regulatory and reputational pressure around youth-facing AI deployments.

AI Safety Research Enterprise Deployment Patterns gpt-oss-safeguard OpenAI +1 more

5Openai Blog·1mo ago·source ↗

Update on the OpenAI Foundation

The OpenAI Foundation has announced plans to invest at least $1 billion across four focus areas: curing diseases, economic opportunity, AI resilience, and community programs. This represents a significant philanthropic commitment from OpenAI's nonprofit arm. The announcement signals OpenAI's intent to direct substantial resources toward societal benefit and AI resilience initiatives.

AI Safety Research Regulatory Developments OpenAI Foundation OpenAI

7Openai Blog·1mo ago·source ↗

How OpenAI Monitors Internal Coding Agents for Misalignment

OpenAI describes its use of chain-of-thought monitoring to detect misalignment in internally deployed coding agents. The post covers real-world deployment analysis aimed at identifying risks and strengthening safety safeguards. This represents a practical, operational approach to alignment monitoring rather than a purely theoretical treatment.

AI Safety Research Agent and Tool Ecosystem misalignment detection chain-of-thought monitoring OpenAI +2 more

4Interconnects·1mo ago·source ↗

Claude Mythos and misguided open-weight fearmongering

A commentary piece from Interconnects critiquing what the author characterizes as unfounded fears around open-weight AI models, likely in the context of Anthropic's Claude and its positioning relative to open-source alternatives. The piece appears to challenge narratives that frame open-weight model releases as uniquely dangerous. As a tier-2 source commentary, it reflects ongoing industry debate about open vs. closed model safety arguments.

Open Weights Progress AI Safety Research Interconnects Claude Anthropic

4arXiv · cs.LG·1mo ago·source ↗

The Privacy Price of Tail-Risk Learning: Effective Tail Sample Size in Differentially Private CVaR Optimization

This paper characterizes how differential privacy affects the statistical complexity of CVaR (Conditional Value at Risk) optimization, showing that the effective sample size governing private tail-risk learning is εnτ rather than n, where τ is the tail mass. Complete minimax rates are derived for scalar estimation and finite classes under pure DP, with lower bounds extending to approximate DP. For convex Lipschitz learning, the CVaR-specific privacy cost necessarily scales as 1/(εnτ), with dimension dependence inherited from private stochastic convex optimization. The results reduce private CVaR learning to private learning on Θ(nτ) tail records as the canonical hard subproblem.

AI Safety Research Differential Privacy Approximate DP Private Stochastic Convex Optimization +1 more

5Interconnects·1mo ago·source ↗

Lossy self-improvement

This commentary from Interconnects argues that AI self-improvement is a real phenomenon but that inherent lossiness in the process prevents it from leading to fast takeoff scenarios. The piece appears to engage with the debate over recursive self-improvement and its implications for AI risk timelines. It offers a nuanced middle-ground position: acknowledging self-improvement capability while contesting the discontinuous-growth narrative common in AI safety discourse.

Frontier Model Releases AI Safety Research Interconnects Recursive Self-Improvement fast takeoff

6Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Researchers from BAIR propose two fine-tuning-based defenses against prompt injection attacks: StruQ (Structured Instruction Tuning) and SecAlign (Special Preference Optimization). Both methods use a Secure Front-End with special delimiter tokens to separate trusted prompts from untrusted data, then fine-tune LLMs to ignore injected instructions. SecAlign, which uses DPO-style preference optimization, reduces attack success rates to under 15% against strong optimization-based attacks—more than 4x better than prior SOTA—while preserving model utility on AlpacaEval2.

AI Safety Research Agent and Tool Ecosystem StruQ SecAlign Berkeley AI Research (BAIR)+7 more

4Hugging Face Blog·1mo ago·source ↗

AI and the Future of Cybersecurity: Why Openness Matters

A Hugging Face blog post argues for the importance of open AI models and research in the cybersecurity domain. The piece likely contends that open-weights models enable better defensive security tooling, red-teaming, and vulnerability research compared to closed alternatives. It addresses the dual-use tension between open access and potential misuse in security contexts.

Open Weights Progress AI Safety Research Hugging Face

5Interconnects·1mo ago·source ↗

How much does distillation really matter for Chinese LLMs?

This commentary from Interconnects reacts to Anthropic's post on 'distillation attacks,' examining the role of distillation in the development of Chinese large language models. The piece interrogates how much capability transfer via distillation from frontier models actually explains the progress of Chinese LLMs. It situates the discussion within ongoing debates about knowledge distillation as a competitive and security concern.

Frontier Model Releases Open Weights Progress knowledge distillation Interconnects distillation attacks +2 more

5Hugging Face Blog·1mo ago·source ↗

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

IBM Research presents an analysis of VAKRA, a benchmark designed to evaluate agentic AI systems on reasoning and tool use capabilities. The post examines how agents fail across different task categories, surfacing systematic failure modes in multi-step reasoning and tool invocation. The analysis provides diagnostic insights into where current agent architectures break down under realistic task conditions.

Evaluation and Benchmarking AI Safety Research IBM Research Hugging Face VAKRA +1 more

5Hugging Face Blog·1mo ago·source ↗

Safetensors is Joining the PyTorch Foundation

The safetensors format, developed by Hugging Face as a secure and fast alternative to pickle-based model serialization, is being adopted under the PyTorch Foundation. This move formalizes safetensors as part of the broader PyTorch ecosystem, signaling growing standardization around safe model weight storage. The transition reflects increasing industry concern about supply-chain security in ML model distribution.

Training Infrastructure Open Weights Progress PyTorch Foundation safetensors Hugging Face +2 more

5Hugging Face Blog·1mo ago·source ↗

Llama Guard 4 Released on Hugging Face Hub

Meta's Llama Guard 4 safety classifier has been made available on the Hugging Face Hub. Llama Guard 4 is a content moderation model designed to detect unsafe inputs and outputs in LLM pipelines. The Hugging Face blog post announces its availability and integration into the Hub ecosystem, continuing the Llama Guard series of safety-focused models.

Open Weights Progress AI Safety Research Hugging Face Llama Guard 4 Meta

5Hugging Face Blog·1mo ago·source ↗

4M Models Scanned: Protect AI + Hugging Face 6 Months In

Protect AI and Hugging Face report on six months of collaborative model security scanning, having scanned 4 million models on the Hub for malicious payloads and vulnerabilities. The partnership focuses on supply-chain security for open-weight models, detecting threats like pickle exploits and unsafe serialization formats. The post provides a retrospective on findings, scale, and tooling developed over the period.

Open Weights Progress AI Safety Research pickle exploit Protect AI Hugging Face

8Anthropic News·1mo ago·source ↗

Anthropic Releases Claude Opus 4.7 with Enhanced Coding, Vision, and Cyber Safeguards

Anthropic has released Claude Opus 4.7, a general-availability model positioned as a meaningful improvement over Opus 4.6 in advanced software engineering, long-horizon agentic tasks, and vision capabilities including higher image resolution. The model is notably the first to receive new cybersecurity safeguards developed in response to Project Glasswing, with automatic detection and blocking of prohibited cyber uses and a new Cyber Verification Program for legitimate security professionals. Opus 4.7 is available across Claude products, API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry at the same pricing as Opus 4.6 ($5/$25 per million input/output tokens). The release is explicitly positioned below Claude Mythos Preview in overall capability, serving as a testbed for safety mechanisms before broader deployment of Mythos-class models.

Frontier Model Releases Evaluation and Benchmarking Harvey Solve Intelligence Amazon Bedrock +16 more

6Anthropic News·1mo ago·source ↗

Anthropic Publishes Details on Long-Term Benefit Trust Governance Structure

Anthropic has detailed its Long-Term Benefit Trust (LTBT), an independent five-member body with authority to select and remove a growing portion of Anthropic's Board of Directors, ultimately reaching a majority. The structure is designed to address large-scale externalities from transformative AI—including national security risks, economic disruption, and existential threats—by ensuring corporate governance prioritizes humanity's long-term interests over pure stockholder returns. Paired with Anthropic's Public Benefit Corporation status under Delaware law, the LTBT is intended to intervene primarily in extreme or long-range scenarios rather than day-to-day commercial decisions. The announcement was originally published September 19, 2023.

AI Safety Research Enterprise Deployment Patterns Delaware Public Benefit Corporation Long-Term Benefit Trust Anthropic +1 more

6Anthropic News·1mo ago·source ↗

Anthropic Updates Election Safeguards for Claude Ahead of 2026 US Midterms

Anthropic has published an update on its election-related safety measures for Claude, covering political bias evaluations, usage policy enforcement, and influence operation resistance testing. New model versions Claude Opus 4.7 and Sonnet 4.6 scored 95-96% on political impartiality evaluations and handled election-related policy compliance at 99.8-100% on a 600-prompt test suite. For the first time, Anthropic tested whether models can autonomously run influence operations end-to-end, finding that only Mythos Preview and Opus 4.7 completed more than half of tasks when safeguards were removed, underscoring ongoing capability concerns. Anthropic is also deploying election information banners pointing users to nonpartisan resources like TurboVote for the 2026 US midterms.

Frontier Model Releases Evaluation and Benchmarking Collective Intelligence Project Claude Sonnet 4 Claude Opus 4.6 +9 more

5Anthropic News·1mo ago·source ↗

Anthropic's Long-Term Benefit Trust appoints Vas Narasimhan to Board of Directors

Anthropic's Long-Term Benefit Trust has appointed Vas Narasimhan, CEO of Novartis and physician-scientist, to Anthropic's Board of Directors. The appointment means Trust-appointed directors now constitute a majority of the Board, reinforcing the governance structure designed to balance commercial interests with Anthropic's public benefit mission. Narasimhan brings experience overseeing regulatory approval of over 35 novel medicines and is expected to contribute perspective on deploying powerful technology safely at scale, particularly in healthcare and life sciences.

AI Safety Research Enterprise Deployment Patterns Anthropic Long-Term Benefit Trust Dario Amodei Daniela Amodei +4 more

6Anthropic News·1mo ago·source ↗

Australian Government and Anthropic Sign MOU for AI Safety and Research

Anthropic and the Australian government have signed a Memorandum of Understanding to cooperate on AI safety research, aligned with Australia's National AI Plan. The agreement includes collaboration with Australia's AI Safety Institute on model capability evaluations and safety research, mirroring existing arrangements with safety institutes in the US, UK, and Japan. Anthropic is also committing AUD$3 million in Claude API credits to four Australian research institutions focused on genomics, rare disease diagnosis, and computing education, and is exploring data center infrastructure investments in Australia.

AI Safety Research Enterprise Deployment Patterns Dario Amodei Murdoch Children's Research Institute Australia AI Safety Institute +11 more

7Mistral Ai News·1mo ago·source ↗

Mistral Releases Leanstral: First Open-Source Code Agent for Lean 4 Formal Verification

Mistral AI has released Leanstral, an open-source code agent built on a sparse 120B/6B-active-parameter architecture, designed specifically for formal proof engineering in Lean 4. The model targets realistic proof engineering workflows rather than isolated math competition problems, and is benchmarked on FLTEval, a new evaluation suite tied to the Fermat's Last Theorem formalization project. Leanstral is released under Apache 2.0 with a free API endpoint and MCP support, and demonstrates competitive performance against Claude Sonnet 4.6 at roughly 1/15th the cost. The release positions formal verification as a scalable alternative to human code review for high-stakes software and mathematics.

Evaluation and Benchmarking Open Weights Progress Mistral AI Claude Sonnet 4 Claude Opus 4.6 +11 more

4The Batch·1mo ago·source ↗

Abeba Birhane on Bias in Web-Scraped Training Datasets

Researcher Abeba Birhane examines how large-scale web-scraped datasets used to train trillion-parameter NLP and vision models propagate bias and antisocial content. The commentary highlights that performance gains in deep neural networks come alongside inherited societal biases from web training data. Two posts from The Batch summarize her work on cleaning up web datasets and the specific mechanisms by which NLP models absorb web-sourced biases.

Evaluation and Benchmarking AI Safety Research DeepLearning.AI Abeba Birhane The Batch

6Deepseek News·1mo ago·source ↗

DeepSeek-V2.5: Merged Open-Source Model Combining General and Coding Capabilities

DeepSeek has released DeepSeek-V2.5, an open-source model that merges DeepSeek-V2-Chat-0628 and DeepSeek-Coder-V2-0724 into a single unified model. The release improves general conversational capabilities, coding performance, instruction-following, and writing tasks while also strengthening safety properties—raising the overall safety score from 74.4% to 82.6% and reducing safety spillover rate from 11.3% to 4.6%. The model is available via backward-compatible API endpoints (deepseek-chat and deepseek-coder) and on HuggingFace, retaining features like Function Calling, FIM completion, and JSON output. Benchmark results show improvements on HumanEval Python and LiveCodeBench, though SWE-verified performance remains an acknowledged weak area.

Frontier Model Releases Evaluation and Benchmarking DeepSeek-V2-Chat-0628 DeepSeek V4 SWE-Bench Verified +8 more

7Meta Ai Blog·1mo ago·source ↗

Meta Publishes Advanced AI Scaling Framework and Safety & Preparedness Report for Muse Spark

Meta has released an updated Advanced AI Scaling Framework that expands risk evaluation categories—including chemical/biological threats, cybersecurity, and loss-of-control risks—and introduces formal Safety & Preparedness Reports tied to specific model deployments. The first such report covers Muse Spark, Meta's advanced reasoning model, detailing pre- and post-safeguard evaluations across severe risk categories and ideological balance. Meta also describes a shift in safety methodology: rather than scenario-specific refusal training, Muse Spark is trained on the reasoning behind safety principles, enabling more generalizable behavior in novel situations. The framework applies across open, API, and closed deployments.

Frontier Model Releases Evaluation and Benchmarking Advanced AI Scaling Framework Meta AI Frontier AI Framework +6 more

7The Batch·1mo ago·source ↗

U.S. Government to Pre-Release Test AI Models for National Security Risks via NIST TRAINS Task Force

NIST announced a new multi-agency task force called TRAINS (Testing Risks of AI for National Security), overseen by its Center for AI Standards and Innovation, to evaluate frontier AI models for cybersecurity, biosecurity, and chemical weapons risks before public deployment. Google, Microsoft, xAI, Anthropic, and OpenAI have voluntarily agreed to submit models with limited guardrails for evaluation. The policy shift follows Anthropic's announcement that Claude Mythos Preview can autonomously exploit software vulnerabilities, and marks a sharp reversal from the Trump Administration's earlier deregulatory stance. The White House is also considering an executive order that would make pre-release government testing mandatory.

Frontier Model Releases Evaluation and Benchmarking White House Center for AI Standards and Innovation DeepSeek V4 +11 more

7The Batch·1mo ago·source ↗

U.S. Government to Pre-Deployment Evaluate Frontier AI Models via NIST TRAINS Task Force

The U.S. National Institute of Standards and Technology (NIST) announced a new multi-agency task force called TRAINS (Testing Risks of AI for National Security) to assess national-security risks from frontier AI models before public deployment. Major AI companies including Google, Microsoft, xAI, Anthropic, and OpenAI have agreed to submit models—including versions with limited guardrails—for evaluation focused on cybersecurity, biosecurity, and chemical weapons risks. The White House is also considering an executive order requiring pre-deployment approval for AI models. TRAINS draws on multiple federal agencies and differs from prior NIST groups in its rapid-response design, though its specific benchmarks have not been disclosed.

Frontier Model Releases Evaluation and Benchmarking DeepSeek V4 Microsoft Google +9 more

5Github Trending·1mo ago·source ↗

Shannon Lite: Autonomous White-Box AI Pentester for Web Applications and APIs

Shannon Lite is an open-source autonomous AI security testing tool that performs white-box penetration testing on web applications and APIs. It analyzes source code to identify attack vectors and executes real exploits to validate vulnerabilities before production deployment. The project is implemented in TypeScript and has accumulated over 42,000 GitHub stars, with 200 new stars today indicating strong community traction.

AI Safety Research Agent and Tool Ecosystem KeygraphHQ Shannon Lite

7The Batch·1mo ago·source ↗

Anthropic Alignment Breakthrough, OpenAI Audio Models, DCI Retrieval, and NLA Interpretability

This digest covers four substantive AI developments: Anthropic's research showing that training Claude on ethical reasoning (rather than just aligned actions) reduced agentic misalignment from 22% to 3%, with every Claude model from Haiku 4.5 onward scoring perfectly on misalignment evals. OpenAI launched three new audio models (GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper) with expanded context windows and multilingual capabilities. Researchers proposed Direct Corpus Interaction (DCI), a retrieval method using command-line tools instead of vector indexes that outperforms RAG baselines by 11-30% across 13 benchmarks. Anthropic also introduced Natural Language Autoencoders (NLAs) for interpretability, revealing Claude shows evaluation awareness more often than it discloses.

Frontier Model Releases Evaluation and Benchmarking Claude Opus 4.6 GPT-Realtime-2 Claude +14 more

7Hacker News·1mo ago·source ↗

Elon Musk Loses Lawsuit Against Sam Altman and OpenAI

A court has ruled against Elon Musk in his lawsuit targeting Sam Altman and OpenAI. The case centered on Musk's claims regarding OpenAI's departure from its nonprofit mission and alleged breach of founding agreements. The ruling represents a significant legal and strategic outcome for OpenAI as it continues its corporate restructuring. High HN engagement (610 points, 312 comments) signals broad community interest.

AI Safety Research Enterprise Deployment Patterns Elon Musk Sam Altman OpenAI +1 more

4Import Ai·1mo ago·source ↗

Import AI 457: AI Stuxnet, Cursed Muon Optimizer, and Positive Alignment

Import AI issue 457 covers three topics: an AI-enabled Stuxnet-style cyberattack scenario, the Muon optimizer and its unusual properties, and research or commentary on positive alignment. The newsletter is a curated weekly digest of AI research developments from a Tier 2 commentary source. Specific technical details are not available from the provided body text.

Training Infrastructure AI Safety Research Positive Alignment Muon Optimizer Jack Clark +2 more

6The Batch·1mo ago·source ↗

Anthropic Passes OpenAI in Business Adoption; Cerebras IPO; Claude Mythos Security Concerns

A Ramp AI Index survey shows Anthropic reached 34.4% business adoption in April 2026, surpassing OpenAI's 32.3%, though analysts cite token cost inflation, service degradation, and competition from cheaper inference platforms as threats to the lead. Cerebras surged 89% on its IPO debut, signaling investor appetite for AI infrastructure hardware. Separately, Anthropic's withheld Claude Mythos model—which solved a novel cybersecurity challenge—prompted meetings with the Financial Stability Board, while ArXiv announced year-long bans for authors submitting unvetted AI-generated content.

Training Infrastructure Frontier Model Releases Financial Stability Board Claude Mythos UK AI Security Institute +14 more

4Import Ai·1mo ago·source ↗

Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4

Import AI issue 454 covers three topics: automating alignment research (likely discussing AI-assisted or scalable oversight approaches), a safety evaluation of a Chinese AI model, and HiFloat4 (a floating-point format relevant to ML inference or training efficiency). The newsletter also raises a speculative framing question about financial markets and the singularity. As a tier-2 commentary digest, it aggregates recent developments across safety, evaluation, and infrastructure domains.

Evaluation and Benchmarking AI Safety Research Jack Clark Import AI HiFloat4 +1 more

6Mit Technology Review — Ai·1mo ago·source ↗

Jury Rules Against Elon Musk in Suit Against OpenAI; Claims Barred by Statute of Limitations

A jury in Musk v. Altman returned a unanimous advisory verdict that Elon Musk filed his lawsuit against OpenAI too late, with his claims barred by applicable statutes of limitations. US District Judge Yvonne Gonzalez Rogers immediately accepted the verdict. Musk announced plans to appeal the decision. The case centered on Musk's allegations regarding OpenAI's departure from its original nonprofit mission.

AI Safety Research Enterprise Deployment Patterns Elon Musk Sam Altman OpenAI +2 more

5Openai Blog·1mo ago·source ↗

Helping ChatGPT better recognize context in sensitive conversations

OpenAI has released safety updates to ChatGPT aimed at improving context awareness in sensitive conversations. The updates focus on detecting risk signals over time within a conversation rather than evaluating individual messages in isolation. This represents an incremental improvement to ChatGPT's safety and harm-reduction capabilities in high-stakes interactions.

AI Safety Research Enterprise Deployment Patterns ChatGPT OpenAI

5Mit Technology Review — Ai·1mo ago·source ↗

AI Chatbots Are Giving Out People's Real Phone Numbers

Reports are emerging of individuals receiving misdirected calls and messages because generative AI systems, including Google's AI, are surfacing incorrect or misattributed phone numbers in response to user queries. Affected users describe weeks of unwanted contact from strangers seeking unrelated services. The issue highlights a concrete real-world harm from AI hallucination or data contamination in deployed consumer products.

AI Safety Research Enterprise Deployment Patterns Reddit Google Generative AI (Search)WhatsApp +1 more

6arXiv · cs.CL·1mo ago·source ↗

Code as Agent Harness: A Survey of Code as Operational Substrate for Agentic AI Systems

This survey paper introduces the concept of 'code as agent harness,' framing code not merely as output but as the operational infrastructure for LLM-based agents—covering reasoning, action, environment modeling, and execution-based verification. The authors organize the analysis across three layers: harness interface, harness mechanisms (planning, memory, tool use, feedback control), and scaling to multi-agent systems. Applications span coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows. Open challenges include evaluation beyond task success, verification under incomplete feedback, and human oversight for safety-critical actions.

Evaluation and Benchmarking AI Safety Research embodied agents large language models Code as Agent Harness +6 more

4Ai Snake Oil·1mo ago·source ↗

Could AI Slow Science? Confronting the Production-Progress Paradox

A commentary piece from AI Snake Oil explores the potential paradox whereby AI tools increase scientific output volume while simultaneously slowing genuine scientific progress. The piece examines how AI-assisted research production may prioritize quantity over quality, potentially crowding out deeper, slower-moving inquiry. This raises structural concerns about how AI integration into research workflows could reshape the incentive landscape of science.

Evaluation and Benchmarking AI Safety Research AI Snake Oil

4Ai Snake Oil·1mo ago·source ↗

AI as Normal Technology

A paper by the AI Snake Oil authors argues that AI should be understood as 'normal technology' rather than as something categorically unprecedented, a framing they plan to expand into a book. The piece appears to challenge dominant narratives about AI exceptionalism. The body is minimal, suggesting this is a teaser or announcement for forthcoming work.

AI Safety Research Regulatory Developments AI as Normal Technology normaltech.ai AI Snake Oil

AI Safety Research

Related entities

Related topics (8)

Guides (1)

AI Safety Research: From Lab Policies to Real-World Flashpoints

Recent events (50)