Anthropic launches initiative to fund third-party AI safety evaluations
Anthropic announced a funded initiative to source third-party evaluations measuring advanced AI capabilities and safety risks, with priority areas including cybersecurity, CBRN threats, model autonomy, national security risks, social manipulation, and misalignment. The initiative is tied to Anthropic's Responsible Scaling Policy and AI Safety Level (ASL) framework, aiming to address a gap between demand and supply of high-quality safety-relevant evals. Proposals are solicited via an application form, with Anthropic framing the effort as benefiting the broader AI safety ecosystem rather than just internal use.
Related events (8)
Anthropic submits AI accountability recommendations to NTIA, covering evals, red teaming, and pre-registration
Anthropic submitted a formal response to the NTIA's Request for Comment on AI Accountability, outlining a multi-part policy framework for governing advanced AI systems. Key recommendations include increased government funding for evaluation research, mandatory disclosure of evaluation methods, pre-registration of large training runs with national governments, mandated external red teaming before model release, and antitrust guidance to enable industry safety collaboration. The submission reflects Anthropic's core policy positions and advocates for risk-tiered oversight proportional to model capabilities.
Anthropic publishes foundational 'Core Views on AI Safety' position paper
Anthropic released a detailed position paper outlining its core views on AI safety, arguing that transformative AI could arrive within a decade driven by predictable scaling laws, and that no one currently knows how to train powerful AI systems to robustly behave well. The document explains Anthropic's founding rationale and research strategy, highlighting four priority areas: scaling supervision, mechanistic interpretability, process-oriented learning, and understanding AI generalization. Originally published in March 2023, the paper is Anthropic's canonical public statement of its safety philosophy and strategic priorities.
Anthropic publishes Responsible Scaling Policy with AI Safety Level framework
Anthropic released its Responsible Scaling Policy (RSP), a formal framework of technical and organizational protocols for managing catastrophic risks from increasingly capable AI systems. The policy introduces AI Safety Levels (ASL-1 through ASL-5+), modeled on US biosafety level standards, requiring progressively stricter safety, security, and operational standards as models become more capable. Current Claude models are classified as ASL-2; ASL-3 triggers stricter deployment constraints including adversarial red-teaming requirements. The policy has been approved by Anthropic's board and is intended as a template for industry-wide adoption.
Anthropic advocates for third-party testing regime as core AI policy infrastructure
Anthropic published a policy position paper arguing that frontier AI systems require a third-party testing and oversight regime, distinct from self-governance approaches like its own Responsible Scaling Policy. The post outlines what such a regime should include: trusted third-party auditors, precisely scoped tests targeting only the most computationally intensive systems, and international coordination via shared standards and Mutual Recognition agreements. Anthropic acknowledges that its RSP is insufficient on its own because it depends on individual private-sector actors, and calls for industry-wide mandatory testing that would eventually become a legal requirement for wide deployment.
Anthropic publishes structured harm assessment framework covering physical, psychological, economic, and societal impacts
Anthropic has released a policy document describing its evolving framework for assessing and mitigating AI harms across five dimensions: physical, psychological, economic, societal, and individual autonomy impacts. The framework complements the existing Responsible Scaling Policy and informs decisions on usage policies, red-teaming, detection, and enforcement. Concrete examples include safeguards for computer use capabilities (fraud, phishing) and a reported 45% reduction in unnecessary refusals in Claude 3.7 Sonnet through improved handling of ambiguous prompts. Anthropic frames this as a work in progress and invites collaboration from the broader AI ecosystem.
Anthropic raises $580M Series B to advance AI safety and interpretability research (2022)
Anthropic raised $580 million in a Series B round in April 2022, led by Sam Bankman-Fried of FTX, to fund large-scale infrastructure for AI safety research. The company, then about 40 people, outlined work on interpretability, steerability, and robustness of large language models. The round is historically notable both for Anthropic's early safety-focused mission and for the involvement of Bankman-Fried, who was later convicted of fraud in the FTX collapse.
Dario Amodei's AI Safety Summit remarks detail Anthropic's Responsible Scaling Policy and ASL framework
Dario Amodei delivered prepared remarks at the UK AI Safety Summit (November 2023) explaining Anthropic's Responsible Scaling Policy (RSP), which was the first such policy published by a major AI lab. The RSP introduces AI Safety Levels (ASL-1 through ASL-4), modeled on biosafety level frameworks, with capability thresholds triggering mandatory safeguards before further training or deployment. Key implementation lessons include deep executive involvement, integrating RSP requirements into product roadmaps, and formal accountability through Anthropic's board and Long-Term Benefit Trust. The remarks outline specific ASL-3 requirements around CBRN misuse prevention and security, and preview ASL-4 criteria involving near-human autonomy or becoming a primary source of global security threats.
Anthropic raises $124M Series A to build reliable, steerable AI systems
Anthropic announced a $124 million Series A round in May 2021, led by Jaan Tallinn with participation from Dustin Moskovitz, Eric Schmidt, and others. The company, founded by Dario and Daniela Amodei, planned to use the funding for computationally intensive research into large-scale AI systems that are steerable, interpretable, and robust. The round represents Anthropic's founding-era capital raise, establishing its research agenda around AI safety, interpretability, and human feedback integration.


