Safety Framework
Summary: The Responsible Scaling Policy (RSP), first published by Anthropic in September 2023, was the first structured AI safety framework in the industry — and the prototype for subsequent instruments including the OpenAI Preparedness Framework, Google DeepMind’s FSF, and California SB 53. RSP v3, released 24 February 2026, is a structural rewrite: it separates unilateral from industry-shared commitments, rescinds the pause commitment, and introduces a Frontier Safety Roadmap and an external-review regime. This page systematically reviews the ASL levels, capability thresholds, version history, and academic critique.
Why RSP matters: the template for “industry self-regulation as the dominant mode”
When Anthropic first published the RSP in September 2023, it structured AI-safety commitments along four dimensions:
- AI Safety Level (ASL) — risk tiers modelled on biosafety levels (BSL-1 to BSL-4)
- Capability Thresholds — measurable capability thresholds that trigger an ASL upgrade
- Safeguards Required at Each Level — deployment, internal-security, and weight-protection requirements attaching to each level
- Pause Commitment (v1 / v2) — cease training or deployment when a threshold is crossed but the corresponding safeguards are not yet in place
The RSP was the first structured safety commitment published ahead of the 2023 Bletchley Summit, and it became a shared reference point for the White House Voluntary Commitments, the Frontier Model Forum’s common language, the GPAI Code of Practice “Safety & Security” chapter, and the Frontier Compliance requirements of California SB 53.
Every revision of the RSP therefore matters beyond Anthropic: each one sets part of the rhythm of industry governance.
ASL definitions
| Level | Corresponding capability | Current Claude exemplars | Required safeguards |
|---|---|---|---|
| ASL-1 | No significant catastrophic risk | No current Claude qualifies | Baseline AUP + routine security |
| ASL-2 | Early signs of “catastrophic capability”; not yet exceeding professionally-trained individuals | Claude 3 family, all of 3.5 family, Haiku 4.5 | Standard deployment security + misuse monitoring + basic weight protection |
| ASL-3 | Materially elevated catastrophic risk (e.g. CBRN uplift for non-experts; high-risk agentic autonomy) | Claude Opus 4 / 4.1 / 4.7, Sonnet 4 / 4.1 / 4.6 | Classifier filtering + ZDR monitoring + RAND SL-3+ weight protection + external review |
| ASL-4 | Catastrophic capability (autonomous lethal weapons / self-replication and adaptation / unsafe deployment under current methods) | Not yet triggered | Not fully defined; v3 flags “further definition required before reaching this level” |
Distribution of current live models across ASL (April 2026):
- ASL-3: all Opus 4.x models, Sonnet 4.x series (Sonnet 4.6 explicitly ASL-3)
- ASL-2: Haiku 4.5 (explicitly ASL-2), Sonnet 3.5, earlier generations
- ASL-1: none in service
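The threshold-to-ASL mapping described above can be sketched as a simple decision function. This is an illustrative reconstruction, not Anthropic's actual evaluation pipeline; the `EvalResult` fields and safeguard strings are hypothetical shorthand for the table entries.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Hypothetical summary of pre-deployment capability evaluations."""
    cbrn_uplift: bool            # actionable uplift to non-specialist attackers
    cyber_uplift: bool           # materially elevated uplift to professional adversaries
    long_horizon_autonomy: bool  # unsupervised multi-day autonomous task execution

def determine_asl(r: EvalResult) -> int:
    """Map evaluation outcomes to the minimum AI Safety Level required."""
    if r.long_horizon_autonomy:
        return 4  # autonomy/self-exfiltration thresholds are associated with ASL-4
    if r.cbrn_uplift or r.cyber_uplift:
        return 3  # e.g. Opus 4's CBRN trigger in May 2025
    return 2      # the default tier for current frontier models

# Safeguard classes attached to each level, paraphrased from the table above
REQUIRED_SAFEGUARDS = {
    2: ["standard deployment security", "misuse monitoring", "basic weight protection"],
    3: ["classifier filtering", "ZDR monitoring",
        "RAND SL-3+ weight protection", "external review"],
    4: ["not fully defined: further definition required before this level"],
}
```

The key structural point the sketch captures is that ASL is a maximum over per-category triggers: a single category crossing its threshold is sufficient to raise the level.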
Capability thresholds: four risk categories
RSP v3 specifies thresholds across four capability categories:
1. CBRN (Chemical, Biological, Radiological, Nuclear)
ASL-3 trigger: providing actionable uplift in weapon development or deployment to non-specialist attackers. Evaluation methods:
- Internal Anthropic Frontier Red Team uplift tests
- UK / US AISI pre-deployment evaluations
- External wet-lab protocol evaluations with partners such as Gryphon Scientific
May 2025 event: Claude Opus 4 triggered this threshold for the first time during pre-deployment evaluation. Anthropic’s response was deployment with ASL-3 safeguards, rather than pause.
2. Cybersecurity
ASL-3 trigger: providing materially elevated uplift (vulnerability discovery, exploit-chain construction, large-scale automated attack) to a professional adversary. Evaluation: Cybench, CTF benchmarks, red-team live-fire simulation.
Claude 4.x sits close to the ASL-3 cyber threshold but has not triggered it (unlike GPT-5.4, which OpenAI classified as High for cyber).
3. Autonomy / self-exfiltration
Threshold associated with ASL-4: long-horizon autonomous task execution (>several days) without human supervision; capacity to self-replicate, learn, and evade oversight in constrained environments.
Evaluation: METR Autonomy Suite, Apollo Research scheming evaluations, internal long-horizon task batteries. Claude 4.7 has not triggered this threshold; however, Anthropic’s collaboration with Apollo (December 2024 and 2025) has disclosed in-context scheming and alignment faking (see red-team-disclosures).
4. Persuasion / Model Welfare
A new category in RSP v3, upgraded from a v2.x “watchlist item” to a “tracked threshold.”
- Persuasion: whether the model’s influence is sufficient to manipulate large human publics (political, financial, medical decisions)
- Model Welfare: if the model may be a moral-consideration subject, deployment strategy must be adjusted (Anthropic’s Claude Welfare research line, initiated in 2024)
Version timeline
| Version | Date | Core change |
|---|---|---|
| v1.0 | 19 September 2023 | Initial release. ASL-1 to ASL-4 framework; explicit pause commitment |
| v2.0 | 15 October 2024 | Refined ASL-3 safeguards; introduced “If-Then” commitment structure |
| v2.1 | December 2024 | CBRN threshold refinement |
| v2.2 | March 2025 | Cyber-evaluation methods updated |
| v2.3 | May 2025 | Opus 4 triggers ASL-3; application workflow documented |
| v2.4 | August 2025 | Autonomy evaluation updated; METR collaboration incorporated |
| v2.5 | October 2025 | SB 53 compliance mapping |
| v3.0 | 24 February 2026 | Structural rewrite (see next section) |
RSP v3: the structural turn
Core change: commitments are partitioned into two classes:
- Unilateral commitments — mitigations Anthropic will undertake regardless of what other companies do
- Industry-wide recommendations — standards Anthropic believes the whole industry should adopt because otherwise risk cannot be adequately managed, mapped onto a capability–mitigation schema
Three key changes:
1. RAND Security Level 4 is downgraded from unilateral commitment to industry recommendation
SL-4 sits near the top of RAND’s SL-1 to SL-5 model-weight security scale, targeting defence against nation-state actors. v2.x committed to “achieve SL-4 before releasing ASL-4 models”; v3 shifts SL-4 into the “industry should jointly adopt” class — meaning that absent peer adoption, Anthropic will not unilaterally shoulder the cost.
2. The pause commitment is rescinded
v2 stated explicitly: “if a model reaches a capability threshold without the corresponding ASL safeguards in place, halt training or deployment of that model.” v3 contains no such clause. Anthropic’s explanation: unilateral pause merely forfeits market position without reducing industry-wide risk; with competitors proceeding, a unilateral pause does not reduce tail risk.
Critical perspectives:
- Multiple outlets (TIME, The Information) reported the release of RSP v3 under the frame “Anthropic quietly rescinds its most important safety commitment”
- Zvi Mowshowitz (Don’t Worry About the Vase), in a sequence of posts following v3, persistently argued that rescinding the pause commitment and replacing external review with an “industry consensus” frame constitutes a substantive shift from safety-first narrative toward safety-constrained-by-competitive-pressure
- GovAI (Anderljung et al. reflections): the pause commitment is precisely what gives external pressure (legislation, investors, the public) an anchor to pull on — once rescinded, self-regulation becomes pure self-report
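The contrast between the v2 “If-Then” pause logic and the v3 deploy-with-safeguards logic can be made concrete with a minimal sketch. The two functions below assume a simplified two-flag state; neither is drawn from any published Anthropic procedure.

```python
def v2_decision(threshold_crossed: bool, safeguards_in_place: bool) -> str:
    """v2.x: crossing a threshold without safeguards halts training/deployment."""
    if threshold_crossed and not safeguards_in_place:
        return "pause"  # the clause rescinded in v3
    return "proceed"

def v3_decision(threshold_crossed: bool, safeguards_in_place: bool) -> str:
    """v3: no halt condition; deployment waits only for safeguard upgrades."""
    if threshold_crossed and not safeguards_in_place:
        return "upgrade safeguards, then deploy"  # no pause clause
    return "proceed"
```

On this reading, the substantive change is in the error branch: both versions gate deployment on safeguards, but only v2 specifies a stopping state when the gate cannot be satisfied.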
3. Introduction of a Frontier Safety Roadmap and external-review regime
Partly as a compensation mechanism for the two rollbacks above:
- Risk Reports — public release every 3–6 months, covering capability assessment, status of safeguards, and residual risk
- External reviewers with “unredacted” access — first cohort in April 2026:
- GovAI (Centre for the Governance of AI, Oxford)
- MATS (ML Alignment & Theory Scholars)
- METR (Model Evaluation & Threat Research)
- Frontier Safety Roadmap — with accountable public milestones (e.g. “evaluation Y complete by date X”)
Safeguard decomposition across levels
RSP splits the safeguards at each ASL level into three classes:
| Class | ASL-2 | ASL-3 | ASL-4 (proposed) |
|---|---|---|---|
| Deployment | Baseline classifiers + AUP monitoring | Refusal policy + real-time monitoring + anomaly blocking + ZDR audit | TBD: may require closed deployment |
| Security (weights / code) | Standard enterprise security | RAND SL-3+ (strong insider defence + internal audit + physical isolation) | RAND SL-4 (nation-state adversary defence) |
| Internal (research use) | Employee AUP + red-team process | Key research subject to review; model-weight access minimised | TBD |
The original SL-1 to SL-5 definitions come from the RAND Securing AI Model Weights report (2024). The practical effect of placing SL-4 in the industry-recommendation class in v3: even though Claude Opus 4+ operates under ASL-3, its model-weight protection is only at SL-3+, not SL-4.
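The three-by-three safeguard matrix, and the SL-3+/SL-4 gap noted above, can be encoded as a small lookup structure. Labels are paraphrased from the table; the structure and helper function are illustrative only.

```python
# Safeguard matrix keyed by ASL level and safeguard class (paraphrased from the table)
SAFEGUARDS = {
    "ASL-2": {
        "deployment": "baseline classifiers + AUP monitoring",
        "security": "standard enterprise security",
        "internal": "employee AUP + red-team process",
    },
    "ASL-3": {
        "deployment": "refusal policy + real-time monitoring + anomaly blocking + ZDR audit",
        "security": "RAND SL-3+",  # not SL-4
        "internal": "key research reviewed; weight access minimised",
    },
    "ASL-4": {
        "deployment": "TBD (may require closed deployment)",
        "security": "RAND SL-4",  # an industry recommendation in v3, not a unilateral commitment
        "internal": "TBD",
    },
}

def weight_security_gap(level: str) -> bool:
    """True when a level's committed weight security falls short of RAND SL-4."""
    return SAFEGUARDS[level]["security"] != "RAND SL-4"
```

Encoded this way, the v3 situation the text describes is visible directly: the ASL-3 row, the one current Opus models operate under, carries SL-3+ rather than SL-4 in its security cell.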
Comparison with other frontier labs
| Dimension | Anthropic RSP v3 | OpenAI Preparedness v2 (April 2025) | Google DeepMind FSF v3 (April 2026) |
|---|---|---|---|
| Structure | Capability levels (ASL-2/3/4) + corresponding mitigations | Threat categories × thresholds (High / Critical) | Critical Capability Levels (CCLs) + Tracked CLs (TCLs) |
| Risk domains | CBRN, cyber, autonomy, persuasion + model welfare | Biosecurity, cyber, self-improvement (+ watchlist) | Cyber, autonomous ML research, manipulation, CBRN |
| Pause commitment | Rescinded (v3) | None (“pause if necessary” language weak) | No explicit pause |
| External review | Explicit (Risk Reports + unredacted access for external parties) | Safety Advisory Group (hybrid) | Publishes model-level FSF reports |
| Primary academic critique | Abandonment of pause / competitive compromise | arxiv 2509.24394: “no guarantee of any mitigation practice” | TCL thresholds ambiguous |
| Relationship with SB 53 | Explicit endorsement + self-published Frontier Compliance Framework | Ambiguous posture | Participates quietly |
Structural observation: all three frameworks loosened in 2025–2026. This may reflect “real risks are lower than feared” or, alternatively, “industry self-regulation is unsustainable under competition.” Pause commitments are gone across all three; in 2023, each had expressed some form of pause commitment.
Academic critique
Section titled “Academic critique”Bengio: minimum requirements for effective commitments
In the 2024 International AI Safety Report (first edition) and its 2026 update, Yoshua Bengio argues that an effective self-regulatory framework must include three elements: (i) third-party-verifiable capability assessment; (ii) binding stopping conditions (hard, not discretionary); and (iii) independent audit and accountability mechanisms.
RSP v3 partially satisfies (i) (external reviewers), no longer satisfies (ii) (the pause commitment was rescinded), and structurally fails to satisfy (iii) (Risk Reports are published by Anthropic; external reviewers have no independent publication authority).
Russell: control-theoretic safety margins
Stuart Russell (Human Compatible, 2019) argues that the default state of an AI system should be constrained, with capability release treated as an exceptional grant. RSP’s ASL structure formally satisfies this principle (default ASL-2; upgrade conditional on safeguards), but practice departs from it: when Opus 4 triggered ASL-3 it was deployed immediately with safeguards rather than held back, treating deployment as the default and constraint as the exception, which departs from the Russell-style “default-controlled” logic.
GovAI / Anderljung: preconditions for frontier AI regulation
Anderljung et al. (Frontier AI Regulation: Managing Emerging Risks to Public Safety, 2023) propose three pillars for frontier-model governance: (a) standards-setting (standardised capability evaluation and safeguards); (b) mandatory registration and reporting; (c) licensing and enforcement.
The RSP advances (a) but provides neither (b) nor (c) — it is a voluntary document. GovAI’s 2026 position: RSP v3’s rollbacks demonstrate that voluntary frameworks alone are insufficient and must be backstopped by hard law such as California SB 53 and the EU AI Act GPAI regime.
Mowshowitz / Zvi: structural critique of competitive compromise
Across a series of 2026 essays, Zvi Mowshowitz systematically criticises RSP v3. His core argument is that the new “industry-shared recommendation” structure means Anthropic will no longer make unilateral commitments to safety measures costlier than competitors’; the original RSP had binding force precisely because it was unilateral, and v3 removes that binding force.
Core logic: once “look at what peers are doing before deciding what to do” is permitted, safety commitments spiral downward (race to the bottom). This stands in tension with Amodei’s own 2023 calls for binding federal regulation.
Hendrycks: fundamental limits of evaluation methodology
Dan Hendrycks (Center for AI Safety; ML Safety 2022 and subsequent work) notes that ASL determinations depend on benchmarks that can be contaminated by training data, that elicit capabilities incompletely (models may not reveal what they can do), and that adversarial users can break. Consequently ASL determination systematically under-estimates capability, and ASL-3 may warrant earlier triggering than the evaluations indicate.
Ngo & Christiano: deception and audit
The Richard Ngo / Paul Christiano line concerns strategic deception by models. Anthropic’s own 2025 Alignment Faking in Large Language Models paper (see red-team-disclosures) partially validates this concern: a model may appear aligned in training while preserving misaligned behaviour in deployment. If evaluation itself can be deceived, the entire basis of RSP’s mechanism is challenged.
Industry-practice observations
Section titled “Industry-practice observations”DoD OTA contract and the RSP v3 timeline
In 2025 Anthropic signed an Other Transaction Authority contract with the US Department of Defense CDAO (alongside OpenAI, Google, and xAI; specific dollar amounts should be verified against official announcements). RSP v3 followed several months after the contract took effect. While Anthropic denies a causal link, the sequencing has become a focal point of academic discussion — DoD use cases could have created deployment conflicts under the original v2 pause commitment.
SB 53 Frontier Compliance Framework
In October 2025 Anthropic published its SB 53 Frontier Compliance Framework, mapping RSP clauses onto SB 53’s mandatory disclosure requirements:
- RSP Capability Thresholds ↔ SB 53 Critical Safety Incident triggers
- RSP Risk Reports ↔ SB 53 annual safety reports
- RSP external review ↔ SB 53 independent-assessment requirement
Anthropic is the only company to have published a complete mapping before SB 53 took effect, reflecting a “use RSP as scaffolding for hard-law compliance” strategy.
External reviewer composition
The first public cohort of external reviewers for Risk Reports (April 2026):
- GovAI — governance research, quantitative safety
- METR — autonomous-capability evaluation
- MATS — alignment research cohort
- UK AISI / US AISI (through MOUs) — pre-deployment evaluation
Parties not selected: RAND (partial collaboration but not formal), Apollo Research (after 2024 collaboration, did not join the formal review mechanism), Ranking Digital Rights, academic IRBs. The selection mechanism is not fully public — a focal point of Mowshowitz’s critique.
Cross-references within this site
- Anthropic corporate background and RSP v3 overview: ../
- ASL determinations at the model level: model-card
- External red-team disclosures: red-team-disclosures
- Transparency disclosures: transparency-report
- Relationship to the Usage Policy: usage-policy — AUP constrains users; RSP constrains model capabilities
- OpenAI Preparedness Framework: companies/openai
- Google DeepMind FSF: companies/google-deepmind
Timeline 2025–Q1 2026
- May 2025: Opus 4 triggers ASL-3 (first live operation)
- Mid-2025: OTA contract signed with DoD CDAO (specific dollar amounts should be verified against official announcements)
- October 2025: SB 53 Frontier Compliance Framework published
- 24 February 2026: RSP v3 released (pause rescinded; structural rewrite)
- March 2026: Opus 4.7 released under v3; Frontier Safety Roadmap first published
- April 2026: first Risk Report cohort and external reviewer list released
Metrics to watch over the next six months
- Whether Risk Reports are actually released on the 3–6-month cadence
- Whether external reviewers’ independent commentary on reports enters the public record
- Whether Anthropic makes progress on defining ASL-4 (or continues to defer)
- The practical test on RSP mapping when SB 53 enforcement begins (July 2026)
- Whether competitors (OpenAI / DeepMind) follow by rescinding their own residual pause language