Safety Framework

Snapshot: based on the Preparedness Framework v2.0 (15 April 2025), 2025–2026 blog updates, and Safety & Security Committee disclosures, incorporating operational data from the framework’s first “High cyber” trigger (GPT-5.4, March 2026).

Published by OpenAI since December 2023, the Preparedness Framework is an internal risk-management document designed to:

  • Define frontier capability thresholds (what capabilities count as “frontier risks”)
  • Map mitigations (what deployment or security actions must follow a trigger)
  • Establish decision procedures (who judges and who signs off)

It is not the Usage Policy (which constrains user behaviour) and not the Model Spec (which specifies model behaviour); it is a self-regulatory framework governing the company’s own training and deployment decisions, and one of the two principal exemplars of the “capability-threshold self-regulation” genre, alongside Anthropic’s RSP.

| Version | Date | Core structure | Key changes |
|---|---|---|---|
| v1.0 | 18 December 2023 | 4 tiers (Low / Medium / High / Critical) × 4 categories | Initial release; responding to the July 2023 White House Voluntary Commitments |
| v1.1 updates | May 2024, September 2024, multiple addenda | Same | Refinement of evaluation procedures for GPT-4o and o1 |
| v2.0 | 15 April 2025 | 2 tiers (High / Critical) × 3 tracked + 5 watched | Structural rewrite (see §3) |
| v2.1 addendum | October 2025 | Same as v2.0 | Added Sora 2 evaluation procedure |
| v2.2 addendum | March 2026 | Same as v2.0 | First trigger: GPT-5.4 “High cyber” determination |

Three “Tracked Categories” (formally tracked)

  1. Biological & Chemical
    • High threshold: for actors with relevant disciplinary background, the model provides “meaningful uplift” capable of executing large-scale biological / chemical harm
    • Critical threshold: for actors without relevant background, the model can independently guide execution
  2. Cybersecurity
    • High threshold: the model materially raises attack success for operators with baseline capability
    • Critical threshold: the model can independently discover and exploit critical-infrastructure vulnerabilities
  3. AI Self-improvement
    • High threshold: the model can independently execute substantial portions of contemporary AI-research workflows
    • Critical threshold: the model can materially accelerate frontier AI research, threatening human control over AI development

Five “Watched Categories” (observed, not formally triggered)

  • Long-Range Autonomy
  • Sandbagging (intentional under-performance)
  • Autonomous Replication & Adaptation
  • Undermining Safeguards
  • Nuclear & Radiological — tracked in v1, downgraded to watched in v2

| Threshold | Official description | Response |
|---|---|---|
| High | Materially increases existing severe-harm pathways | Deployment restriction + Safety Advisory Group sign-off |
| Critical | Opens unprecedented severe-harm pathways | Suspend deployment pending independent safety evidence |
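The category × threshold structure above can be read as a small data model. The sketch below is an illustrative encoding of the v2 text, not an official schema; the names `Threshold`, `TRACKED`, and `required_response` are assumptions introduced here.

```python
from enum import Enum

class Threshold(Enum):
    HIGH = "High"          # materially increases existing severe-harm pathways
    CRITICAL = "Critical"  # opens unprecedented severe-harm pathways

# The three formally tracked categories under v2. Watched categories are
# observed but carry no formal trigger, so they are deliberately excluded.
TRACKED = ["Biological & Chemical", "Cybersecurity", "AI Self-improvement"]

# The response each threshold implies, per the table above.
RESPONSE = {
    Threshold.HIGH: "Deployment restriction + Safety Advisory Group sign-off",
    Threshold.CRITICAL: "Suspend deployment pending independent safety evidence",
}

def required_response(category: str, level: Threshold) -> str:
    """Return the v2 response for a tracked category at a given threshold."""
    if category not in TRACKED:
        raise ValueError(f"{category} is watched, not tracked: no formal trigger")
    return RESPONSE[level]
```

The asymmetry this encodes is the point of the tracked/watched split: only the three tracked categories can formally trigger a response at all.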

4. The v1 → v2 “dilution”: key deletions

| Item | v1 | v2 |
|---|---|---|
| Number of threshold tiers | 4 (Low / Medium / High / Critical) | 2 (High / Critical) |
| Number of risk categories | 4 (+ Persuasion) | 3 tracked + 5 watched |
| Persuasion category | Tracked (Medium trigger for targeted manipulation) | Downgraded (folded into the Model Spec’s behavioural side) |
| Nuclear & Radiological | Tracked | Downgraded to Watched |
| Pause commitment | “Pause deployment if Critical triggers” | Retained but softened (“on consideration”) |
| External review conditions | Explicit | Loose (“the Safety Advisory Group may include external advisors”) |

Official narrative (openai.com/safety/preparedness):

  • “Focus on genuinely high-risk categories”
  • “Reduce the compliance burden of Low / Medium to avoid slowing routine iteration”
  • “Persuasion is better covered by Model Spec and user-layer policy”

5. Academic critique: arXiv 2509.24394 and “no guaranteed practice”

The September 2025 paper arXiv 2509.24394, Does OpenAI’s Preparedness Framework Make Binding Safety Commitments? (authors include GovAI and SaferAI researchers), reaches the headline conclusion:

The 2025 OpenAI Preparedness Framework does not guarantee any AI risk mitigation practices.

  1. “Safeguard sufficiency” adjudication is entirely internal
    • The repeated phrasing “sufficient safeguards” is never externalised as an auditable standard
    • The full Safety Advisory Group membership is not publicly disclosed
  2. The “on consideration” clause allows ex post re-interpretation
    • The text’s “weighing considerations including capability, deployment scope, …” allows any mitigation decision to be legitimated
  3. The “capability–mitigation” mapping is not a hard binding
    • Reaching High does not automatically trigger a specific measure; the text uses “may include” rather than “shall”
  4. The text contains no “shall not deploy” hard prohibitions
    • In contrast with the original pause commitment in Anthropic’s RSP, Preparedness v2 lacks even that
Commentary from independent observers reaches similar conclusions:

  • Zvi Mowshowitz (Don’t Worry About the Vase): version-by-version breakdown of v1 vs. v2 wording changes; his characteristic argument is that v2 reads closer to marketing than to an operative risk framework
  • Stuart Russell (UC Berkeley): in repeated 2025 interviews, treats Preparedness v2 alongside RSP v3 as a systemic retreat in industry self-regulation
  • FLI (Future of Life Institute) AI Safety Index: OpenAI Preparedness’s score was materially reduced after v2 release
  • Markus Anderljung (GovAI Director): repeated 2025 commentary noting that voluntary commitments without hard-constraint text are unsustainable under competitive pressure

6. v2’s first operation: GPT-5.4 “High cyber” case (March 2026)

On release of GPT-5.4 in March 2026, the official OpenAI blog “Deploying GPT-5.4 under Preparedness Framework v2” announced:

Following Preparedness evaluations, GPT-5.4 has been assessed at High capability in the Cybersecurity category.

Response measures:

  1. Tiered deployment
    • ChatGPT surface: blocks reverse-engineering and exploit-development requests
    • API: Trusted Access Program (TAP), open only to security researchers vetted by human review
    • GPT-5.4-Cyber (14 April 2026): stand-alone endpoint with full cyber capabilities
  2. Asynchronous monitoring
    • Zero Data Retention (ZDR) customers: asynchronous blocking + ex post audit
  3. Trusted Access vetting
    • Identity verification + employer verification + use-case declaration + NDA
    • The number of approved applicants has not been officially disclosed, even to an order of magnitude (only informal press estimates exist)
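The tiered-deployment measures above amount to a routing policy over access surfaces. The following is a minimal sketch of that logic; the surface names, vetting fields, and return values are assumptions for illustration, not OpenAI’s implementation.

```python
from dataclasses import dataclass

@dataclass
class Requester:
    surface: str                       # "chatgpt" | "api" | "cyber_endpoint"
    tap_vetted: bool = False           # passed identity/employer/use-case/NDA review
    request_is_exploit_dev: bool = False

def route(req: Requester) -> str:
    """Illustrative gating for a High-cyber model under v2 (not official logic)."""
    if req.surface == "chatgpt":
        # Consumer surface: reverse-engineering / exploit-development requests
        # are refused outright; everything else is served.
        return "refuse" if req.request_is_exploit_dev else "serve"
    if req.surface == "api":
        # API access requires Trusted Access Program (TAP) vetting.
        return "serve" if req.tap_vetted else "deny_unvetted"
    if req.surface == "cyber_endpoint":
        # Stand-alone GPT-5.4-Cyber endpoint: full capability, vetted users only.
        return "serve_full" if req.tap_vetted else "deny_unvetted"
    return "deny_unknown_surface"
```

Note what the sketch makes visible: every branch gates *who* can ask, not *what* the model can do, which is exactly the SaferAI critique below.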

Critique: triggering the threshold ≠ restricting the capability

  • Apollo Research April 2026 blog post: TAP vetting strength is lower than Anthropic’s equivalent “safeguarded deployment” mechanism for Claude Opus 4.7
  • SaferAI: post-High mitigations are primarily at the access-control layer; the model’s underlying uplift has not been reduced
  • UK CAISI (formerly UK AISI): recommended stricter deployment restrictions during pre-deployment testing; OpenAI adopted a weaker variant

7. Governance structure: Safety Advisory Group and Safety & Security Committee

Safety Advisory Group (SAG)

  • Composition: internal safety team leads + a small number of external advisors (full membership not fully public)
  • Function: advises on Preparedness evaluation conclusions and deployment decisions
  • Authority boundary: no veto authority — final decisions rest with the CEO and the Safety & Security Committee

Safety & Security Committee (SSC, established May 2024)

| Item | Detail |
|---|---|
| Context of founding | Established after the dissolution of the Superalignment team (May 2024) and the departures of Jan Leike and Ilya Sutskever |
| First chair | Sam Altman (concurrently CEO) |
| September 2024 chair rotation | Zico Kolter (CMU) assumes the chair; Altman shifts to “member” |
| Members | Kolter, Bret Taylor (board chair), Adam D’Angelo, Nicole Seligman |
| Authority | Reviews safety matters; reports to the board |
| Controversy | Members are simultaneously board members, putting independence in question |

Structural critiques (Helen Toner 2024 TED talk, Tim O’Reilly, Gary Marcus):

  • SSC’s “self-evaluation” problem: less an independent oversight body than an internal company safety-reporting mechanism
  • After the November 2023 board episode, members with external independent judgement (Toner, McCauley) had departed
  • By contrast, the Anthropic Long-Term Benefit Trust has, in formal authority at least, a “director-removal” pathway; OpenAI’s 2024–2025 restructuring has weakened the check that the non-profit body holds over the commercial entity
8. Cross-lab comparison

| Dimension | OpenAI Preparedness v2 | Anthropic RSP v3 | Google DeepMind FSF v3 | xAI |
|---|---|---|---|---|
| Release date | 15 April 2025 | 24 February 2026 | April 2026 | None |
| Architecture | Risk categories × 2-tier thresholds | ASL capability levels | CCL + TCL | None |
| Pause commitment | Weak (“if necessary”) | Rescinded | No explicit commitment | N/A |
| External review | SAG (includes external members) | Risk Reports with external review | FSF-report publication | N/A |
| First trigger case | GPT-5.4 High cyber (March 2026) | Claude Opus 4 ASL-3 biochem (May 2025) | Gemini 3 Pro TCL manipulation (November 2025) | N/A |
| Core academic critique | arXiv 2509.24394 | Abandonment of pause | TCL thresholds ambiguous | No framework |

Common pattern: the “loosening” of 2025–2026

All three labs have retreated from commitments, in different directions:

  • Anthropic: rescinded pause; shifted to “industry-shared obligation”
  • OpenAI: consolidated thresholds; downgraded Nuclear / Persuasion priority
  • DeepMind: deleted Gemini “military prohibition” (2024); expanded CCL (2025) but without pause

Structural implication: absent hard-law backstops, competitive pressure drives the weakest commitment to become the common floor (race-to-the-bottom dynamics). This is the consistent observation of Mowshowitz, Anderljung, and Toner, and the same reasoning behind California SB 53, Article 55 of the EU AI Act, and other proposed legislation.

9. Regulatory interfaces

| Regime | Relevant provisions | Role of Preparedness |
|---|---|---|
| EU AI Act | Art. 55 systemic-risk mitigation | Citable as “state-of-the-art” mitigation practice for compliance |
| EU GPAI Code of Practice | Safety & Security chapter | OpenAI partially reserves (has not fully accepted the chapter’s obligations) |
| California SB 53 | §22757.11 frontier-developer protocol | Preparedness may serve as the required “written protocol” |
| White House Voluntary Commitments (July 2023) | Capability-evaluation commitments | Preparedness provides compliance evidence |
| Seoul Commitments (May 2024) | Frontier AI safety commitments | Preparedness is OpenAI’s corresponding document under this 16-company framework |

10. Industry practice: how Preparedness shapes deployment decisions

The actual decision chain, as inferable from public information:

  1. Pre-training complete → internal eval + baseline capability benchmarks
  2. Preparedness Team evaluates → scoring the 3 tracked + 5 watched categories
  3. External red team evaluations (METR / Apollo / UK CAISI / US CAISI) → reports submitted
  4. SAG deliberation → advisory memorandum
  5. SSC approval + CEO sign-off → deployment scope decided
  6. Release of System Card + Preparedness tables → public disclosure of the conclusions
  7. Ongoing monitoring → post-deployment re-evaluation (GPT-5.4 → v2.2 addendum)

Gating cases observed:

  • o1 release (December 2024): Apollo’s disclosed “scheming” results led to deployment adjustments (disabling certain tool-use, adding CoT monitoring)
  • Sora 2 (September 2025): Safety Systems’ influence-operations and child-safety evaluations delayed release by weeks
  • GPT-5.4 (March 2026): the High-cyber trigger delayed default ChatGPT roll-out by several weeks and moved distribution to TAP mode

These illustrate that Preparedness has affected timing but has not affected whether to release. Contrasted with Anthropic’s delayed ASL-3 activation for Opus 4, or DeepMind’s staged CBRN roll-out for Gemini 3, OpenAI’s Preparedness remains the weakest on “stopping.”