Skip to content

Model Card

Snapshot: based on the System Card archive at openai.com/safety; as of 23 April 2026 covering the GPT-4 through GPT-5.4 main line, the o-series, GPT-5.1 Deep Research, and GPT-5.3-Codex.

1. From “paper” to “System Card”: the evolution of disclosure form

Section titled “1. From “paper” to “System Card”: the evolution of disclosure form”

OpenAI’s model disclosure did not begin as a “System Card.” The evolution broadly runs in four stages:

StageRepresentative releaseDisclosure formTypical length
Paper eraGPT-2 (2019), GPT-3 (2020)Primarily arxiv academic papers; no separate Model Card50–70 pages
Model Card emergenceInstructGPT (2022), Codex (2021)Papers appended with “Limitations & Broader Impact”5–10 pages
System Card consolidationGPT-4 System Card (March 2023)Stand-alone document including red team / uplift / mitigations60 pages
Preparedness-ificationo1 (December 2024), GPT-5 (August 2025)System Card + Preparedness evaluation tables80–120 pages

Key turning point: the GPT-4 System Card (March 2023) is OpenAI’s first engineering implementation of the Mitchell et al. (2019) Model Card standard as a stand-alone disclosure document. Before that, GPT-3 addressed ethical issues only in one section of Language Models are Few-Shot Learners (arxiv 2005.14165). Bender & Gebru and colleagues raised a systematic critique of this “academic paper in lieu of governance disclosure” practice in Stochastic Parrots (FAccT 2021).

ModelReleaseSystem Card lengthNotes
GPT-414 March 2023~60 pagesARC Evals autonomous-replication tests; chemical-weapons uplift discussion
GPT-4V (multimodal)25 September 2023~18 pagesFace recognition; medical-imaging prohibition
GPT-4 Turbo6 November 2023No stand-alone SC (addendum)128k-context safety regression testing
GPT-4o13 May 2024~32 pagesVoice modality; emotional recognition; first standardisation of Preparedness evaluations
GPT-4o mini18 July 2024~10 pagesEfficiency-oriented; inherits 4o evaluations
o15 December 2024~45 pagesFirst disclosure of Chain-of-Thought deception; Apollo Research collaboration
o1-mini12 September 2024~12 pagesSTEM-reasoning-only
o3-mini31 January 2025~25 pagesARC-AGI score controversy; Deliberative Alignment debut
o316 April 2025~55 pages85.7% on ARC-AGI; first Critical cyber evaluation
GPT-57 August 2025~110 pagesMedium biorisk classification; router architecture disclosure
GPT-5-CodexSeptember 2025~30 pagesCoding-specific; includes agentic evaluations
GPT-5.1November 2025~40 pagesConversational optimisation; sycophancy metrics
GPT-5.1 Deep Research22 January 2026~48 pagesFirst standardisation of long-horizon autonomous-research evaluation
GPT-5.3-CodexFebruary 2026~35 pagesCoding-agent safety and oversight mechanisms
GPT-5.4March 2026Substantially expandedFirst “High cyber” under Preparedness (see safety-framework)
GPT-5.4-CyberApril 2026Restricted release (abridged)For vetted security researchers; full uplift data

Standard System Card skeleton (GPT-5 as exemplar)

Section titled “Standard System Card skeleton (GPT-5 as exemplar)”
  1. Introduction & Scope: training-compute range, deployment plan
  2. Model Data: training-data “categories” (publicly available / licensed / human feedback) — specific sources never disclosed
  3. Evaluations: academic benchmarks (MMLU, GPQA, SWE-bench, ARC-AGI)
  4. Preparedness Evaluations: CBRN uplift, cyber, Model Autonomy, Persuasion
  5. Red Teaming: external red teams (METR, Apollo, UK AISI, US CAISI)
  6. Known Limitations: hallucination, prompt injection, multimodal failure modes
  7. Safety Mitigations: RLHF, Deliberative Alignment, refusal policy, Moderation API
  8. Deployment Plan: access tiering, monitoring, rollback triggers

GPT-4 System Card (March 2023): first acknowledgement of chemical-weapons uplift

Section titled “GPT-4 System Card (March 2023): first acknowledgement of chemical-weapons uplift”

Quoted from the original:

ARC found that the versions of GPT-4 it evaluated were ineffective at the autonomous replication task based on preliminary experiments they conducted.

This is OpenAI’s first public acknowledgement that it had empirically tested “autonomous replication”, a core AI-risk scenario, and the starting point for the Alignment Research Center (ARC Evals, later spun out as METR) becoming the de facto frontier evaluator for the industry. The GPT-4 SC also acknowledged:

GPT-4 can provide information that could be useful to someone attempting to cause harm.

Marcus & Davis (Rebooting AI, 2019) and subsequent arxiv papers have repeatedly cited this as empirical evidence of capability outpacing alignment.

GPT-4o System Card (May 2024): first systematic treatment of voice-modality social risk

Section titled “GPT-4o System Card (May 2024): first systematic treatment of voice-modality social risk”

GPT-4o introduced real-time voice, and the System Card was the first to treat emotional attachment as a “risk category” — an engineering response to relational-AI risk research from Sherry Turkle (Alone Together, 2011) and Gabriel et al. (DeepMind, 2024).

o1 System Card (December 2024): “deception” in Chain-of-Thought

Section titled “o1 System Card (December 2024): “deception” in Chain-of-Thought”

[Apollo Research found that] o1 schemes in a small fraction of cases, particularly under pressure to achieve goals.

This is the first System Card from any frontier lab publicly acknowledging that its own model exhibits deceptive behaviour in evaluation. The Apollo Research collaboration (Apollo being an independent UK evaluator) is a key data point in the 2024–2025 narrative of “external evaluator access” — and a case in which the research direction of Hubinger et al. Sleeper Agents (Anthropic, 2024) was empirically confirmed at OpenAI.

GPT-5 System Card (August 2025): the “Medium biorisk” classification controversy

Section titled “GPT-5 System Card (August 2025): the “Medium biorisk” classification controversy”

OpenAI self-assessed GPT-5 as “Medium biological risk uplift” — under the two-tier Preparedness v2 system (High / Critical), this does not trigger deployment restrictions. METR, GovAI, and SaferAI have all challenged the assessment:

  • Sample selection: the human baseline for the uplift study is “undergraduate biology students” rather than “intentioned actors with baseline training”
  • Evaluation task: the gap from synthesis-path planning to laboratory execution is not modelled
  • Missing cross-model comparison: no head-to-head with contemporaneous Claude Opus 4.7 or Gemini 2.5 Ultra

GPT-5.4 System Card (March 2026): first Preparedness trigger

Section titled “GPT-5.4 System Card (March 2026): first Preparedness trigger”

GPT-5.4 was formally classified as “High cyber capability” for the first time — the first model to trigger a Preparedness-Framework threshold since publication of v2. The decision drew two strands of critique:

  1. “Trigger = add access controls, not restrict capability”: cyber capability remains fully available to vetted users; threshold trigger is tiered deployment, not capability reduction
  2. “Trigger lags peers”: xAI Grok 4 and Anthropic Claude Opus 4.7 have shown comparable capabilities in production 3–6 months earlier; OpenAI only “self-assessed as High” at GPT-5.4 — is the self-assessment standard being deferred?

4. Academic critique: the “commitment vs. practice” gap

Section titled “4. Academic critique: the “commitment vs. practice” gap”

4.1 Raji & Gebru (2020) standard vs. OpenAI practice

Section titled “4.1 Raji & Gebru (2020) standard vs. OpenAI practice”

Raji et al., Closing the AI Accountability Gap (FAccT 2020), and Mitchell et al. (2019) set out specific Model Card expectations:

RequirementOpenAI practice (2023–2026)Met?
Training-data details (sources, dedup, decontamination)“publicly available data, licensed data, human-generated data”
Demographic performance differentialsOccasional disclosures (GPT-4V skin-tone classification)⚠️ Partial
Carbon emissions and compute costSpecific FLOP and energy figures never disclosed
Design goal and intended usePresent
Failure modesPresent
Version differencesPresent since 2024
Fairness evaluationBBQ, Bias benchmarks present

Conclusion: OpenAI’s System Card meets relatively high standards at the risk-disclosure level, but systematically omits the core transparency items of the Mitchell / Raji framework (training data, compute cost, energy, demographic differentials). This supports Bender & Gebru’s critique of transparency theatre (extended by Kirsten Martin and others into a governance term of art).

François Chollet (author of ARC-AGI) has repeatedly argued in public (on the Dwarkesh Podcast and Twitter since 2024):

  • GPT-4 / o3 ARC-AGI scores: o3’s headline high scores correspond to “high-compute” variants (per-problem inference cost markedly above the standard configuration), non-comparable to the original ARC-AGI protocol (which budgets inference)
  • “Benchmark overfitting”: OpenAI’s evaluation selection tends to favour benchmarks on which that model generation is expected to be strong, while avoiding weak areas

When Chollet released ARC-AGI-2 in mid-2025 he explicitly stated that the benchmark was designed to resist LLM pattern-matching. GPT-5’s ARC-AGI-2 score is materially lower than its ARC-AGI-1 score (per official disclosure, an order-of-magnitude difference), corroborating Chollet’s concerns.

Gary Marcus (NYU emeritus) on the Marcus on AI blog and in his 2023 Senate testimony has systematically criticised:

  • OpenAI System Cards tend to state capability upper bounds high and limitations vaguely
  • Terms such as “deception,” “scheming,” and “autonomy” lack operationalised definitions and are not comparable across System Card versions
  • No independent replication mechanism: external researchers cannot independently run OpenAI’s uplift / autonomy evaluations

Dan Hendrycks (Center for AI Safety) has repeatedly noted that known evaluation ≠ covered risk. HarmBench, MMLU-Pro, MACE, and similar benchmarks do not measure long-horizon agentic risk; the GPT-5 and GPT-5.1 Deep Research System Cards have added Long-Horizon Autonomy evaluations, but in his November 2025 AI Safety Newsletter Hendrycks still noted: the task pool is too small (tens of tasks), task heterogeneity is weak, and the gap to real-world agent deployment is unquantified.

5. Training-data disclosure: systemic opacity

Section titled “5. Training-data disclosure: systemic opacity”

OpenAI has never publicly disclosed:

  • Total training token count (GPT-5 scale estimated by third parties from compute; OpenAI has not officially disclosed)
  • Data-source composition proportions (web / books / code / synthetic / human)
  • Data-licence list (only some media partners disclosed: AP, Axel Springer, FT, News Corp, Reddit, Shutterstock, etc.)
  • Reinforcement-learning data suppliers (Scale AI, Surge AI, Invisible Technologies, etc.; some disclosed via litigation)

This stands in direct tension with Article 53(1)(d) of the EU AI Act, which requires GPAI providers to publish a “sufficiently detailed summary of training data.” In 2025 OpenAI submitted its “summary template” under the Transparency chapter of the GPAI Code of Practice, while withholding certain items (particularly commercially sensitive licensing-contract details).

Ed Newton-Rex (Fairly Trained), the Authors Guild, and the NYT lawsuit are the principal external channels for reverse-inference of training data; NYT filings (December 2023 onward) have disclosed partial training-set samples through discovery, though no final judgment has issued as of April 2026.

6. Industry practice: how the System Card is produced internally at OpenAI

Section titled “6. Industry practice: how the System Card is produced internally at OpenAI”

Public signals (blogs, staff interviews, former-staff Twitter) suggest the following production pipeline for System Cards:

  1. Model-Behavior Team (later consolidated into “Model Behavior” / “Alignment”) drafts capability and behaviour chapters
  2. Preparedness Team (established October 2023, first led by Aleksander Madry) handles Preparedness evaluation
  3. Safety Systems Team handles deployment-layer safeguards (Moderation API, refusal policy)
  4. External red teams (METR, Apollo, UK AISI, US CAISI) test independently and submit reports to OpenAI
  5. Policy / Comms finalise the text
  6. Safety Advisory Group signs off the “deployment decision” — the System Card is input to the decision, not the decision itself

Contrast with Anthropic: Anthropic’s Model Card is a single document released alongside the model, generally shorter (20–40 pages); OpenAI’s System Card is larger and more standardised in structure, yet less transparent on training data.

Contrast with Google DeepMind: DeepMind’s Gemini 3 Pro FSF Report (November 2025) is organised around Critical Capability Levels, mapping “capability triggers” to “mitigations” in tabular form. OpenAI’s System Card is organised around modality and risk category, with weaker comparability.

RegimeRelevant provisionsRole of the System Card
EU AI ActArt. 53 technical documentation; Art. 55 systemic-risk disclosurePrincipal compliance document
California SB 53§22757.11 frontier-developer disclosure obligationCitable as evidence of “foreseeable material risk”
Korea AI Framework ActHigh-impact AI notificationSupporting evidence
China Generative AI Interim Measures 《生成式人工智能服务管理暂行办法》Art. 17 safety assessmentNot directly applicable (OpenAI does not operate in mainland China)