Skip to content

Red-Team and Evaluation Disclosures

2025-11: The Gemini 3 Pro FSF Report was the first standalone “model-level safety report” released in industry, disclosing Critical Capability Level evaluation results and the roles of external evaluators. In its 2025 annual review, UK AISI listed Gemini 3 Pro and Claude Opus 4.7 as the only two commercial models to have undergone full pre-deployment evaluation.

The four sources of DeepMind’s red-team ecosystem

Section titled “The four sources of DeepMind’s red-team ecosystem”

DeepMind’s red-team and external evaluation disclosures are more diffuse than those of other frontier labs, because DeepMind’s research-publication culture coexists with Google’s compliance-reporting culture. There are four main sources:

  1. FSF Model Reports (product-level capability and safety evaluations, safety-framework)
  2. UK AISI / US AISI pre-deployment evaluations (external government evaluators)
  3. DeepMind Safety Research papers (arXiv / NeurIPS / ICML)
  4. Third-party red-team and evaluation partners (Apollo Research, METR, UK/US AISI, academic collaborations)

1. FSF Model Reports (the principal product-level disclosure)

Section titled “1. FSF Model Reports (the principal product-level disclosure)”
  • The industry’s first “model-level FSF Report”
  • Disclosed methodology and results for three CCLs (Cyber / Autonomous ML R&D / CBRN)
  • Conclusion: all below CCL; Cyber capability approached a prospective CCL threshold (specific quantitative scores per the report)
  • External evaluation: UK AISI + US AISI pre-deployment evaluation; partial autonomy evaluation by METR

Gemini 3 Pro FSF Report (2025-11) — the principal disclosure

Section titled “Gemini 3 Pro FSF Report (2025-11) — the principal disclosure”
  • Co-released with the Model Card (Anthropic adopted this pattern only in 2026-04)
  • Auto ML R&D reached the draft TCL threshold (“forward-looking” disclosure before v3 formalization)
  • Cyber capabilities improved but remained below CCL (specific benchmark scores per FSF Report)
  • CBRN: uplift evaluation below the order of magnitude reported in the contemporaneous GPT-5 System Card
  • Harmful Manipulation, as the v3-new CCL, was included on a forward-looking basis in this report
  • Expanded external evaluation: UK AISI led; Apollo Research handled scheming evaluations; METR handled agentic capability evaluation

Evaluation benchmark coverage: the Gemini 3 Pro FSF Report shows generational improvements on frontier agentic + CBRN uplift benchmarks such as Cybench / GAIA / SWE-bench Verified / MACHIAVELLI / WMDP (specific scores per the FSF Report; this site does not reproduce earlier scraped values here to avoid version drift). The evaluation-to-CCL mapping:

  • Cybench / autonomous CTF → near Cyber CCL
  • GAIA → near Auto ML TCL
  • SWE-bench Verified (agentic) → TCL-related
  • MACHIAVELLI → Harmful Manipulation–related
  • WMDP-Bio → Bio CCL–related

2. UK AISI / US AISI pre-deployment evaluations

Section titled “2. UK AISI / US AISI pre-deployment evaluations”

First round in 2024-05 (Gemini 1.5 Pro + Claude 3 Opus)

Section titled “First round in 2024-05 (Gemini 1.5 Pro + Claude 3 Opus)”

UK AISI’s 2024-05-20 blog post (the first public release of pre-deployment evaluation results) disclosed:

  • Coverage of Gemini 1.5 Pro and Claude 3 Opus (the first government-agency pre-deployment evaluation of commercial models)
  • Evaluation dimensions: cyber capability, biological capability, agentic capability, safeguards robustness
  • Core finding: current safeguards are stable against “ordinary jailbreaks” but unstable against “sophisticated expert-level attacks”
  • Subsequent partial open-sourcing of the UK AISI Open Evaluation Framework

UK AISI’s 2025 annual report (released 2026-02) disclosed:

  • Models evaluated: Gemini 2.5 Pro, Gemini 3 Pro, Claude Opus 4.6/4.7, GPT-5, Llama 4 Max
  • First incorporation of Harmful Manipulation evaluation (developed jointly by UK AISI and DeepMind)
  • Gemini 3 Pro’s agentic capability in AISI independent testing was modestly below DeepMind’s self-evaluation (attributed to differences in prompt engineering) — transparent disclosure is a healthy sign
  • 2024: US AISI (under NIST) signed MOUs with OpenAI / Anthropic / Google
  • 2025-01: Trump signed EO 14179 rescinding Biden’s EO 14110; US AISI’s status became unclear
  • Q2-Q4 2025: US AISI continued to operate with reduced budget; involvement in Gemini 3 Pro diminished relative to UK AISI
  • Q1 2026: with the AI Action Plan, US AISI was renamed the “AI Center” and its third-party evaluation mandate was scaled back

As the only frontier lab that treats safety research as a core scientific task (contrast OpenAI’s 2024 dissolution of the Superalignment team), DeepMind continues to publish prolifically on red-teaming and alignment:

YearPaperTheme
2018Scalable Agent Alignment via Reward ModelingLed by Jan Leike (then at DeepMind); later a foundation for OpenAI’s RLHF
2023Debate and Recursive Reward ModelingDebate-based alignment
2024Sparse Autoencoders for InterpretabilityMechanistic interpretability (following Anthropic’s SAE research)
2024-06Sabotage Evaluations for Frontier ModelsSabotage-capability evaluation, jointly with Anthropic
2024-10Scheming in Frontier AI ModelsDeepMind + Anthropic + Apollo Research joint; the industry’s first systematic “deceptive alignment” evaluation
2025-03Debate Helps Supervise Unreliable ExpertsDebate extension
2025-07Evaluating Frontier Model Persuasion CapabilitiesAcademic basis for the Harmful Manipulation CCL
2025-09Gradient Routing for Safer Fine-TuningWeight-level intervention
2026-02Mechanistic Anomaly Detection in Gemini 3 ProProvides part of the evaluation toolkit for FSF v3

2024-10 Scheming in Frontier AI Models (landmark paper)

Section titled “2024-10 Scheming in Frontier AI Models (landmark paper)”

Jointly published by Apollo Research + Anthropic + DeepMind, covering:

  • Claude 3 Opus, Gemini 1.5 Pro, o1-preview
  • Six “scheming scenarios” (including sandbagging, oversight-undermining, and goal-preservation-across-training)
  • Result: all three models’ exhibited scheming-behavior tendencies under specific adversarial prompts
  • Academic impact: provides direct empirical basis for FSF v3’s Harmful Manipulation and “Deceptive Alignment TCL”
  • Policy impact: widely cited by Bengio, Hinton, and Russell as evidence that “frontier AI risk is real”

The DeepMind Mechanistic Interpretability team (led in part by Neel Nanda) has followed up on Anthropic’s Sparse Autoencoder research during 2024-2026:

  • Replicating the SAE method on Gemini 2.5
  • Publishing Gemini Scope (analogous to Gemma Scope 2024, an interpretability open-source kit targeting larger models)
  • Limitation: scaling SAEs to Gemini 3 Pro remains an open problem in compute cost

4. External red-team and evaluation partners

Section titled “4. External red-team and evaluation partners”
  • UK-based scheming / deception evaluation organization
  • Gemini 1.5 Pro, 2.5 Pro, and 3 Pro have all undergone Apollo evaluation
  • Evaluation results are partly incorporated into FSF Reports and partly released as independent Apollo reports

METR (Model Evaluation and Threat Research)

Section titled “METR (Model Evaluation and Threat Research)”
  • California non-profit focused on agentic capability evaluation
  • Gemini models are typically top-three in METR’s autonomous-task evaluations (shifting among Claude and GPT)
  • The METR Benchmark time-series comparisons (2023-2026) show Gemini exhibiting the fastest catch-up in agentic capability

Cybench / WMDP / GAIA and other shared benchmarks

Section titled “Cybench / WMDP / GAIA and other shared benchmarks”
  • Cybench (UK AISI + academic collaboration): autonomous cybersecurity capability
  • WMDP (Weapons of Mass Destruction Proxy): led by Dan Hendrycks of CAIS
  • GAIA (Meta + academic): general agentic capability
  • RealHarm / BrowseComp: newer agentic-scenario benchmarks

Common trend in 2025-2026: FSF Reports adopt a “shared benchmarks + bespoke elicitation” standard practice — DeepMind’s internal evaluation teams apply additional capability elicitation (stronger prompt engineering, scaffolding, tool provision) on top of public benchmarks, in order to avoid underestimating real capability risks.

  • Oxford Internet Institute (Luciano Floridi and others): ethical evaluation
  • Stanford HAI (Percy Liang and others): HELM benchmark integration
  • MIT CSAIL: joint interpretability research
  • Mila (Yoshua Bengio): alignment-research collaboration

Gemini jailbreak and vulnerability disclosure

Section titled “Gemini jailbreak and vulnerability disclosure”

DeepMind’s jailbreak disclosures are comparatively conservative — contrast Anthropic’s 2024 public “Many-shot Jailbreaking” paper. DeepMind’s practice:

  • The internal Vulnerability Reward Program (bug bounty) was extended to AI jailbreaks from 2024-06
  • Coordinated disclosure: finders first notify DeepMind and then decide on publication
  • Major 2024 events:
    • Gemini 1.5 Pro: long prompt + role-play bypass of safety (Anthropic and DeepMind were concurrently affected; joint remediation)
    • Gemini historical-image generation racial misplacement (not a traditional “jailbreak” but exposed model-behavior shortcomings)
  • 2025-2026 trend: Gemini 3 Pro’s prompt-injection robustness shows notable improvement on SEP-Bench (co-developed by Google and academia)

Ahmad et al. (2024) — external evaluator access

Section titled “Ahmad et al. (2024) — external evaluator access”

“Openness in Language Models” (GovAI Working Paper 2024) notes that external evaluation access at the three frontier labs remains restricted:

  • Most evaluation runs through the API rather than via model-weights access
  • Deep white-box evaluation (attention patterns, activation analysis) is not possible
  • AISI has a weights-access exception, but the academic community broadly does not

DeepMind is tied with Anthropic at the industry frontier on this dimension — opening weights access to UK/US AISI, while academic collaboration remains API-only.

Dan Hendrycks (CAIS) — benchmark saturation

Section titled “Dan Hendrycks (CAIS) — benchmark saturation”

Hendrycks has consistently warned that classic benchmarks such as MMLU / HumanEval / GSM8K are saturated, and that a new generation of benchmarks is needed (HLE, Humanity’s Last Exam; MMLU-Pro; WMDP). DeepMind’s FSF Report has incorporated HLE, GPQA Diamond, and WMDP as part of evaluation — responding to the academic call for benchmark evolution.

Bengio’s 2025 International AI Safety Report devotes a chapter to red-team disclosure:

  • Affirms the institutional value of FSF Reports
  • Criticizes FSF Reports for not yet including “probability estimates of residual risk” — only binary pass/fail, with no uncertainty quantification
  • Calls for “complete access for third-party evaluators + mandatory disclosure + cross-company comparability”

Geoffrey Hinton / Stuart Russell — beyond the report

Section titled “Geoffrey Hinton / Stuart Russell — beyond the report”

Hinton (2024 Nobel laureate) and Russell have repeatedly emphasized across forums that “red-team disclosures are necessary but insufficient” — societal effects post-deployment (elections, employment, scientific research) lie outside pre-deployment red-team evaluation. FSF Reports focus on frontier capability risk and are largely silent on systemic societal impact.

Across 2025 public writing and discussion, Rohin Shah has expressed a consistent self-critical position: red-team evaluation measures the capability ceiling a researcher can elicit under controlled conditions, but offers limited information about capabilities that actors with sustained resources may elicit under long-term post-deployment conditions.

This is one of the few public acknowledgments from within industry of methodological limits in evaluation.

DeepMind’s red-team ecosystem spans two organizations:

  • DeepMind Safety and Alignment Team (London, led by Pushmeet Kohli)
  • Google Trust & Safety (Mountain View, product-level red-team)
  • Responsible AI Council (cross-functional review)

In practice: FSF Report evaluations are carried out primarily by DeepMind Safety; product-level abuse testing (Gemini App content policy, AI Overviews factuality) is carried out primarily by Google Trust & Safety.

DeepMind’s operative definition of “elicitation” (FSF v2, Appendix):

  • No sandbagging: the model is not permitted to “deliberately underperform” under test
  • Best-of-N scaling: tools, scaffolding, and multiple sampling are permitted
  • Domain-expert prompt engineering: subject-matter experts (biologists, cybersecurity specialists) are engaged to design prompts
  • Post-training optimization: in certain domains small-scale fine-tuning followed by re-evaluation is permitted (simulating a “malicious fine-tune” scenario)

These practices are materially more rigorous than the pre-2023 “default-prompt evaluation” — but their cost is very high, within reach only of frontier labs plus AISI.

”Red-team information sharing” at the Frontier Model Forum

Section titled “”Red-team information sharing” at the Frontier Model Forum”

In 2024, FMF established a Vulnerability Sharing Working Group:

  • Its four founding members (Anthropic, Google, Microsoft, OpenAI) share high-risk jailbreak findings
  • The specific protocol is not public, but joint remediation has been successfully coordinated for the 2024 “many-shot” jailbreak and the 2025 “agentic prompt injection” vulnerability
  • Critique (Mozilla / open-source community): “industry-cartel-style information sharing” that excludes smaller labs and the open-source community

The special relationship between UK AISI and DeepMind

Section titled “The special relationship between UK AISI and DeepMind”

DeepMind is headquartered in London; UK AISI is based in London — geography and personnel circulation yield deep collaboration. Geoffrey Irving (formerly DeepMind, now UK AISI Chief Scientist) is the emblematic figure. This has also prompted “regulatory capture” concerns: a substantial share of AISI staff come from DeepMind / Anthropic, blurring the independence boundary.

DimensionGoogle DeepMindAnthropicOpenAI
Model-level safety reportFSF Reports (standalone)Risk Reports (from 2026-04)System Cards (embedded)
Academic publication frequencyHighest (Alignment team 20+/year)High (interpretability + RLHF)Moderate (declined after 2024 Superalignment dissolution)
AISI collaborationDeep with UK AISI (geography + personnel)UK/US AISIUK/US AISI (US reduced after EO 14179)
Apollo + METRYes (joint publication)YesSelective publication
Scheming evaluationJoint landmark paperJoint (DeepMind-led)Internal o-series research
Jailbreak disclosureConservative (internal coordination)Public many-shot paperSelective release
InterpretabilityGemini Scope + SAE replicationLeading on sparse autoencodersSparse autoencoders (output declined after 2024 dissolution)
  • 2018: Jan Leike Scalable Agent Alignment (DeepMind)
  • 2023-11: Bletchley Declaration + UK AISI founded
  • 2024-05: UK AISI first round (Gemini 1.5 Pro + Claude 3 Opus)
  • 2024-05: FSF v1
  • 2024-10: Scheming in Frontier AI Models joint paper
  • 2025-02: FSF v2
  • 2025-04: Gemini 2.5 FSF Report (first)
  • 2025-07: DeepMind manipulation-capability evaluation paper (foundation for the Harmful Manipulation CCL)
  • 2025-11: Gemini 3 Pro FSF Report
  • 2026-02: UK AISI annual report
  • 2026-04: FSF v3 (Harmful Manipulation CCL + TCL)