Model Card
2025-11 milestone: The Gemini 3 Pro Model Card was released concurrently with the FSF Report — the first industry instance of a paired “technical report + standalone safety report” model card format. Anthropic’s Claude Opus 4.7 Card (2025-12) subsequently adopted a similar structure.
Google’s origins in the model card paradigm
Google is one of the academic birthplaces of the model card paradigm. The 2019 FAT* paper "Model Cards for Model Reporting" (arXiv:1810.03993), by Mitchell, Gebru, Raji, and others, proposed the model card standard, and Google's internal PAIR and Ethical AI teams productized it in the same period. Yet a tension persists between this early academic contribution and the quality of Google's own model cards; see the academic critique section below.
Evolution of Gemini technical reports
| Version | Release date | Technical report / Model card | Key capabilities / milestones |
|---|---|---|---|
| Gemini 1.0 (Ultra/Pro/Nano) | 2023-12-06 | Gemini: A Family of Highly Capable Multimodal Models | First “natively multimodal” training; MMLU 90.0 (Ultra) |
| Gemini 1.5 Pro | 2024-02-15 | Gemini 1.5 Technical Report | Million-token context (extensible to 10M); MoE architecture |
| Gemini 1.5 Flash | 2024-05 | Updated 1.5 Report | Low-latency distillation |
| Gemini 2.0 Flash | 2024-12 | deepmind.google/technologies/gemini/flash/ | First “agentic-first” design (Project Astra / Mariner) |
| Gemini 2.5 Pro | 2025-03 | Gemini 2.5 Technical Report | Deep Think reasoning mode; MMMU 84.0 |
| Gemini 2.5 Flash | 2025-04 | Same | "Thinking budget" toggle |
| Gemini 2.5 FSF Report | 2025-04 | First paired safety document | See safety-framework |
| Gemini 3 Pro | 2025-11 | Gemini 3 Pro Model Card + standalone FSF Report | Agent-core; GPQA Diamond / SWE-bench Verified aligned with contemporary frontier (exact scores per official Model Card) |
| Gemini 3 Ultra | 2026-Q1 (expected) | Announced | — |
| Gemini 3.5 series | 2026-Q2 (expected) | — | — |
Structure of the 2025-11 Gemini 3 Pro Model Card
The Gemini 3 Pro Model Card is among the most comprehensive model cards released to date in industry, organized into seven sections:
- Model Summary: architecture (sparse MoE variant), training compute (disclosed as “greater than Gemini 2.5 Pro”; specific FLOP figures are not disclosed, continuing Google’s practice of withholding precise compute), training cutoff.
- Capability Evaluations:
  - MMLU / MMLU-Pro (knowledge)
  - GPQA Diamond (research-level scientific QA)
  - AIME 2024 / 2025 (mathematics)
  - MATH-500 (foundational mathematics)
  - LiveCodeBench (programming, continuously updated)
  - SWE-bench Verified (real-world code modification tasks)
  - MMMU / MathVista (multimodal)
  - Video-MME (video understanding)
  - HumanEval / MBPP (classic programming)
- Long Context: needle-in-haystack (text/video/code tri-modal) achieving near-perfect recall at 1M-2M tokens.
- Tool Use & Agents: agentic evaluations including TAU-bench, GAIA, Browsecomp.
- Multilinguality: cross-lingual MMLU-Pro / MGSM across 140+ languages.
- Known Limitations: hallucination tendency, weak temporal reasoning, domain-specific biases (gender / race / religion).
- Responsible Deployment: SynthID watermarking, content moderation, refusal policies, FSF Report cross-references.
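The needle-in-a-haystack methodology behind the long-context numbers above is straightforward to sketch: bury a unique fact at varying depths in filler text and check whether it is retrieved. A toy harness follows (illustrative only; the real evaluation runs the model over the assembled context, which is stubbed here with a substring search):

```python
def build_haystack(needle: str, depth: float, n_filler: int = 1000) -> str:
    """Bury a 'needle' sentence at a fractional depth inside filler text."""
    filler = [f"Filler sentence number {i} about nothing in particular."
              for i in range(n_filler)]
    filler.insert(int(depth * len(filler)), needle)
    return " ".join(filler)

def recall_across_depths(answer_fn, needle: str, expected: str, depths) -> float:
    """Fraction of insertion depths at which the answer contains the needle fact."""
    hits = sum(expected in answer_fn(build_haystack(needle, d)) for d in depths)
    return hits / len(depths)

# Stub 'model': a substring search stands in for an LLM call over the context.
toy_model = lambda ctx: "4242" if "magic number is 4242" in ctx else "unknown"

score = recall_across_depths(toy_model, "The magic number is 4242.",
                             "4242", [0.0, 0.25, 0.5, 0.75, 1.0])
print(score)  # → 1.0
```

Published haystack evaluations vary both depth and total context length; the "tri-modal" variant swaps the text filler for video frames or code files.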
Comparison with the Gemini 1.0 Model Card
The 2023-12 Gemini 1.0 Model Card was 8 pages and focused mainly on static benchmarks such as MMLU / GSM8K / HumanEval. The Gemini 3 Pro Model Card runs to roughly 60 pages, with additions including:
- Agentic behavior evaluations (tool use, multi-turn planning, long-horizon tasks)
- Honesty and deception evaluations (MACHIAVELLI / TruthfulQA / DeepMind’s in-house Sycophancy benchmark)
- Sandbagging detection (whether the model deliberately underperforms)
- Dangerous Capability subset (quantitative figures provided in the FSF Report)
- Quantified cultural bias (BBQ / CrowS-Pairs + DeepMind’s in-house GlobalBias)
Core capability narrative
1M-10M context (the Gemini 1.5 differentiator)
The 1M-10M context window of Gemini 1.5 Pro was an engineering victory built on DeepMind's work in MoE and Ring-Attention-class algorithms. The technical report disclosed over 99% recall on a 2M-token code-repository haystack and, for 10M-token video haystacks sampled at 1 fps, localization of a specific 30-second clip.
Academic critique: the 2025 "Context Rot" study by the Chroma team found that reasoning quality in Gemini 1.5 degrades noticeably beyond 200K tokens, even though retrieval remains accurate. DeepMind partially responded in the Gemini 2.5 / 3 Pro reports by introducing an "Effective Context" metric rather than relying solely on haystack recall.
Deep Think (the Gemini 2.5 reasoning paradigm)
Gemini 2.5 Pro introduced Deep Think in 2025-03: long chains-of-thought combined with multi-path search and self-verification. On math benchmarks such as AIME 2024, its results sit in the same tier as the contemporaneous o1 series and Claude 3.7 Sonnet (specific scores per the technical report). Rohin Shah (DeepMind Alignment) noted in public research discussion during 2025 that Deep Think training incorporated scheming evaluations as counter-example data, reducing the model's tendency toward "self-rationalizing deception" in long chains-of-thought.
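Deep Think's internals are not public; the closest published analogue of "multi-path search plus self-verification" is self-consistency decoding: sample several independent reasoning chains and keep the majority answer. A minimal sketch, with the model stubbed by a fixed answer sequence:

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_fn, n_paths: int = 8):
    """Draw several independent reasoning paths and return the majority answer."""
    answers = [sample_fn() for _ in range(n_paths)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / len(answers)

# Stub sampler: stands in for sampling chains-of-thought from the model.
paths = cycle(["42", "42", "41", "42", "43", "42", "42", "41"])
answer, agreement = self_consistency(lambda: next(paths), n_paths=8)
print(answer, agreement)  # → 42 0.625
```

The agreement ratio doubles as a cheap confidence signal: low agreement across paths flags questions where the model's reasoning is unstable.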
Agentic (the Gemini 3 Pro positioning)
Gemini 3 Pro is explicitly positioned as an "agent-first model":
- Project Mariner (browser agent) co-evolved with Gemini 3 Pro
- Project Astra (real-time multimodal assistant)
- Native Function Calling / Code Execution integration in Google AI Studio
- SWE-bench Verified performance sits in the same frontier band as Claude Opus 4.x and GPT-5 (specific scores per each vendor’s official Model Card)
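Native function calling follows the now-standard pattern: the model emits a structured tool call, and the runtime dispatches it to registered code. A generic sketch of that loop's core (the registry, tool name, and JSON schema here are hypothetical illustrations, not Google's actual API):

```python
import json

# Hypothetical tool registry -- generic shape, not Google's actual schema.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Execute a model-emitted call of the form {"name": ..., "args": {...}}."""
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["args"])

result = dispatch('{"name": "get_weather", "args": {"city": "Zurich"}}')
print(result)  # → Sunny in Zurich
```

In a real agent loop the tool result is fed back to the model as a new turn; benchmarks like TAU-bench score whole multi-turn trajectories of such calls.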
Academic critique and methodological disputes
Section titled “Academic critique and methodological disputes”Raji & Gebru on the implementation of the model card standard
In a 2024 FAccT retrospective, Raji assessed model card implementation across eight major labs. Her observation was that Google was the earliest adopter but its quality has varied in recent years: during 2023-2024 Google released extremely abbreviated PaLM 2 model cards (fewer than five pages, a sharp contrast with OpenAI's contemporaneous GPT-4 System Card). Quality improved markedly from Gemini 2.5 onward in 2025, though training data disclosure remains insufficient.
Benchmark contamination
In Q3 2025, the Anthropic Alignment Team published commentary ("Evaluation Integrity in Frontier Models") noting that LiveCodeBench and SWE-bench Verified face contamination risks in frontier-model training sets. DeepMind's response in the Gemini 3 Pro Card was to use the "contamination-free split" of SWE-bench Verified together with LiveCodeBench Pro (time slices strictly later than the training cutoff).
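The time-slice defense is simple to state: keep only evaluation items published strictly after the training cutoff, so no item can have leaked into training. A sketch with hypothetical problem IDs and dates:

```python
from datetime import date

TRAINING_CUTOFF = date(2025, 1, 1)  # hypothetical cutoff, for illustration

problems = [
    {"id": "lcb-101", "released": date(2024, 11, 3)},
    {"id": "lcb-250", "released": date(2025, 3, 18)},
    {"id": "lcb-311", "released": date(2025, 6, 2)},
]

# Keep only problems released strictly after the cutoff.
clean_split = [p["id"] for p in problems if p["released"] > TRAINING_CUTOFF]
print(clean_split)  # → ['lcb-250', 'lcb-311']
```

The approach only controls direct leakage of the items themselves; near-duplicates and solution write-ups published after the fact remain an open problem.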
François Chollet (ARC-AGI) and the anti-benchmark position
Chollet released ARC-AGI-2 / ARC-AGI-3 in 2024-2025. Even with Deep Think, Gemini 2.5 remains orders of magnitude below human performance on these benchmarks (specific scores per the official ARC Prize Foundation leaderboard). Chollet's central argument is that knowledge benchmarks such as MMLU / GPQA are saturated, and that genuine abstract generalization remains far from solved. Google DeepMind's Gemini 3 Pro Card cites ARC-AGI-2 as a limitations indicator for the first time, a notable shift in posture.
Gary Marcus’s sustained critique
Marcus has long criticized LLM model cards for "substituting benchmark scores for genuine understanding". In his Marcus on AI posts following the 2025-11 Gemini 3 launch, he has continued to argue that even with Gemini 3 Pro's strong performance on knowledge benchmarks such as GPQA, it still produces a material proportion of fabricated citations in scenarios requiring genuine retrieval, such as legal and medical references (independent third-party evaluations show problems of comparable magnitude in Claude Opus 4.x). In his view, this reflects the deeper limits of the "benchmark-is-everything" model card paradigm.
Shane Legg’s (DeepMind co-founder / Chief AGI Scientist) evaluation philosophy
Legg has expressed a consistent public position across multiple venues: model cards should not be treated as marketing collateral, and evaluations should proactively cover the tasks at which a model is most likely to fail rather than merely the benchmarks on which it excels. This amounts to a public acknowledgment of the benchmark-arms-race problem. Rohin Shah's team subsequently added a "capability misalignment" section to the Gemini 3 Pro Card, a proactive disclosure of what the model can do versus what it should do.
Training data disclosure and compliance
Section titled “Training data disclosure and compliance”EU AI Act Art. 53(1)(d) training summary obligation
As a "systemic risk GPAI" (training compute above the 10²⁵ FLOP threshold), Gemini 3 Pro is required to publish a training data summary. Following Google's signing of the GPAI CoP in 2025-08, the summary was released according to the CoP Transparency Chapter template:
- Public web crawl (subject to robots.txt + Web & App Controls)
- Public YouTube video (over creator protest, with figures including MrBeast publicly opposing)
- Compliant subsets of Google Books / Scholar
- Synthetic data (including Gemini 2.5 self-supervised generation)
- Third-party licensed corpora
Undisclosed: specific source proportions, the ratio of synthetic to real data, and the specific logic for identifying copyrighted content.
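For scale, regulators and labs commonly estimate training compute with the 6ND approximation (FLOP ≈ 6 × parameters × training tokens). The figures below are hypothetical, chosen only to illustrate crossing the 10²⁵ FLOP presumption; Google discloses neither number for Gemini:

```python
def training_flop(params: float, tokens: float) -> float:
    """Rough training-compute estimate: ~6 FLOP per parameter per token."""
    return 6 * params * tokens

THRESHOLD = 1e25  # EU AI Act systemic-risk presumption for GPAI

# Hypothetical, illustrative figures only.
flop = training_flop(params=1e12, tokens=2e13)  # 1T params, 20T tokens
print(f"{flop:.1e}", flop > THRESHOLD)  # → 1.2e+26 True
```

Because the presumption attaches to cumulative compute, a model can cross the threshold without any single training run doing so.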
Data opt-out mechanisms
Section titled “Data opt-out mechanisms”- Web crawl: disabled via the
Google-Extendeduser-agent in robots.txt (launched 2023-09) - Google user data: disabled via Web & App Activity Controls
- YouTube creators: a third-party training opt-out was added in 2024-07 (applies only to non-Google models; Google’s own Gemini training continues to use the data)
- Workspace user content: not used for training by default (enterprise privacy commitment)
Critique: the EFF and the Mozilla Internet Health Report 2025 note that the Google-Extended opt-out is not retroactive: previously crawled data is not deleted.
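The `Google-Extended` opt-out above is an ordinary robots.txt directive; a minimal site-wide example:

```
User-agent: Google-Extended
Disallow: /
```

Note that `Google-Extended` is a product token, not a separate crawler: Googlebot continues to crawl disallowed pages for Search, but they are excluded from Gemini training.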
Industry practitioner perspective
Section titled “Industry practitioner perspective”The AIPR (AI Principles Review) process
AIPR is Google's cross-product internal AI review process:
- Research teams submit AIPR proposals
- Cross-functional review by Google Responsible AI + DeepMind Safety + Legal + PM
- Decisions are recorded internally; model cards disclose only the final judgments that have cleared AIPR
- AIPR underwent substantial scale-up during 2024-2025, though process transparency remains low (one of the legacy effects of the Timnit Gebru episode)
The Vertex AI Model Garden governance layer
When enterprise customers access Gemini through Vertex AI Model Garden:
- They automatically receive a Model Card Summary (simplified version)
- They can download the full Model Card PDF + FSF Report
- The Responsible AI Toolkit provides safety classifiers, grounding, and citation checks
This represents industry’s deepest enterprise integration of the model card to date — the model card is not merely documentation but part of the SDK interface.
Documentation layering across Gemini App / AI Studio / Vertex
| Deployment surface | Visible documentation | Update cadence |
|---|---|---|
| Gemini consumer app | Simplified “About Gemini” | With product updates |
| Google AI Studio | Model Card summary + FSF Report links | With model versions |
| Vertex AI (enterprise) | Full Model Card PDF + Responsible AI Toolkit | With model versions |
| Academic / external eval | Full Technical Report on arXiv | At release |
Peer comparison
| Dimension | Gemini 3 Pro (2025-11) | Claude Opus 4.7 (2025-12) | GPT-5 (2025-08) |
|---|---|---|---|
| Technical report length | ~60 pages | ~50 pages | ~40 pages |
| Training compute disclosure | Specific FLOP not disclosed | Not disclosed | Not disclosed |
| Training data summary | EU CoP template | EU CoP template | EU CoP template (partially withheld) |
| Safety report separation | Yes (FSF Report) | Yes (Risk Reports) | Embedded in System Card |
| Agentic evaluation depth | Deepest | Deep | Deep |
| Benchmark contamination disclosure | Yes | Yes | Partial |
| Bias evaluation | BBQ + GlobalBias | BBQ + Anthropic Constitution eval | BBQ + System Card |
Key timeline
- 2019: "Model Cards for Model Reporting" paper (Mitchell, Gebru, Raji, and others, with Google employee authorship)
- 2023-12: Gemini 1.0 Technical Report
- 2024-02: Gemini 1.5, million-token context
- 2024-12: Gemini 2.0 Flash, agentic debut
- 2025-03: Gemini 2.5 Pro + Deep Think
- 2025-04: Gemini 2.5 FSF Report (first model card with paired safety report)
- 2025-11: Gemini 3 Pro Model Card + FSF Report
- 2026-Q1: Gemini 3 Ultra expected
- 2026-Q2: Gemini 3.5 series expected
Cross-links
- Company-level overview: Google DeepMind index
- Usage policy and AI Principles: Usage Policy
- FSF framework and Reports: Safety Framework
- Red-teaming and external evaluation: Red-Team Disclosures
- Comparison: Claude Model Cards
- EU AI Act Art. 53 training summary obligation: EU AI Act