# Data and Training
## Why this topic matters
Training data is the central battleground of LLM-era AI governance: data determines model capability, copyright exposure, personal-information compliance, and cross-border transfer obligations. The three jurisdictions diverge most sharply on this issue, and litigation and legislation are advancing in parallel.
## The four core questions
- Whose data can be used? (legal basis / copyright / consent)
- How is data quality assured? (representativeness, bias, quality requirements)
- Must it be disclosed? (training-data summary, transparency)
- How is cross-border processing handled? (data export, the Brussels Effect)
## Snapshot
| Dimension | China | United States | EU |
|---|---|---|---|
| Comprehensive privacy law | PIPL (2021) | No federal law (state-law patchwork) | GDPR (2018) |
| AI-specific data provisions | Generative AI Interim Measures art. 7 | None | AI Act art. 10 |
| Mandatory training-data summary | None | None | AI Act art. 53 — world’s first |
| Copyright treatment | No express TDM exception | Fair-use defence (in litigation) | DSM TDM exception (opt-out permitted) |
| Children’s data | PIPL art. 31 | COPPA (federal) | GDPR art. 8 |
| Biometrics | PIPL art. 28 | Illinois BIPA and other state laws | GDPR art. 9 |
| Cross-border | PIPL + DSL + CSL | Permissive after EO 14179 | Schrems II + DPF |
## Scholarly debates
### Foundational literature
- Solove (2006), “A Taxonomy of Privacy”: the theoretical bedrock of privacy protection.
- Westin (1967), Privacy and Freedom: a classic of informational privacy.
- Nissenbaum (2010), Privacy in Context: the contextual-integrity theory — does training an LLM “repurpose” data in a way that violates context?
### Training data × personal information
- Kaminski (Colorado) & Urban (Berkeley) (2021), “The Right to Contest AI”.
- Edwards & Veale (2017), “Slave to the Algorithm”.
- Helberger, Diakopoulos et al. (2023): a systematic review of data governance in the AI Act.
- Leiser & Schneider (2024), “Scraping and Fair Use in AI Training”.
- CNIL’s two core 2025 recommendation sets: GDPR applies across the AI-system lifecycle (the French regulator’s authoritative position).
### Training data × copyright
- Lemley & Casey (2021), “Fair Learning” (Texas Law Review): argues that fair use should cover ML/LLM training.
- Henderson, Li, Jurafsky et al. (2023), “Foundation Models and Fair Use” (Stanford CRFM): a technical-perspective fair-use analysis.
- Samuelson (Berkeley): sustained critical work on AI copyright.
- Grimmelmann (Cornell): continuing analysis of NYT v. OpenAI and related litigation.
- Matthew Sag, “The New Legal Landscape for Text Mining and Machine Learning” (Journal of the Copyright Society of the USA, 2019).
- Lehman & Johnson: analysis of copyright ownership in generative-AI outputs.
### Data quality and bias
- Bender, Gebru, McMillan-Major, Shmitchell (2021), “On the Dangers of Stochastic Parrots” (one of the most influential FAccT papers): bias, energy use, and responsibility in LLM training data.
- Birhane, Prabhu et al., “Multimodal datasets: misogyny, pornography, and malignant stereotypes”.
- Buolamwini & Gebru (2018), “Gender Shades”: the foundational paper on facial-recognition bias.
- AI Now Institute reports: structural bias in training datasets.
### Training-data disclosure and transparency
- Gebru, Morgenstern et al. (2021), “Datasheets for Datasets”: dataset-documentation standards.
- Mitchell et al. (2019), “Model Cards for Model Reporting”.
- Diakopoulos & Koliska (2017): systematic work on algorithmic transparency.
- Almada & Petit (2023): analysis of what “sufficiently detailed summary” means in AI Act art. 53.
### Cross-border data and AI
- Kuner (Oxford): classic work on cross-border data flows.
- Greenleaf: ongoing tracking of Asia-Pacific data protection (including international comparison of China’s PIPL).
- Matt Sheehan (Carnegie): continuing analysis of the implementation of China’s Measures on Security Assessment of Cross-Border Data Transfers.
## Core controversies
### 1. “Publicly accessible” ≠ “usable for training”
- EDPB Opinion 28/2024 on AI models (Dec 2024): legitimate interest can serve as a legal basis for training, but only with a rigorous legitimate-interests assessment (LIA).
- CNIL 2025 guidance: being public does not equate to lawful processing, especially for special categories (biometrics, political views, etc.).
- Hamburg DPA discussion paper (2024): model weights do not themselves store personal data, so holding them is not processing of personal data (a contested position).
- United States: the fair-use defence + hiQ Labs v. LinkedIn precedent.
- China PIPL art. 13: processing of public information must be “within a reasonable scope”; multiple 2024 cases concerned training-data scraping.
### 2. The machine-readability of copyright opt-outs
- DSM Directive art. 4: commercial TDM is permitted, but rightsholders may reserve their rights (opt out).
- In practice, opt-outs are signalled via robots.txt, ai.txt, the TDMRep standard, and the 2025 RSL (Really Simple Licensing) standard; a minimal lookup sketch follows this list.
- AI Act art. 53: GPAI providers must respect these opt-outs (even for models trained outside the EU).
- Controversy: is a machine-readable “reservation of rights” legally effective? Most civil-law jurisdictions have not explicitly confirmed this.
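A minimal sketch of what “respecting machine-readable opt-outs” can look like in a crawler pipeline, assuming the `requests` library, Python’s standard `urllib.robotparser`, and a TDMRep-style `tdm-reservation` response header. The crawler name `ExampleAIBot` and the combination logic are illustrative assumptions, not any provider’s actual compliance policy.

```python
"""Sketch: check machine-readable TDM opt-out signals before using a URL for training."""
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "ExampleAIBot"  # hypothetical crawler name


def robots_allows(url: str) -> bool:
    """Return True if robots.txt permits USER_AGENT to fetch the URL."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return True  # no robots.txt reachable; treating that as unrestricted is a policy choice
    return rp.can_fetch(USER_AGENT, url)


def tdm_reserved(url: str) -> bool:
    """Return True if the response carries a TDMRep-style reservation header (assumed name)."""
    resp = requests.head(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    return resp.headers.get("tdm-reservation", "0").strip() == "1"


def may_use_for_training(url: str) -> bool:
    """Conservative combination: skip the URL if either signal reserves rights."""
    return robots_allows(url) and not tdm_reserved(url)


if __name__ == "__main__":
    print(may_use_for_training("https://example.com/article"))
```

The legal controversy sits one layer above this code: even a crawler that honours every such signal still faces the open question of whether the signals themselves are a valid reservation of rights.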
### 3. Implementing a “right to deletion” after training
- GDPR art. 17 / PIPL art. 47: data subjects may request deletion.
- Technical obstacles: individual records cannot be excised from trained weights, and machine-unlearning research is far from operational (a naive sketch follows this list).
- Regulators are split:
- Hamburg DPA 2024: weights do not constitute personal data → no deletion duty triggered.
- Italy Garante: enforcement action against OpenAI requiring user-data deletion and correction rights.
- CNIL 2025: requires deletion feasibility to be considered at the design stage.
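To make the “technical obstacles” point concrete, here is a toy sketch of one naive technique from the machine-unlearning literature (gradient ascent on a “forget set”), assuming PyTorch. The model, data, and hyperparameters are invented for illustration; this is not a regulator-endorsed or production deletion method.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier standing in for a trained model whose weights blend all training data.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Hypothetical "forget set": examples tied to a data subject who requested deletion.
forget_x = torch.randn(64, 16)
forget_y = torch.randint(0, 2, (64,))

for step in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(forget_x), forget_y)
    (-loss).backward()  # gradient *ascent* on the forget set only
    optimizer.step()

# Caveat: nothing is literally "deleted" from the weights; they are merely nudged,
# and accuracy on retained data can degrade without additional constraints.
```

The mismatch with GDPR art. 17 is visible in the sketch: the operation is approximate, global, and hard to verify, which is part of why the Hamburg and Garante positions above diverge.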
### 4. The “level of detail” of a training-data summary
- AI Act art. 53: requires a “sufficiently detailed” public summary of training content; the AI Office published a template in July 2025.
- Almada & Petit (2023) critique: the “sufficiently detailed” standard is flexible enough that GPAI providers can get away with minimal disclosure.
- In practice: OpenAI / Anthropic / Google DeepMind began disclosing “categories of data sources” in 2025–2026 but not specific datasets.
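For concreteness, a hypothetical sketch of what category-level disclosure might look like as structured data. The field names, categories, and shares below are assumptions made for illustration; they are not the AI Office template or any provider’s actual summary.

```python
from dataclasses import dataclass, field


@dataclass
class DataSourceCategory:
    name: str                # e.g. "publicly available web data"
    modalities: list[str]    # e.g. ["text"] or ["image", "text"]
    approx_share: str        # coarse share of the corpus, e.g. "60-80%"
    opt_out_respected: bool  # whether DSM art. 4 reservations were honoured


@dataclass
class TrainingDataSummary:
    provider: str
    model: str
    data_cutoff: str
    categories: list[DataSourceCategory] = field(default_factory=list)


# Illustrative, category-level disclosure: sources are described, never enumerated.
summary = TrainingDataSummary(
    provider="ExampleAI",        # hypothetical provider
    model="example-gpai-1",      # hypothetical model name
    data_cutoff="2025-06",
    categories=[
        DataSourceCategory("publicly available web data", ["text"], "60-80%", True),
        DataSourceCategory("licensed publisher content", ["text"], "5-15%", True),
        DataSourceCategory("synthetic data", ["text"], "5-15%", True),
    ],
)
```

The controversy in this subsection is precisely how coarse such categories may be while still counting as “sufficiently detailed”.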
## Industry-practice lens
### Training-data strategies of leading AI companies
| Company | Principal sources | Public disclosure | Copyright licensing | Response to EU AI Act art. 53 |
|---|---|---|---|---|
| OpenAI | Web + licensing (FT, Axel Springer, News Corp) | Low | Multiple paid licences | Plans to publish data summary |
| Anthropic | Web + licensed | Medium (partial detail) | Fewer public licence deals | Data summary published |
| Google DeepMind | Google Search index + YouTube + licences | Low | Pre-existing media relationships | Publication planned |
| Meta Llama | Web + Meta platform data (disputed) | Technical reports include considerable detail | Few paid licences | Already disclosed |
| Mistral | Web + partners | Medium | EU copyright compliance | Published (GPAI CoP requirement) |
| Alibaba Qwen | Chinese-first + partners | Technical reports include detail | Unclear | No CoP signature |
| Baidu ERNIE | Chinese-first + partners | Partial | Unclear | No CoP signature |
| DeepSeek | Web | Detailed technical reports | Unclear | No CoP signature |
### Notable litigation (2023–2026)
- NYT v. OpenAI & Microsoft (Dec 2023 onward): the most closely watched copyright lawsuit; ongoing in 2026.
- Bartz v. Anthropic / Kadrey v. Meta / Tremblay v. OpenAI: authors’ class actions.
- Andersen v. Stability AI: text-to-image copyright.
- Getty Images v. Stability AI (UK + US): image copyright.
- Clearview AI: BIPA enforcement in Illinois plus actions by multiple national DPAs worldwide.
### Notable DPA actions (2023–2026)
- Italy Garante vs ChatGPT (Mar 2023 ban; Dec 2024 €15M fine).
- Italy Garante vs Replika (Feb 2023 order, AI companionship × minors).
- Italy Garante vs DeepSeek (Jan 2025 provisional ban).
- CNIL vs Clearview AI (2022 €20M fine).
- DPC Ireland’s inquiry into Meta’s use of EU user data to train Llama.
- Hamburg DPA’s contested 2024 position on model weights.
## Key changes in Q1 2026
- AI Act art. 53 training-data-summary obligation took effect on Aug 2, 2025; the first GPAI providers began disclosing in Q1 2026.
- GPAI Code of Practice Chapter 2 (Copyright) signed Aug 1, 2025; Meta publicly declined to sign the Code.
- US fair-use litigation continues, with some courts issuing preliminary rulings favourable to AI developers (including the partial fair-use win for Anthropic in Bartz v. Anthropic).
## Related rules and companies
### Related rules
- EU: GDPR; AI Act arts. 10 / 53; GPAI Code of Practice Chapter 2; DSM Copyright Directive.
- China: PIPL, DSL, CSL; Generative AI Interim Measures art. 7; TC260-003-2024 (sampling audits must find ≤ 5% unlawful content in training corpora; a toy audit sketch follows this list).
- United States: NIST AI RMF; state laws (CCPA/CPRA, Illinois BIPA, Tennessee ELVIS Act, etc.); no federal AI-specific data law.
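A toy sketch of the sampling-audit idea behind the TC260-003-2024 threshold cited above, assuming a placeholder content classifier. The sample size, the classifier, and the data layout are illustrative assumptions; the real standard prescribes its own sampling and labelling rules.

```python
import random


def is_unlawful(text: str) -> bool:
    """Placeholder classifier; a real audit would use vetted labelling, not keyword matching."""
    banned_terms = {"<banned-term-1>", "<banned-term-2>"}  # hypothetical terms
    return any(term in text for term in banned_terms)


def audit_corpus(corpus: list[str], sample_size: int = 4000, threshold: float = 0.05) -> bool:
    """Randomly sample the corpus and compare the flagged share to the 5% cap."""
    sample = random.sample(corpus, min(sample_size, len(corpus)))
    flagged = sum(is_unlawful(doc) for doc in sample)
    rate = flagged / len(sample)
    print(f"flagged {flagged}/{len(sample)} sampled documents ({rate:.2%})")
    return rate <= threshold


if __name__ == "__main__":
    demo_corpus = ["example document"] * 10_000  # stand-in corpus
    print("passes threshold:", audit_corpus(demo_corpus))
```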
### Related companies
See the “training data” analysis on individual company pages: OpenAI / Anthropic / Mistral (EU-compliance model) / DeepSeek (open-source-transparency model) / Alibaba Qwen (Chinese + open).