
EU — Data and Training

| Rule | Relationship to training data |
| --- | --- |
| GDPR | Lawful basis for personal data; purpose limitation |
| EU AI Act | Art. 10 data governance + art. 53 training-data summary |
| GPAI Code of Practice | Chapter 2 copyright + standardised training-data summary |
| DSM Copyright Directive (2019/790) | TDM exception (arts. 3, 4) |
| DSA | VLOP visibility of content used for training |
| CNIL (France) AI guidance | GDPR × AI training-data compliance guidance (12+ papers) |
GDPR requirements bearing on training data:

  • Lawful basis (art. 6): consent / contract / legal obligation / legitimate interest (commonly used).
  • Special categories (art. 9): biometrics, health, political views — stricter.
  • Purpose limitation (art. 5): can the original purpose of collection cover “training an AI model”? Still contested.
  • Transparency (arts. 13–14): notification to data subjects.
  • Right to erasure / right to be forgotten (art. 17): how can data already embedded in model weights be “deleted”?

DPA actions:

  • Garante (IT) vs ChatGPT (temporary ban Mar 2023; €15M fine Dec 2024).
  • CNIL (FR) AI action plan (multiple guidance papers).
  • DPC (IE) vs Meta: pause on training Llama with EU user data (Jun 2024).
  • Hamburg DPA — position on whether weight embedding constitutes personal-data processing.

AI Act art. 10 — requirements for training / validation / test data in high-risk AI systems:

  • Relevant, sufficiently representative and, to the best extent possible, free of errors and complete.
  • Consider the geographic, behavioural, and functional characteristics of the intended use.
  • Bias detection and mitigation.
  • Special-category data: may be processed to mitigate bias, but must comply with GDPR exceptions.

Art. 10 runs in parallel with GDPR, not in place of it. Companies need two sets of documentation.

DSM art. 3: TDM by research organisations for scientific research cannot be opted out of by rightsholders (mandatory research exception).

DSM art. 4: commercial TDM may be excluded by the rightsholder via a machine-readable opt-out.

  • The machine-readable form of the “reservation of rights” is the focal controversy.
  • robots.txt, ai.txt, and the TDMRep standard are all candidates.
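As a concrete illustration of one candidate, a TDMRep-style reservation is a JSON file served at /.well-known/tdmrep.json. The sketch below generates such a file in Python; the path prefixes and policy URL are hypothetical placeholders, and the field names follow the W3C TDMRep draft:

```python
import json

# Hypothetical TDMRep reservation file, to be served at /.well-known/tdmrep.json.
# Field names follow the W3C TDM Reservation Protocol draft; the paths and the
# policy URL are illustrative placeholders, not a real site's policy.
tdmrep = [
    {
        "location": "/articles/",        # path prefix the reservation covers
        "tdm-reservation": 1,            # 1 = rights reserved, TDM not permitted
        "tdm-policy": "https://example.com/tdm-policy.json",
    },
    {
        "location": "/press/",
        "tdm-reservation": 0,            # 0 = no reservation for this prefix
    },
]

print(json.dumps(tdmrep, indent=2))
```

Under the same draft, an equivalent reservation can also be signalled per response via a `tdm-reservation` HTTP header.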

AI Act art. 53 references DSM art. 4: GPAI providers must respect rightsholders’ opt-outs.

Under AI Act art. 53, all GPAI providers must:

  • Publish a “sufficiently detailed” summary of training data.
  • Template issued by the AI Office in Jul 2025.
  • Contents: primary data-source categories (web / licensed / user-generated / synthetic, etc.), languages, approximate scale.
  • Does not require listing each dataset.
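Purely to illustrate the kind of information the summary covers, a machine-readable version might be organised along these lines. The field names and values below are invented for this sketch and are not the AI Office template's actual schema:

```python
# Invented, simplified structure for a public training-data summary.
# The official AI Office template uses its own layout; this only mirrors
# the categories of information described above.
training_data_summary = {
    "provider": "ExampleAI",              # hypothetical provider
    "model": "example-gpai-1",            # hypothetical GPAI model
    "source_categories": [                # primary data-source categories
        "web-crawled",
        "licensed",
        "user-generated",
        "synthetic",
    ],
    "languages": ["en", "de", "fr"],      # main languages represented
    "approximate_scale": "~10^13 tokens", # order of magnitude, not an exact count
    "per_dataset_listing": False,         # art. 53 does not require one
}

for key, value in training_data_summary.items():
    print(f"{key}: {value}")
```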

This is the world’s first mandatory training-data disclosure requirement.

CNIL, Hamburg DPA, and the EDPB’s Dec 2024 opinion:

  • “Publicly accessible” is not “lawfully usable”.
  • The legitimate-interest route requires a three-step test (LIA).
  • Special-category data (art. 9) cannot, in principle, be processed merely because it is public.

EDPB Dec 2024 opinion on legitimate interest for ChatGPT-style models:

  • Legitimate interest can serve as a basis, but requires a rigorous LIA.
  • Data-subject objections (art. 21) may demand deletion or cessation of use.
3. Machine-readability of copyright opt-outs

In current practice:

  • Some media companies publish opt-outs via robots.txt / ai.txt.
  • Major publishers sign licensing agreements (OpenAI × Axel Springer, Financial Times, News Corp, etc.).
  • 2025 RSL (Robots Exclusion Standard for LLMs) is in development.

Erasure and trained models:

  • The “right to be forgotten” (GDPR art. 17) is technically difficult to honour for already-trained models.
  • Machine unlearning research lags behind legal expectations.
  • Hamburg DPA 2024 position: model weights do not constitute “personal data” (contested).
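Several of the opt-out mechanisms above are published through robots.txt. As a minimal sketch of how a crawler could honour such a file, Python's standard urllib.robotparser can evaluate the group that applies to a named AI crawler (the bot names and URL are examples; GPTBot is OpenAI's published crawler token):

```python
from urllib import robotparser

# Example robots.txt that opts one AI crawler out of the whole site
# while leaving other agents unrestricted.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# The AI crawler is excluded; a generic crawler is not.
print(rp.can_fetch("GPTBot", "https://example.com/article.html"))         # False
print(rp.can_fetch("NewsReaderBot", "https://example.com/article.html"))  # True
```

Whether a robots.txt signal like this satisfies DSM art. 4's “machine-readable” requirement is exactly the controversy noted above.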
| Dimension | EU | China | US |
| --- | --- | --- | --- |
| Comprehensive privacy law | GDPR | PIPL | None (state-law patchwork) |
| AI-specific data provisions | AI Act art. 10 | Generative AI Interim Measures art. 7 | None |
| Mandatory training-data summary | Yes | No | No |
| Copyright TDM exception | Yes, opt-out permitted | No express provision | Fair-use defence |
| Children’s data | GDPR art. 8 | PIPL art. 31 | COPPA |
| Biometrics | GDPR art. 9 | PIPL art. 28 | Illinois BIPA and other state laws |

The EU has the most systematised training-data governance among the three jurisdictions.