China — Data and Training
Relevant rules
| Rule | Relationship to training data |
|---|---|
| PIPL (2021) | Lawful basis for personal information; sensitive data; cross-border |
| CSL (2017) | Data localisation (CII) |
| DSL (2021) | Important-data risk assessment and export |
| TC260-003-2024 | 5% threshold for corpus lawfulness; source diversity |
| Generative AI Interim Measures (2023) | Article 7, the training-data-specific provision |
Three layers of training-data compliance in China
Law layer (PIPL + DSL + CSL)
- Personal information: lawful basis, notice and consent, separate consent for sensitive information, minors.
- Important data: identification, risk assessment, export review.
- Data localisation: data collected domestically by CII operators must, in principle, remain stored domestically.
Departmental-rule layer (Generative AI Interim Measures, art. 7)
Article 7 of the Interim Measures for the Management of Generative AI Services (《生成式人工智能服务管理暂行办法》) requires generative AI service providers to carry out pre-training, optimisation training (fine-tuning), and other training-data processing activities in accordance with the law, and to:
- (1) use data and foundation models that have a lawful source;
- (2) where intellectual property is involved, not infringe intellectual property rights lawfully held by others;
- (3) where personal information is involved, obtain the individual's consent or fall within another circumstance provided for by laws or administrative regulations;
- (4) take effective measures to improve the quality of training data, strengthening its authenticity, accuracy, objectivity, and diversity;
- (5) comply with the other relevant provisions of laws and administrative regulations such as the Cybersecurity Law, the Data Security Law, and the Personal Information Protection Law, and with the relevant regulatory requirements of the competent authorities.
Technical-standard layer (TC260-003-2024)
Core quantitative requirements:
- Corpus screening: manual spot checks draw at least 4,000 items at random from the corpus, and a source whose unlawful or harmful content exceeds 5% must not be used for training.
- Source diversity (Chinese / English).
- Annotator training and testing.
- Coverage across 31 risk categories.
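The sampling check in the list above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure prescribed by TC260-003: the function names `sample_harmful_share` and `passes_threshold` and the `is_harmful` callback (standing in for a human reviewer or a classifier) are hypothetical, and the defaults of 4,000 items and 5% simply mirror the figures quoted above. How the standard pairs sample sizes with thresholds should be confirmed against its text.

```python
import random
from typing import Callable, Sequence


def sample_harmful_share(
    corpus: Sequence[str],
    is_harmful: Callable[[str], bool],  # hypothetical reviewer or classifier
    sample_size: int = 4000,            # assumption: figure quoted in the list above
    seed: int = 0,
) -> float:
    """Return the share of flagged items in a random sample of the corpus."""
    if not corpus:
        raise ValueError("corpus is empty")
    rng = random.Random(seed)
    # If the corpus is smaller than the sample size, review all of it.
    k = min(sample_size, len(corpus))
    sample = rng.sample(corpus, k)
    flagged = sum(1 for item in sample if is_harmful(item))
    return flagged / k


def passes_threshold(share: float, max_share: float = 0.05) -> bool:
    """Check the flagged share against a ceiling (assumed here to be 5%)."""
    return share <= max_share
```

A caller would pass the corpus and a review function, then compare the returned share with the ceiling, for example `passes_threshold(sample_harmful_share(corpus, reviewer))`. Fixing the seed keeps the drawn sample reproducible, which helps when the result has to be documented for an assessment.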
Special issues
1. Lawful source and publicly accessible data
The “lawful source” requirement in art. 7 of the Generative AI Interim Measures stacks on top of the “lawful basis” requirement in art. 13 of PIPL:
- Publicly accessible (web scraping) ≠ usable for training.
- User-generated content (UGC): training-authorisation clauses in user agreements are increasingly common, but whether they satisfy PIPL’s separate-consent requirement is disputed.
2. Intellectual property
- Copyright: the revised Copyright Law has not expressly introduced a “text and data mining (TDM) exception”.
- In practice: copyright lawsuits triggered by training data (e.g., a 2024 Shanghai case involving a text-to-image model).
- The contours of fair use are unclear.
3. Cross-border data
Cross-border scenarios for training data:
- Domestic training / foreign storage: this is a data export, which must go through one of three routes (security assessment, SCCs, or certification).
- Foreign training / serving domestic users: the foreign model itself is not directly covered, but onshore service launch requires a separate assessment.
- Cross-border sharing of training datasets: may trigger the blocking provision in art. 36 of the DSL.
4. Synthetic data and human feedback (RLHF)
- Synthetic data (model-generated training data): not specifically regulated, but the authenticity and diversity requirements still apply.
- RLHF annotation data: TC260-003 has a dedicated section; annotators’ labour and data rights have drawn independent attention.
Comparison with the EU and the US
| Dimension | China | EU | US |
|---|---|---|---|
| Personal data | PIPL separate consent (sensitive) + lawful basis | GDPR lawful basis + art. 9 special categories | No federal privacy law (state patchwork) |
| Copyright / TDM | No express TDM exception | DSM TDM exception (opt-out mechanism) | Handled under fair-use doctrine |
| Data quality | TC260-003 quantitative (5% / 90%) | AI Act art. 10 quality requirements | NIST RMF principles |
| Cross-border | PIPL + DSL + CSL triple mechanism | Schrems II + DPF | Permissive after EO 14179 |
| Training-data summary | No mandatory disclosure | AI Act art. 53 mandatory | None |