Article 10: Data and Data Governance

Article 10 of the EU AI Act requires providers of high-risk AI systems to implement data governance and management practices covering the training, validation, and testing datasets used to build their systems. It is one of the most practically demanding articles in the regulation — requiring documentation of not just what data was used, but how it was collected, prepared, assessed for quality, and examined for biases.

Applies to: Providers of high-risk AI systems as classified under Article 6, including those listed in Annex III, where the system is developed by training models with data. Requirements cover all three dataset types: training, validation, and testing. Providers using pre-trained foundation models must document available training data information and assess known biases in that data.

What Article 10 Actually Requires

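Article 10(1) establishes the core obligation: high-risk AI systems that involve training models with data must be developed on training, validation, and testing datasets that meet the quality criteria in paragraphs 2 to 5. Article 10(2) then requires those datasets to be subject to data governance and management practices appropriate to the system's intended purpose.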

Article 10(3) adds a quality floor: datasets must be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete in view of the intended purpose. These are substantive quality criteria. A governance process alone is not enough; the data must actually meet the standard.

The Six Data Governance Requirements

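Article 10(2) lists the topics that data governance practices must address, as points (a) through (h). This guide groups them into six categories: the bias mitigation measures in point (g) are covered together with the bias examination in point (f), and the data gaps in point (h) together with the suitability assessment in point (e). Each must be documented in the technical documentation required under Article 11.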

Art. 10(2)(a)

Design choices

Document the relevant data categories used for training and the rationale for using them, including how those choices relate to the system's intended purpose.

Art. 10(2)(b)

Data collection processes

Describe how data was collected, the source of each dataset, and the provenance of any third-party datasets, including any conditions or restrictions on their use. For personal data, also record the original purpose of the data collection.

Art. 10(2)(c)

Data preparation

Document labelling procedures, instructions given to human labellers, and any automated pre-processing, filtering, or augmentation steps applied to the data.

Art. 10(2)(d)

Statistical assumptions

Identify and document the assumptions underlying the system's design, in particular what the data is supposed to measure and represent. This includes assumptions about the representativeness of the training population relative to the deployment population.

Art. 10(2)(e)

Availability, quantity, and suitability assessment

Assess whether sufficient data of adequate quality is available for the intended purpose, and document any known gaps in data availability and their effect on system performance.

Art. 10(2)(f)

Examination for biases

Examine training, validation, and testing datasets for possible biases that are likely to affect health and safety, negatively impact fundamental rights, or lead to discrimination prohibited under Union law. Document the examination methodology, findings, and any mitigations applied.
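One way to keep the six categories above from drifting out of sync across splits is to hold them in a structured record and check for blanks. The sketch below is illustrative only: the class name, field names, and example entries are hypothetical and not taken from the Act, and a real record would hold references to full documents rather than one-line strings.

```python
from dataclasses import dataclass, fields

REQUIRED_SPLITS = ("training", "validation", "testing")


@dataclass
class DataGovernanceRecord:
    """One record per dataset split; field names are hypothetical."""
    split: str
    design_choices: str = ""          # Art. 10(2)(a)
    collection_and_origin: str = ""   # Art. 10(2)(b)
    preparation_steps: str = ""       # Art. 10(2)(c)
    assumptions: str = ""             # Art. 10(2)(d)
    suitability_assessment: str = ""  # Art. 10(2)(e)
    bias_examination: str = ""        # Art. 10(2)(f)


def missing_entries(record):
    """Return the names of categories left blank in a record."""
    return [
        f.name
        for f in fields(record)
        if f.name != "split" and not getattr(record, f.name).strip()
    ]


def coverage_report(records):
    """Map each required split to its undocumented categories.

    A split with no record at all is reported with every category missing.
    """
    by_split = {r.split: r for r in records}
    report = {}
    for split in REQUIRED_SPLITS:
        record = by_split.get(split, DataGovernanceRecord(split=split))
        report[split] = missing_entries(record)
    return report


# Invented example entries for a hiring system.
records = [
    DataGovernanceRecord(
        split="training",
        design_choices="CV text and structured work history",
        collection_and_origin="Internal applicant tracking export",
        preparation_steps="Deduplicated; labels from recruiter decisions",
        assumptions="Past applicants resemble future applicants",
        suitability_assessment="Volume adequate; few senior-role examples",
        bias_examination="Selection rates compared across gender",
    ),
    DataGovernanceRecord(split="validation", design_choices="Same schema as training"),
]

report = coverage_report(records)
for split, missing in report.items():
    print(split, "->", missing or "complete")
```

A blank-field check of this kind only shows that something was written down for each split; whether the content is adequate remains a matter for review.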

Data Quality: What “Relevant, Representative, and Error-Free” Means

Article 10(3) requires that all three dataset types satisfy four quality criteria. These apply in parallel with the governance process requirements above — both must be satisfied.

  • Relevant: Data must relate to the system's intended purpose and deployment context. Data collected for a different application may not satisfy this requirement even if structurally similar.
  • Sufficiently representative: Datasets must adequately reflect the population the system will be applied to, including all relevant subgroups. Demographic underrepresentation is a data quality failure, not merely a fairness concern.
  • Free from errors: The standard is 'to the best extent possible'; zero errors are not required. What is required is that known errors are identified, the error rate is within acceptable bounds, and systematic errors are corrected or documented.
  • Complete: The dataset must cover the intended operating domain sufficiently. Known data gaps must be documented and their effect on system performance assessed explicitly.
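The representativeness and completeness criteria above can be given a first numeric check, sketched here under stated assumptions. The reference shares, the 0.10 tolerance, and the field names are invented for illustration; the Act sets no numeric threshold, and a suitable reference population has to be justified for each system.

```python
from collections import Counter


def subgroup_shares(rows, key):
    """Share of rows per subgroup value, ignoring rows where the key is missing."""
    counts = Counter(row[key] for row in rows if row.get(key) is not None)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_gaps(dataset_shares, reference_shares, tolerance):
    """Subgroups whose dataset share falls short of the reference by more than tolerance."""
    gaps = {}
    for group, expected in reference_shares.items():
        observed = dataset_shares.get(group, 0.0)
        if expected - observed > tolerance:
            gaps[group] = round(expected - observed, 3)
    return gaps


def missing_rate(rows, key):
    """Fraction of rows with no value for the given field."""
    return sum(1 for row in rows if row.get(key) is None) / len(rows)


# Invented toy dataset: 10 rows with an age band and an income field.
rows = (
    [{"age_band": "18-34", "income": 30000}] * 6
    + [{"age_band": "35-54", "income": None}] * 2
    + [{"age_band": "35-54", "income": 52000}]
    + [{"age_band": "55+", "income": 41000}]
)

# Assumed shares in the deployment population (hypothetical figures).
reference = {"18-34": 0.40, "35-54": 0.35, "55+": 0.25}

shares = subgroup_shares(rows, "age_band")
gaps = representation_gaps(shares, reference, tolerance=0.10)
income_missing = missing_rate(rows, "income")

print("shares:", shares)
print("under-represented:", gaps)
print("income missing rate:", income_missing)
```

Running the same functions on the validation and testing splits, and recording the results alongside the training figures, keeps the three datasets on equal footing.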

Bias Examination — The Most Commonly Missed Requirement

Article 10(2)(f) requires examination of all three datasets for possible biases — particularly where data reflects past human decisions, contains errors, or reflects socioeconomic disparities. These are mandatory examinations that must be documented, not optional fairness evaluations.

The examination should cover both statistical bias (underrepresentation of certain groups) and historical bias (patterns reflecting past discriminatory decisions). A hiring model trained on historical acceptance decisions from an organisation with a historically homogeneous workforce encodes that pattern — examining for this bias is required, and the examination findings and any mitigations must be documented explicitly.

Where biases are found and cannot be fully eliminated, the residual bias, its effect on outputs, and the mitigations in place must all be documented. This links directly to the Article 9 risk management system, where identified biases must be treated as risks requiring assessment.
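For the historical-bias case described above, a simple starting point is to compare positive-label rates across groups in the dataset itself, before any model is trained. The following is a minimal sketch on invented data. The ratio of lowest to highest rate is one common descriptive statistic, not a test prescribed by Article 10, and the field names and structure of the finding record are assumptions.

```python
from collections import defaultdict


def selection_rates(rows, group_key, label_key):
    """Positive-label rate per group, e.g. the share of applicants hired."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[label_key]:
            positives[row[group_key]] += 1
    return {group: positives[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


# Invented historical hiring labels: group A hired 5 of 10, group B hired 2 of 10.
rows = (
    [{"gender": "A", "hired": True}] * 5
    + [{"gender": "A", "hired": False}] * 5
    + [{"gender": "B", "hired": True}] * 2
    + [{"gender": "B", "hired": False}] * 8
)

rates = selection_rates(rows, "gender", "hired")

# A hypothetical finding record that could be filed with the documentation.
finding = {
    "dataset": "training",
    "attribute": "gender",
    "method": "positive-label rate per group, ratio of lowest to highest",
    "rates": rates,
    "rate_ratio": round(rate_ratio(rates), 2),
    "mitigation": "TO BE DECIDED",
}
print(finding)
```

A low ratio does not by itself establish unlawful discrimination, and a ratio near 1.0 does not rule out other forms of bias; the number is an input to the documented examination, not its conclusion.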

Common Mistakes

Documenting training data only, not validation and test sets

Article 10 applies equally to all three datasets. Providers often focus on training data and neglect to apply the same quality criteria and governance documentation to validation and test splits.

Treating bias examination as an internal evaluation, not a documentation requirement

Many teams run bias evaluations during development but do not formally document them. Article 10 requires the examination and findings to be recorded in the technical documentation — an undocumented evaluation does not satisfy the requirement.

No documentation of known data gaps

Most real datasets have known limitations — geographic gaps, demographic underrepresentation, or time-bounded collection. These gaps must be documented explicitly, including their likely effect on system performance.
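Geographic and time-bounded gaps of the kind listed above can often be detected mechanically and then handed to a person for assessment. The sketch below assumes each row carries a region and a collection date; the field names, the expected region list, and the layout of the gap entries are hypothetical.

```python
from datetime import date


def uncovered_values(rows, key, expected_values):
    """Expected values (e.g. regions in the deployment area) that have no rows."""
    seen = {row[key] for row in rows}
    return sorted(set(expected_values) - seen)


def collection_window(rows, date_key):
    """Earliest and latest collection dates present in the dataset."""
    dates = [row[date_key] for row in rows]
    return min(dates), max(dates)


# Invented example rows.
rows = [
    {"region": "north", "collected_on": date(2021, 3, 1)},
    {"region": "north", "collected_on": date(2022, 11, 15)},
    {"region": "south", "collected_on": date(2022, 6, 30)},
]
expected_regions = ["north", "south", "east"]  # assumed deployment area

gaps = []
for region in uncovered_values(rows, "region", expected_regions):
    gaps.append({
        "type": "geographic",
        "description": f"No records from region '{region}'",
        "expected_effect": "TO BE ASSESSED",
    })

start, end = collection_window(rows, "collected_on")
gaps.append({
    "type": "temporal",
    "description": f"Data collected between {start.isoformat()} and {end.isoformat()} only",
    "expected_effect": "TO BE ASSESSED",
})

for gap in gaps:
    print(gap)
```

The `expected_effect` placeholder is deliberate: the likely effect on system performance needs a reasoned assessment that code cannot supply.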

Assuming GDPR compliance covers Article 10

GDPR governs the lawful processing of personal data. Article 10 governs the quality and governance of data used to train, validate, and test high-risk AI systems. GDPR compliance alone does not satisfy Article 10, and both apply side by side when datasets contain personal data.


Article 10 — Frequently Asked Questions

Does Article 10 apply if we use a pre-trained foundation model?

Yes, but the obligations shift depending on your role. If you fine-tune a foundation model on your own data, Article 10 applies to that fine-tuning dataset fully. If you use the model without fine-tuning, you must still document available information about the model's training data and assess known biases — information the GPAI model provider is required to supply under Article 53.

What does the 'bias examination' actually require in practice?

Article 10(2)(f) requires an examination of all three datasets for possible biases, particularly where data reflects past human decisions, contains errors, or reflects socioeconomic disparities. The examination and its findings must be documented. For a hiring AI, this typically means testing for demographic disparities in model outputs across protected attributes and documenting the approach, findings, and any corrective adjustments.

Does Article 10 cover validation and test data, or only training data?

All three. Article 10(1) covers training, validation, and testing datasets whenever they are used. In practice, validation and test sets receive less documentation attention than training data, a gap that market surveillance authorities may flag.

What if our training data includes special categories of personal data?

Article 10(5) exceptionally permits providers to process special categories of personal data (race, health, biometrics, etc.) where strictly necessary for bias detection and correction in high-risk AI systems, subject to appropriate safeguards including access controls and pseudonymisation. This is a specific derogation: it does not override GDPR and does not create a general lawful basis for using sensitive data in training.
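As an illustration of the pseudonymisation safeguard mentioned above, the sketch below replaces a direct identifier with a keyed hash and separates the special-category attribute into its own restricted table. This is one possible technique under stated assumptions, not a method specified by the Act. The function and field names are hypothetical, key storage and access control are out of scope, and pseudonymisation alone would not satisfy the conditions attached to the derogation.

```python
import hashlib
import hmac
import secrets


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Keyed SHA-256 hash of an identifier; cannot be recomputed without the key."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def split_for_bias_analysis(rows, id_key, special_keys, secret_key):
    """Separate special-category attributes from the working data.

    Returns (working, restricted): both are keyed by pseudonym, but only the
    restricted table holds the special-category attributes.
    """
    working, restricted = [], []
    for row in rows:
        pseudonym = pseudonymise(str(row[id_key]), secret_key)
        working.append({
            "pseudonym": pseudonym,
            **{k: v for k, v in row.items() if k != id_key and k not in special_keys},
        })
        restricted.append({
            "pseudonym": pseudonym,
            **{k: row[k] for k in special_keys},
        })
    return working, restricted


# Demonstration key only; a real key would be generated once and stored securely.
key = secrets.token_bytes(32)

# Invented example rows.
rows = [
    {"applicant_id": "applicant-001", "ethnicity": "group-x", "score": 0.71},
    {"applicant_id": "applicant-002", "ethnicity": "group-y", "score": 0.64},
]

working_rows, restricted_rows = split_for_bias_analysis(
    rows, id_key="applicant_id", special_keys=["ethnicity"], secret_key=key
)
print(working_rows[0].keys())
print(restricted_rows[0].keys())
```

Keeping the restricted table under separate access control lets the bias analysis join on the pseudonym only when needed, which supports the access limits the derogation calls for.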

How does Article 10 relate to Article 9?

They are closely linked. Data quality problems identified under Article 10 — including known biases, representativeness gaps, or limited data availability — are risks that must feed into the Article 9 risk management system. The two documentation sets should cross-reference each other explicitly.
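The cross-referencing described above can be checked mechanically if both documentation sets use stable identifiers. The sketch below assumes a hypothetical layout in which each Article 10 finding carries a `risk_id` pointing at an entry in the Article 9 risk register; the identifiers and field names are invented for illustration.

```python
def unlinked_findings(data_findings, risk_register):
    """IDs of findings whose risk_id is missing or absent from the risk register."""
    known = {risk["id"] for risk in risk_register}
    return [f["id"] for f in data_findings if f.get("risk_id") not in known]


# Invented Article 9 risk register entries.
risk_register = [
    {"id": "R-12", "title": "Lower accuracy for under-represented age bands"},
    {"id": "R-13", "title": "Historical hiring bias reproduced in rankings"},
]

# Invented Article 10 data findings.
data_findings = [
    {"id": "DF-1", "summary": "55+ age band under-represented", "risk_id": "R-12"},
    {"id": "DF-2", "summary": "No records from region 'east'", "risk_id": None},
    {"id": "DF-3", "summary": "Label noise in early records", "risk_id": "R-99"},
]

orphans = unlinked_findings(data_findings, risk_register)
print("findings without a valid risk entry:", orphans)
```

The same check can be run in the opposite direction to find data-related risks that cite no underlying finding.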
