Article 11 and Annex IV require providers of high-risk AI systems to document how models were developed and selected. Experiment tracking and reproducibility together provide the evidence that a competent authority or notified body needs to verify the claims in the AISDP.
The EU AI Act's conformity assessment process requires organisations to explain how a deployed model was selected from among candidate models and to reproduce specific training runs on request. Experiment tracking captures the full specification of each run, including code version, data version, hyperparameters, random seed, compute environment, training metrics, and final evaluation results. Tools such as MLflow Tracking and Weights & Biases record these artefacts automatically and provide comparison and visualisation capabilities. Reproducibility requires that all inputs to a training run are retrievable and that the compute environment can be recreated. For CPU-based training, dependency pinning, seed recording, data versioning, and containerised environments enable identical re-execution. GPU training introduces non-deterministic operations that prevent bitwise reproducibility; statistical reproducibility across multiple runs with confidence intervals provides a defensible alternative. The AISDP reproducibility specification must document the level achieved, mechanisms used, and known limitations. Organisations without dedicated tooling can use structured experiment log spreadsheets, though this approach is adequate only for systems with infrequent training runs.
Model development involves experimentation across different architectures, hyperparameters, feature sets, and training strategies. Experiment tracking captures this exploration systematically and connects it to the formal version control system. When the conformity assessment requires the organisation to explain why a particular model architecture and hyperparameter configuration was chosen, experiment tracking records provide the evidence base. They demonstrate that alternatives were evaluated, that the chosen configuration was selected on merit, and that the selection criteria aligned with the compliance requirements documented in the AISDP.
Experiment tracking tools such as MLflow Tracking, Weights & Biases, Neptune, and Comet record the parameters, metrics, and artefacts for each training run. For compliance purposes, the tracking system must capture the full specification of each run: code version, data version, hyperparameters, random seed, and compute environment. It must also record the training metrics at each epoch or iteration, the final evaluation metrics on the holdout test set, and the resulting model artefact with its content hash.
MLflow Tracking is the most widely adopted open-source option. It logs parameters, metrics, and artefacts per run and provides a UI for run comparison and visualisation. Weights & Biases adds collaboration features such as shared dashboards and run notes, along with richer visualisation including learning curves and parameter importance plots.
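Whatever tool is used, the run record reduces to a small, well-defined specification. A minimal sketch of that record using only the standard library (the function and field names are hypothetical, not any tool's API):

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path

def content_hash(path: Path) -> str:
    """SHA-256 content hash of the model artefact, stored in the run record."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_run(run_dir: Path, *, code_version: str, data_version: str,
               hyperparams: dict, seed: int, metrics: dict,
               artefact: Path) -> dict:
    """Write a run specification covering the fields listed in the text:
    code version, data version, hyperparameters, random seed, compute
    environment, evaluation metrics, and the artefact's content hash."""
    spec = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "code_version": code_version,     # e.g. a git commit SHA
        "data_version": data_version,     # e.g. a DVC data hash
        "hyperparameters": hyperparams,
        "random_seed": seed,
        "environment": {                  # compute environment snapshot
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "metrics": metrics,
        "artefact_sha256": content_hash(artefact),
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "run_spec.json").write_text(json.dumps(spec, indent=2))
    return spec
```

A dedicated tracker adds comparison and visualisation on top, but the evidentiary core an assessor needs is exactly this set of fields per run.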
A competent authority or notified body may require the organisation to reproduce a specific training run to verify the claims in the AISDP. Reproducibility requires that the exact code version, data version, and configuration are retrievable, that the compute environment can be recreated, that the random seed is recorded and can be set to produce the same initialisation, and that the training framework's version is pinned, since framework updates can change numerical behaviour even with the same seed.
Full bitwise reproducibility is not always achievable, particularly for GPU-accelerated training where non-deterministic operations such as atomic floating-point additions are common. The AISDP must document the level of reproducibility the system achieves, the factors that may cause variation between runs, and the tolerance bounds within which reproduced results are considered consistent. Where bitwise reproducibility is not achievable, statistical reproducibility (results within defined confidence intervals across multiple runs) is demonstrated and documented by the Technical SME.
For deterministic algorithms on CPU, reproducibility is achievable through dependency pinning, random seed recording, and data versioning. Poetry or Conda lock files pin every dependency, including transitive dependencies, to exact versions. The random seed is logged as a hyperparameter, data is versioned via DVC, and the training environment is captured as a Docker image. Given these four elements, the training run can be re-executed identically.
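The determinism claim can be checked directly: with the seed recorded and then re-set, a CPU-bound stochastic procedure replays identically. A minimal sketch using only the standard library (the training step is a stand-in, not a real model):

```python
import random

def train_step(seed: int, n: int = 5) -> list[float]:
    """Stand-in for a CPU-deterministic training run: every stochastic
    choice flows from the recorded seed."""
    rng = random.Random(seed)      # seed logged as a hyperparameter
    return [rng.uniform(0.0, 1.0) for _ in range(n)]

# Re-executing with the recorded seed reproduces the run exactly.
run_a = train_step(seed=42)
run_b = train_step(seed=42)
assert run_a == run_b
```

The same property is what framework-level seeding aims at in real training code; pinned dependency versions matter because a framework upgrade can change what a given seed produces.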
For GPU training, bitwise reproducibility is often unachievable because GPU operations use non-deterministic algorithms for performance. Parallel floating-point reductions process elements in different orders across executions. NVIDIA CUDA provides a deterministic mode (torch.use_deterministic_algorithms(True) in PyTorch) that forces deterministic operations, but at a significant performance penalty and with some operations unsupported.
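The order sensitivity is a property of floating-point arithmetic itself, not of any particular GPU. A small illustration in plain Python:

```python
# Floating-point addition is not associative, so a parallel reduction
# that sums the same elements in a different order can return a
# different result.
a = 1e16

left_to_right = (a + 1.0) + 1.0   # each 1.0 is absorbed: stays 1e16
paired_first = a + (1.0 + 1.0)    # the 2.0 survives rounding

assert left_to_right != paired_first
```

A GPU reduction over millions of terms hits this effect constantly, which is why identical inputs can yield slightly different trained weights across runs.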
For AISDP purposes, the reproducibility specification should document which level of reproducibility is achieved (bitwise or statistical), the mechanisms used to achieve it, and any known limitations. For GPU-trained models, the Technical SME retains multiple-run results and confidence intervals as Module 5 evidence. Technical Documentation and Evidence details how this evidence integrates into the broader documentation structure.
Experiment tracking can be done manually through a structured experiment log spreadsheet. Columns should include experiment ID, date, hypothesis, all hyperparameters, data version, code commit, random seed, training duration, all declared evaluation metrics, notes, and outcome determination (selected for further development, rejected, or baseline). Model artefacts are stored with the experiment ID as the directory name, and the log is reviewed during model selection to document why the deployed model was chosen.
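The log need not be a literal spreadsheet: a version-controlled CSV with the same columns keeps the records next to the code. A minimal sketch with the standard library (column names follow the text; the helper name is hypothetical):

```python
import csv
from pathlib import Path

COLUMNS = [
    "experiment_id", "date", "hypothesis", "hyperparameters",
    "data_version", "code_commit", "random_seed", "training_duration",
    "metrics", "notes", "outcome",   # selected / rejected / baseline
]

def append_experiment(log_path: Path, row: dict) -> None:
    """Append one experiment record, writing the header on first use."""
    new_file = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

Committing the CSV alongside each experiment keeps the log auditable through the same version control history as the code it describes.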
This manual approach loses visualisation features such as learning curves and parameter importance plots, easy run comparison, and automatic metric logging. It is adequate for systems with infrequent training runs (fewer than ten per quarter). For active experimentation with dozens or hundreds of runs, tools such as MLflow, ClearML, or Neptune are needed; all have free tiers.
MLflow Tracking is the most widely adopted open-source option. Weights & Biases, Neptune, and Comet are also used. All record parameters, metrics, and artefacts per run. MLflow, ClearML, and Neptune offer free tiers suitable for initial compliance work.
The regulation requires documented reproducibility, not bitwise exactness. Where GPU non-determinism prevents identical results, statistical reproducibility with confidence intervals across multiple runs is an accepted approach, provided the AISDP documents the level achieved and the limitations.
A manual log is workable for systems with infrequent training runs (fewer than ten per quarter): a structured spreadsheet with experiment ID, hyperparameters, data version, code commit, random seed, metrics, and outcome determination is adequate. Active experimentation with many runs requires dedicated tooling.
Conformity assessment requires organisations to explain model selection decisions, and experiment tracking provides the evidence base showing alternatives were evaluated and selection aligned with compliance requirements.
The system must capture code version, data version, hyperparameters, random seed, compute environment, training metrics at each epoch, final evaluation metrics, and the model artefact with its content hash.
The AISDP must document the level of reproducibility achieved, factors causing variation, and tolerance bounds. Statistical reproducibility is acceptable where bitwise reproducibility is not achievable.
CPU reproducibility uses dependency pinning, seed recording, data versioning, and containerised environments. GPU training requires statistical reproducibility through multiple runs with confidence intervals.
The specification documents which level of reproducibility is achieved, the mechanisms used, any known limitations, and for GPU models, multiple-run results with confidence intervals as Module 5 evidence.
A structured experiment log spreadsheet with columns for experiment ID, hyperparameters, data version, code commit, random seed, metrics, and outcome determination serves as a manual alternative.
The compliance value becomes apparent during conformity assessment. An assessor reviewing the AISDP's Module 5 needs to understand how the deployed model was selected from among the candidate models that were trained. The experiment tracking system provides this narrative: the set of experiments that were run, the metrics comparison that led to the selection, and the specific run that produced the deployed model. Without experiment tracking, the Technical SME must reconstruct this narrative from scattered logs and team memory. That approach is fragile and error-prone, making it inadequate for regulatory scrutiny.
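Once runs are recorded, the selection step itself is mechanical and therefore auditable. A sketch, assuming run records shaped as dictionaries (field names hypothetical):

```python
def select_deployment_candidate(runs: list[dict], metric: str) -> dict:
    """Pick the run with the best holdout score on the declared metric,
    so the selection rationale is traceable to recorded evidence."""
    if not runs:
        raise ValueError("no candidate runs to select from")
    return max(runs, key=lambda run: run["metrics"][metric])

# Three candidate runs from the experiment log:
candidates = [
    {"run_id": "exp-001", "metrics": {"f1": 0.86}},
    {"run_id": "exp-002", "metrics": {"f1": 0.91}},
    {"run_id": "exp-003", "metrics": {"f1": 0.89}},
]
best = select_deployment_candidate(candidates, metric="f1")
assert best["run_id"] == "exp-002"
```

In practice selection usually weighs several metrics and constraints, but the point stands: the decision is a function of recorded runs, not of recollection.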
The practical alternative is statistical reproducibility: run the training specification multiple times (three to five runs is standard), compute the confidence interval of the evaluation metrics across runs, and declare the model's performance as the mean plus or minus the interval. This approach acknowledges the inherent stochasticity while providing a defensible performance range. Validation and Testing Frameworks covers how confidence intervals feed into validation gates. The gate should compare the lower bound of the interval against the threshold, ensuring that the model meets the declared standard even under stochastic variation.
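Such a gate can be sketched in a few lines, here using a normal approximation over the per-run scores (a t-interval would be more conservative at three to five runs; the function name is hypothetical):

```python
import statistics

def ci_gate(scores: list[float], threshold: float, z: float = 1.96) -> bool:
    """Pass only if the lower bound of the ~95% confidence interval on
    the mean score clears the declared threshold."""
    mean = statistics.mean(scores)
    # standard error of the mean across the repeated runs
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    lower_bound = mean - z * sem
    return lower_bound >= threshold

# Five repeated runs of the same training specification:
scores = [0.912, 0.905, 0.910, 0.908, 0.915]
assert ci_gate(scores, threshold=0.90)       # lower bound clears 0.90
assert not ci_gate(scores, threshold=0.912)  # a single good mean is not enough
```

Gating on the lower bound rather than the mean is what makes the declared performance hold even under run-to-run variation.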
The specification bridges experiment tracking and reproducibility, which are two sides of the same compliance requirement: the ability to reconstruct how a model was produced. Experiment tracking records what happened during each training run; reproducibility ensures that the same run specification would produce the same or statistically equivalent result if re-executed.
CTO of Standard Intelligence. Leads platform engineering and contributes to the PIG series technical content.