and captured metrics as well as model processing source code, logs,
and environment dependencies are artifacts that are created and/or
used in this stage. The Operations Stage requires trained models
and corresponding dependencies such as libraries and runtimes
(e. g. via Docker containers), uses model deployment and monitoring
source code, which is typically either wrapped into a web service with a
REST API for on-demand (online) serving or scheduled for batch
(offline) execution, and captures execution logs and statistics. To
achieve comparability, traceability, and reproducibility of produced
data and model artifacts across multiple lifecycle iterations, it is es-
sential to also capture metadata artifacts that can be easily inspected
afterwards (e. g. model parameters, hyperparameters, lineage traces,
performance metrics) as well as software artifacts.
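For illustration, the following minimal sketch shows how a trained model might be wrapped into a web service with a REST API for on-demand (online) serving, as described above. FastAPI is used merely as one possible framework; the stand-in model, endpoint name, and request schema are illustrative assumptions rather than part of any specific AMS.

```python
# Minimal sketch of wrapping a trained model into a web service with a REST
# API for on-demand (online) serving. FastAPI is one possible framework; the
# stand-in model and endpoint are illustrative only.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]

def model_predict(features: list[float]) -> float:
    # Placeholder for a trained model loaded from the artifact/model store.
    return sum(features) / max(len(features), 1)

@app.post("/predict")
def predict(request: PredictRequest) -> dict:
    # On-demand (online) inference; execution logs and statistics would be
    # captured here for the monitoring component.
    return {"prediction": model_predict(request.features)}
```

In practice, such a service would be run with an ASGI server (e. g. uvicorn) and packaged, together with its environment dependencies, into a Docker container, while its execution logs and statistics are captured for monitoring.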
Manual management of artifacts is simply not efficient due to the
complexity and the required time. To meet the above requirements,
it is necessary to systematically capture any input and output ar-
tifacts and to provide them via appropriate interfaces. ML artifact
management includes any methods and tools for managing ML arti-
facts that are created and used in the development, deployment, and
operation of ML-based systems. Systems supporting ML artifact
management, collectively referred to as ML artifact management
systems (ML AMSs), provide the functionality and interfaces to
adequately record, store, and manage ML lifecycle artifacts.
4 ASSESSMENT CRITERIA
The goal of this section is to define criteria for the description and
assessment of AMSs. Based on a priori assumptions, we first list
functional and non-functional requirements. We then conduct a
systematic literature review according to Kitchenham et al. [81]: Using
well-defined keywords, we search ACM DL, DBLP, IEEE Xplore,
and SpringerLink for academic publications as well as Google and
Google Scholar for web pages, articles, white papers, technical re-
ports, reference lists, source code repositories, and documentations.
Next, we perform the publication selection based on the relevance
for answering our research questions. To avoid overlooking relevant
literature, we perform one iteration of backward snowballing [171].
Finally, we iteratively extract assessment criteria and subcriteria,
criteria categories, as well as the functional and non-functional
properties of concrete systems and platforms based on concept
matrices. The results are shown in Tab. 1, which outlines categories,
criteria (italicized), and subcriteria (in square brackets).
Lifecycle Integration. This category describes for which parts of
the ML lifecycle a system provides artifact collection and manage-
ment capabilities. The four stages form the criteria, with the steps
assigned to each stage forming the subcriteria (cf. § 3.1).
Artifact Support. Orthogonal to the previous category, this cate-
gory indicates which types of artifacts are supported and managed
by an AMS. Based on the discussion in § 3.2, we distinguish between
the criteria Data-related, Model, Metadata, and Software Artifacts.
The criteria Data-related Artifacts and Model Artifacts represent
core resources that are either input, output, or both for a lifecycle
step. Data-related Artifacts are datasets (used for training, validation,
and testing), annotations and labels, and features (cf. correspond-
ing subcriteria). Model Artifacts are represented by trained models
(subcriterion Model).
The criteria Metadata Artifacts and Software Artifacts represent
the corresponding artifact types that enable the reproducibility
and traceability of individual ML lifecycle steps and their results.
The criterion Metadata Artifacts covers different types of metadata:
(i) identification metadata (e. g. identifier, name, type of dataset
or model, association with groups, experiments, pipelines, etc.);
(ii) data-related metadata; (iii) model-related metadata, such as
inspectable model parameters (e. g. weights and biases), model hy-
perparameters (e. g. number of hidden layers, learning rate, batch
size, or dropout), and model quality & performance metrics (e. g.
accuracy, F1-score, or AUC score); (iv) experiments and projects,
which are abstractions to capture data processing or model training
runs and to group related artifacts in a reproducible and compa-
rable way; (v) pipelines, which are abstractions to execute entire
ML workflows in an automated fashion and relate the input and
output artifacts required per step as well as the glue code required
for processing; (vi) execution-related logs & statistics.
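To make these metadata types more tangible, the following sketch captures identification metadata, hyperparameters, performance metrics, and an execution log within an experiment run, using MLflow's tracking API as one representative system; experiment name, parameters, and values are purely illustrative.

```python
# Illustrative sketch: capturing identification metadata, hyperparameters,
# performance metrics, and an execution log with MLflow's tracking API.
# Experiment name, parameters, and values are placeholders.
import mlflow

mlflow.set_experiment("churn-prediction")           # (iv) experiment grouping

with mlflow.start_run(run_name="baseline-rf") as run:
    mlflow.set_tag("dataset", "customers-v3")       # (i) identification metadata
    mlflow.log_param("n_estimators", 200)           # (iii) model hyperparameters
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("accuracy", 0.87)             # (iii) performance metrics
    mlflow.log_metric("f1_score", 0.81)

    with open("train.log", "w") as f:               # (vi) execution logs
        f.write("epoch 1: loss=0.42\n")
    mlflow.log_artifact("train.log")

    print("run id:", run.info.run_id)
```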
The criterion Software Artifacts comprises source code and note-
books, e. g. for data processing, experimentation and model training,
and serving, as well as configurations and execution-related envi-
ronment dependencies and containers, e. g. Conda environments,
Docker containers, or virtual machines.
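A minimal way to capture the Python-level part of such environment dependencies as an artifact is sketched below using importlib.metadata; real systems typically rely on richer descriptions such as Conda environment exports, Dockerfiles, or full container images. The file name is an illustrative assumption.

```python
# Sketch: recording the installed Python packages as a requirements-style
# environment artifact; real AMSs typically capture richer descriptions
# (Conda environment files, Dockerfiles, container images).
from importlib.metadata import distributions

def snapshot_environment(path: str = "environment-snapshot.txt") -> None:
    # One line per installed distribution, pinned to the exact version.
    lines = sorted(f"{d.metadata['Name']}=={d.version}" for d in distributions())
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

snapshot_environment()
```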
Operations. This category indicates the operations provided by
an AMS for handling and managing ML artifacts. It comprises
the criteria Logging & Versioning, Exploration, Management, and
Collaboration.
The criterion Logging & Versioning represents any operations
that enable logging or capturing single artifacts (subcriterion Log/
Capture), creating checkpoints of a project or an experiment com-
prising several artifacts (subcriterion Commit), and reverting or
rolling back to an earlier committed or snapshot version (subcrite-
rion Revert/Rollback).
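As an illustration of these three subcriteria, the sketch below drives DVC together with Git, one common tool combination for artifact versioning, from Python; the dataset path, commit message, and commit reference are placeholders.

```python
# Sketch of Log/Capture, Commit, and Revert/Rollback with DVC + Git as one
# common tool combination; dataset path, commit message, and the commit
# reference <earlier-commit> are placeholders.
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Log/Capture: put a dataset under DVC control (creates data/train.csv.dvc).
run("dvc", "add", "data/train.csv")

# Commit: checkpoint the captured artifact pointer together with the code.
run("git", "add", "data/train.csv.dvc", "data/.gitignore")
run("git", "commit", "-m", "Capture training data v2")

# Revert/Rollback: restore an earlier committed version of the artifact.
run("git", "checkout", "<earlier-commit>", "--", "data/train.csv.dvc")
run("dvc", "checkout", "data/train.csv.dvc")
```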
The criterion Exploration includes any operations that help to
gain concrete insights into the results of data processing pipelines,
experiments, model training results, or monitoring statistics. These
operations are differentiated by the subcriteria Query, Compare,
Lineage, Provenance, and Visualize. Query operations may be repre-
sented by simple searching and listing functionality, more advanced
filtering functionality (e. g. based on model performance metrics), or
a comprehensive query language. Compare indicates the presence
of operations for the comparison between two or more versions of
artifacts. In terms of model artifacts, this operation may be used to
select the most promising model from a set of candidates (model se-
lection), either in model training and development [122] or in model
serving (e. g. best performing predictor) [33]. Lineage represents
any operations for tracing the lineage of artifacts, i. e. which input
artifacts led to which output artifacts, and thus provide information
about the history of a model, dataset, or project. Provenance repre-
sents any operations that additionally provide information about
which concrete transformations and processes converted inputs
into an output. Visualize indicates the presence of functionality for
graphical representation of model architectures, pipelines, model
metrics, or experimentation results.
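For instance, Query and Compare operations for model selection could look roughly as follows when realized with MLflow's search interface; the experiment name, metric, and threshold are illustrative assumptions.

```python
# Illustrative Query/Compare sketch: filter logged runs by a performance
# metric and rank them to support model selection, using MLflow's search
# interface. Experiment name, metric, and threshold are placeholders.
import mlflow

runs = mlflow.search_runs(
    experiment_names=["churn-prediction"],
    filter_string="metrics.accuracy > 0.85",
    order_by=["metrics.accuracy DESC"],
)

# search_runs returns a pandas DataFrame; the top row is the best candidate.
if not runs.empty:
    best = runs.iloc[0]
    print("best run:", best["run_id"], "accuracy:", best["metrics.accuracy"])
```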
The criterion Management characterizes operations for handling
and using stored artifacts. The subcriteria Modify and Delete indi-
cate operations for modifying or deleting logged and already stored