
cases”), only conjectures have so far been produced to esti-
mate its extent. Nonetheless, IV has two strong implications
for ML applications. First, ML models trained on data af-
fected by IV, even highly accurate ones, can fail to be robust
and properly generalize not only to new patients, but also to
the same patients observed in slightly different conditions:
for example, a healthy patient could indeed be classified
as healthy with respect to the features actually observed for
them, while they could have been classified as non-healthy
for a slightly different set of feature values, which
nevertheless would still be fully compatible with the
distribution due to IV.¹ Second, unlike distribution-related
variation, collecting additional data samples, which has been
considered a primary factor in the continued improvement
of ML systems, can help only marginally in reducing the
impact of IV, unless specific study designs are adopted that
capture multiple observations for each individual
over time (Aarsand et al. 2018; Bartlett et al. 2015).
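The first implication can be illustrated with a minimal sketch. It assumes, purely for illustration, that IV acts as zero-mean Gaussian noise on a single feature and that the classifier is a fixed threshold rule; the names `flip_rate` and `threshold_classifier`, the cut-off of 100 and the standard deviation of 5 are hypothetical and are not taken from the experiments in this paper.

```python
import numpy as np

def flip_rate(predict, x, sd, n_draws=1000, seed=0):
    """Fraction of IV-compatible copies of x whose predicted label
    differs from the label predicted for x itself.

    Assumption: IV is modelled as zero-mean Gaussian noise with
    standard deviation `sd` (scalar or one value per feature).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Draw perturbed versions of the same patient.
    draws = x + rng.normal(0.0, sd, size=(n_draws, x.size))
    base_label = predict(x[None, :])[0]
    return float(np.mean(predict(draws) != base_label))

def threshold_classifier(X):
    # Hypothetical rule: label 1 (non-healthy) if the first
    # feature exceeds 100, otherwise label 0 (healthy).
    return (X[:, 0] > 100.0).astype(int)

if __name__ == "__main__":
    # A patient observed just below the cut-off is classified as
    # healthy, yet many IV-compatible observations would not be.
    print(flip_rate(threshold_classifier, [99.0], sd=5.0))
    # A patient far from the cut-off is unaffected.
    print(flip_rate(threshold_classifier, [50.0], sd=5.0))
```

Under these assumptions, a patient observed close to the decision boundary changes label in a sizeable share of the draws, even though every draw is a plausible observation of the same individual.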
Despite these apparently relevant characteristics, the phe-
nomenon of IV has largely been overlooked in the ML liter-
ature: indeed, while recent works have started to apply ML
techniques to analyze IV data, for example to cluster patients
based on their IV profiles (Carobene et al. 2021) or to pro-
vide Bayesian models for IV (Aarsand et al. 2021), to our
knowledge, no previous work has investigated the impact of
IV on ML systems, or possible techniques to improve
robustness and manage this source of perturbation.
In this article, we attempt to bridge this gap in the special-
ized literature, by addressing two main research problems.
Accordingly, this paper will consist of two parts. In the first
part, we will address the research question “can individual
variation significantly affect the accuracy, and hence the
robustness, of a machine learning model on a diagnostic task
based on laboratory medicine data?” (H1). Due to the
pervasiveness of individual variation, confirming this hypothesis could
suggest that most ML models could be seriously affected
by lack of robustness on real-world and external data. To
this aim, we will apply a biologically-grounded, generative
model to simulate the effects of IV on data, and we will show
how commonly used classes of ML models fail to be robust
to it. The second part of the paper
will aim to build on the rubble left by the first part, and it
will test the hypothesis that more advanced learning
and regularization methods (based on either data
augmentation (Van Dyk and Meng 2001) or data
imprecisiation (Lienen and Hüllermeier 2021b)) achieve increased
robustness in the face of the same perturbations (H2).
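As a rough illustration of the data-augmentation idea mentioned for H2, the sketch below enlarges a training set with noisy copies of each instance. It is not the method evaluated in this paper: the function name `augment_with_iv`, the Gaussian noise model and the parameters `sd` and `n_copies` are assumptions made only for this example.

```python
import numpy as np

def augment_with_iv(X, y, sd, n_copies=5, seed=0):
    """Return the original data followed by `n_copies` perturbed
    replicas of it, each sharing the labels of the original rows.

    Assumption: IV is approximated by zero-mean Gaussian noise with
    standard deviation `sd` (scalar or one value per feature).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Each replica perturbs every instance independently.
    noisy = [X + rng.normal(0.0, sd, size=X.shape) for _ in range(n_copies)]
    X_aug = np.concatenate([X, *noisy], axis=0)
    # Labels are repeated block-wise to match the row order of X_aug.
    y_aug = np.tile(y, n_copies + 1)
    return X_aug, y_aug

if __name__ == "__main__":
    X = np.array([[90.0, 1.2], [110.0, 0.8], [101.0, 1.0]])
    y = np.array([0, 1, 1])
    X_aug, y_aug = augment_with_iv(X, y, sd=[5.0, 0.1], n_copies=4)
    print(X_aug.shape, y_aug.shape)
```

Any classifier can then be fitted on `X_aug` and `y_aug` in place of the original data, so that it is exposed during training to observations that differ from the recorded ones only by the assumed IV noise.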
Background and Methods
As discussed in the previous section, the aim of this arti-
cle is to evaluate and address the potential impact of IV
on ML models’ robustness. In this section, we first provide
basic background on IV, its importance in clinical settings,
and methods to compute it. Then, in the next sections, we
will describe two experiments: in the first experiment,
we evaluate how commonly used ML models fare
when dealing with data affected by IV; in the second
experiment, we evaluate the application of more advanced
ML approaches to improve robustness to IV.
¹ As we show in the following, this setting is a generalization
of the usual one adopted in ML theory (Shalev-Shwartz and
Ben-David 2014): not only do we assume that the best model
could have less than perfect accuracy, but we also assume that
any instance is represented as a distribution of vectors
possibly lying on opposite sides of the decision boundary.
Individual Variation in Medical Data
IV is considered one of the most important sources of uncer-
tainty in clinical data (Plebani, Padoan, and Lippi 2015) and
recent research has highlighted the need to take IV properly
into account in any use of medical data (Badrick 2021;
Fröhlich et al. 2018). IV can be understood as encompassing
three main components: pre-analytical variation, analytical
variation (AV) and (within-subject) biological variation (BV)
(Fraser 2001; Plebani, Padoan, and Lippi 2015).
Pre-analytical variation denotes uncertainty due to
patients’ preparation (e.g., fasting, physical activity, use of
medication) or sample management (including collection,
transport, storage and treatment) (Ellervik and Vaught
2015); it is usually understood that pre-analytical variation can
be controlled by means of careful laboratory practice (Fraser
2001). AV, by contrast, describes the irreducible uncertainty
that is inherent in every measurement technique,
and is characterized by both a random component (i.e.,
variance, that is, the degree of agreement between consecutive
measurements taken with the same instrument) and a systematic
component (i.e., bias, that is, the differences in values
reported by two different measurement instruments). Finally,
BV describes the uncertainty arising from the fact that fea-
tures or biomarkers can change through time, contributing
to a variance in outcomes from the same individual that is
independent of other forms of variation.
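To make the roles of these components concrete, the following sketch generates repeated measurements for one individual. It is only a toy model under stated assumptions: biological variation is taken to be Gaussian fluctuation around a fixed set point, the random analytical component is Gaussian noise on each replicate, and the systematic component is a constant offset. The function `simulate_measurements` and all of its parameters are hypothetical, and pre-analytical variation is assumed to be fully controlled.

```python
import numpy as np

def simulate_measurements(set_point, m, k, sd_bv, sd_av, bias=0.0, seed=0):
    """Simulate k replicate measurements at each of m time steps.

    Assumptions (illustrative only):
      - BV: the true value at each time step is the individual's
        set point plus Gaussian noise with standard deviation sd_bv.
      - AV, random component: each replicate adds independent
        Gaussian noise with standard deviation sd_av.
      - AV, systematic component: a constant instrument bias.
    Returns an array of shape (m, k).
    """
    rng = np.random.default_rng(seed)
    true_values = set_point + rng.normal(0.0, sd_bv, size=m)
    replicate_noise = rng.normal(0.0, sd_av, size=(m, k))
    return true_values[:, None] + bias + replicate_noise

if __name__ == "__main__":
    obs = simulate_measurements(100.0, m=5, k=2, sd_bv=5.0, sd_av=1.0)
    # Rows are time steps, columns are replicates: values within a
    # row differ only because of AV, rows differ because of BV.
    print(obs.round(1))
```

In this toy model, the spread within a row reflects only the random analytical component, while the spread across rows also includes biological variation; the bias shifts every value by the same amount and therefore does not affect either spread.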
As already mentioned, IV can influence the interpretation
and analysis of any clinical data: for this reason, quantifying
IV, also in terms of its components, is of critical importance.
However, collecting reliable data about IV is not an easy
task (Carobene et al. 2018; Haeckel, Carobene, and Wos-
niok 2021). To this aim, standardized methodologies have
recently been proposed (Aarsand et al. 2018; Bartlett et al.
2015): intuitively, IV can be estimated (Aarsand et al. 2021;
Carobene et al. 2018; Røraas, Petersen, and Sandberg 2012)
by means of controlled experimental studies that monitor
reference individuals² (Carobene et al. 2016) by collecting
multiple samples over time.
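As a rough sketch of how data from such a study could be turned into variance estimates (the formal definitions follow in the text), the code below assumes that the measurements of one feature are stored in an array of shape (n patients, m time steps, k replicates). The function name `estimate_variation`, the use of the sample variance (`ddof=1`) and the averaging of the per-time-step analytical variance over time steps are assumptions of this example rather than details taken from the cited protocols.

```python
import numpy as np

def estimate_variation(x, ddof=1):
    """Estimate per-patient IV and AV for one feature.

    x: array of shape (n_patients, m_time_steps, k_replicates), k >= 2.
    Returns:
      iv          -- shape (n,): variance of all values of each patient
      av_per_step -- shape (n, m): variance of the replicates taken
                     for each patient at each time step
      av          -- shape (n,): mean of av_per_step over time steps
                     (an assumption made for this sketch)
    """
    x = np.asarray(x, dtype=float)
    n, m, k = x.shape
    if k < 2:
        raise ValueError("at least k = 2 replicates per time step are required")
    iv = x.reshape(n, m * k).var(axis=1, ddof=ddof)
    av_per_step = x.var(axis=2, ddof=ddof)
    av = av_per_step.mean(axis=1)
    return iv, av_per_step, av

if __name__ == "__main__":
    # One patient, two time steps, two replicates per time step.
    x = np.array([[[1.0, 3.0], [5.0, 7.0]]])
    iv, av_per_step, av = estimate_variation(x)
    print(iv, av_per_step, av)
```

With at least two replicates per time step, the within-time-step variance isolates the analytical component, which is why the study design requires repeated measurements.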
² The term reference individual denotes an individual that, for
some reason, can be considered representative of the population
of interest (e.g., healthy patients).
Formally speaking, let us assume that a given feature
of interest $x$ has been monitored in $n$ patients for $m$ time
steps. At each time step, $k \geq 2$ repeated measurements
should be performed, so as to determine the AV component
of IV. Then, the IV of feature $x$, for patient $i$, is
estimated as $IV_i(x) = \mathrm{Variance}(x^i)$, while the AV
component is defined as $AV_i(x) = \mathrm{Variance}(x^i_s)$, where $x^i$
denotes the collection of values of $x$ for patient $i$, and $x^i_s$
denotes the collection of values of $x$ for patient $i$ at the $s$-