Everything is Varied: The Surprising Impact of Individual Variation on ML
Robustness in Medicine
Andrea Campagner1, Lorenzo Famiglini1, Anna Carobene3, Federico Cabitza1,2
1Dipartimento di Informatica, Sistemistica e Comunicazione, University of Milano-Bicocca, Milano, Italy
2IRCCS Istituto Ortopedico Galeazzi, Milano, Italy
3IRCCS Ospedale San Raffaele, Milano, Italy
Abstract
In medical settings, Individual Variation (IV) refers to variation that is due not to population differences or errors, but rather to within-subject variation, that is, the intrinsic and characteristic patterns of variation pertaining to a given instance or the measurement process. While taking into account IV has been deemed critical for proper analysis of medical data, this source of uncertainty and its impact on robustness have so far been neglected in Machine Learning (ML). To fill this gap, we look at how IV affects ML performance and generalization and how its impact can be mitigated. Specifically, we provide a methodological contribution to formalize the problem of IV in the statistical learning framework and, through an experiment based on one of the largest real-world laboratory medicine datasets for the problem of COVID-19 diagnosis, we show that: 1) common state-of-the-art ML models are severely impacted by the presence of IV in data; and 2) advanced learning strategies, based on data augmentation and data imprecisiation, and proper study designs can be effective at improving robustness to IV. Our findings demonstrate the critical relevance of correctly accounting for IV to enable safe deployment of ML in clinical settings.
Introduction
In recent years, the interest in the application of Machine Learning (ML) methods and systems to the development of decision support systems in clinical settings has been steadily increasing (Benjamens, Dhunnoo, and Meskó 2020). This interest has been mainly driven by the promising results obtained and reported by these systems in academic research for different tasks (Aggarwal et al. 2021; Yin, Ngiam, and Teo 2021).
Despite these promising results, the adoption of ML-based systems in real-world clinical settings has been lagging behind (Wilkinson et al. 2020), with these systems often failing to meet the expectations and requirements needed for safe deployment in clinical settings (Andaur Navarro et al. 2021; Futoma et al. 2020), a problem that has been termed the last mile of implementation (Coiera 2019). While the reasons behind the gaps in this “last mile” are numerous, among them we recall the inability of ML systems to reliably generalize to new contexts and settings (Beam, Manrai, and Ghassemi 2020; Christodoulou et al. 2019), as well as their lack of robustness and susceptibility to variation in data, leading to poorer performance in real settings (Li et al. 2019) and, ultimately, to what has been called the replication crisis of ML in medicine (Coiera et al. 2018).

Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
In the ML literature, the notion of variation has usually been associated with variance in the population data distribution (that is, as it relates either to the larger reference population, or to a sample taken from the latter), due to the presence of outliers or anomalies (Akoglu 2021), out-of-distribution instances (Adila and Kang 2022; Morteza and Li 2022) or concept/covariate shifts and drifts (Liu et al. 2021; Rabanser, Günnemann, and Lipton 2019). While these forms of variation are certainly relevant, they are not the only ones that can arise in real-world settings: indeed, another source of variation in data is the so-called individual variation (IV) (Fraser 2001), which is especially common in laboratory data and other physiological signals and biomarkers, and more generally in every phenomenon whose manifestations can exhibit time-varying patterns.
IV denotes variation that is due not to between-subject (population) differences or errors, but rather to the intrinsic and characteristic (that is, individual) patterns of variation pertaining to single instances, that is, to within-subject variation (Plebani, Padoan, and Lippi 2015); more specifically, it relates to two possible sources of variation: either the feature values for a given subject or patient, what is called biological variation (BV) (Plebani, Padoan, and Lippi 2015); or the measurement process and instrument itself, i.e., what is called analytical variation (AV). The presence of IV entails (Badrick 2021) that for each individual one can identify a “subject average” or central tendency (homeostatic point (Fraser 2001)) arising from such factors as personal characteristics of the individuals themselves (e.g., genetic characteristics, age, phenotypic elements such as diet and physical activity) or of the measurement instrument (e.g., instrument calibration), as well as a distribution of possible values, whose uncertainty is represented by the extent of the IV: crucially, only a snapshot (i.e., a sample) from this distribution can be accessed at any moment.
While the potential impact of IV on computer-supported diagnosis has been known for a while (for instance, in (Spodick and Bishop 1997) the authors reported that “computer interpretations of electrocardiograms recorded 1 minute apart were significantly (grossly) different in 4 of 10 cases”), only conjectures have so far been produced to estimate its extent. Nonetheless, IV has two strong implications for ML applications. First, ML models trained on data affected by IV, even highly accurate ones, can fail to be robust and properly generalize not only to new patients, but also to the same patients observed in slightly different conditions: for example, a healthy patient could indeed be classified as healthy with respect to the features actually observed for them, while they could have been classified as non-healthy for a slightly different set of feature values, which nevertheless would still be totally compatible with the distribution due to IV¹. Second, differently from distribution-related variation, collecting additional data samples, which has been considered a primary factor in the continued improvement of ML systems, can help only marginally in reducing the impact of IV, unless specific study designs are adopted that allow capturing multiple observations for each individual across time (Aarsand et al. 2018; Bartlett et al. 2015).
Despite these apparently relevant characteristics, the phenomenon of IV has largely been overlooked in the ML literature: indeed, while recent works have started to apply ML techniques to analyze IV data, for example to cluster patients based on their IV profiles (Carobene et al. 2021) or to provide Bayesian models for IV (Aarsand et al. 2021), to our knowledge no previous work has investigated the impact of IV on ML systems, nor possible techniques to improve robustness and manage this source of perturbations.
In this article, we attempt to bridge this gap in the specialized literature by addressing two main research problems. To this aim, the paper consists of two parts: in the first part we address the research question “can individual variation significantly affect the accuracy, and hence the robustness, of a machine learning model on a diagnostic task grounded in laboratory medicine data?” (H1). Due to the pervasiveness of individual variation, proving this hypothesis would suggest that most ML models could be seriously affected by lack of robustness on real-world and external data. To this aim, we apply a biologically-grounded, generative model to simulate the effects of IV on data, and we show how commonly used classes of ML models fail to be robust to it. The second part of the paper, on the other hand, builds on the rubble left by the first part, addressing the hypothesis of whether more advanced learning and regularization methods (grounded in either data augmentation (Van Dyk and Meng 2001) or data imprecisiation (Lienen and Hüllermeier 2021b)) achieve increased robustness in the face of the same perturbations (H2).
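To give a concrete sense of the first of these strategies, the sketch below (our own minimal construction, not the exact procedure evaluated in this paper; the feature values and CVT coefficients are purely hypothetical) performs IV-based data augmentation: each observed instance is treated as one realization of the patient's IV-induced Gaussian, and additional training copies are resampled from that estimated distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

def iv_augment(X, cvt, copies=5, rng=rng):
    """Return IV-perturbed copies of each row of X.

    Each observed instance x is treated as a draw from the patient's
    IV-induced Gaussian with per-feature standard deviation cvt[j] * x[j];
    we resample `copies` new realizations per instance.
    """
    sd = np.abs(X)[None, :, :] * np.asarray(cvt)[None, None, :]
    noise = rng.standard_normal((copies, *X.shape)) * sd
    return (X[None, :, :] + noise).reshape(-1, X.shape[1])

# Hypothetical blood-exam features (e.g., a cell count and platelets)
# with illustrative CVT values of 5% and 10%.
X = np.array([[4.5, 250.0],
              [6.1, 180.0],
              [5.2, 310.0]])
X_aug = iv_augment(X, cvt=[0.05, 0.10], copies=100)
print(X_aug.shape)  # (300, 2)
```

Since the copies are stacked copy-major, for supervised training the label vector is simply repeated to match, e.g. `np.tile(y, copies)`.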
Background and Methods
As discussed in the previous section, the aim of this article is to evaluate and address the potential impact of IV on ML models’ robustness. In this section, we first provide basic background on IV, its importance in clinical settings, and methods to compute it. Then, in the next sections, we describe two different experiments: in the first experiment, we evaluate how commonly used ML models fare when dealing with data affected by IV; in the second experiment, we evaluate the application of more advanced ML approaches to improve robustness to IV.

¹As we show in the following, this setting is a generalization of the usual one adopted in ML theory (Shalev-Shwartz and Ben-David 2014): not only do we assume that the best model could have less than perfect accuracy, but we also assume that any instance is represented as a distribution of vectors possibly lying on opposite sides of the decision boundary.
Individual Variation in Medical Data
IV is considered one of the most important sources of uncertainty in clinical data (Plebani, Padoan, and Lippi 2015), and recent research has highlighted the need to take IV properly into account in any use of medical data (Badrick 2021; Fröhlich et al. 2018). IV can be understood as encompassing three main components: pre-analytical variation, analytical variation and (within-subject) biological variation (Fraser 2001; Plebani, Padoan, and Lippi 2015).
Pre-analytical variation denotes uncertainty due to patients’ preparation (e.g., fasting, physical activity, use of medicaments) or sample management (including collection, transport, storage and treatment) (Ellervik and Vaught 2015); it is usually understood that pre-analytical variation can be controlled by means of careful laboratory practice (Fraser 2001). AV, by contrast, describes the ineliminable uncertainty inherent to every measurement technique, characterized by both a random component (i.e., variance, that is, the degree of agreement between consecutive measurements taken with the same instrument) and a systematic component (i.e., bias, that is, the difference in values reported by two different measurement instruments). Finally, BV describes the uncertainty arising from the fact that features or biomarkers can change through time, contributing to a variance in outcomes from the same individual that is independent of other forms of variation.
As already mentioned, IV can influence the interpretation and analysis of any clinical data: for this reason, quantifying IV, also in terms of its components, is of critical importance. However, collecting reliable data about IV is not an easy task (Carobene et al. 2018; Haeckel, Carobene, and Wosniok 2021). To this aim, standardized methodologies have recently been proposed (Aarsand et al. 2018; Bartlett et al. 2015): intuitively, IV can be estimated (Aarsand et al. 2021; Carobene et al. 2018; Røraas, Petersen, and Sandberg 2012) by means of controlled experimental studies that monitor reference individuals² (Carobene et al. 2016) by collecting multiple samples over time.
Formally speaking, let us assume that a given feature of interest $x$ has been monitored in $n$ patients for $m$ time steps. At each time step, $k \geq 2$ repeated measurements should be performed, so as to determine the AV component of IV. Then, the IV of feature $x$, for patient $i$, is estimated as $IV_i(x) = \mathrm{Variance}(x^i)$, while the AV component is defined as $AV_i(x) = \mathrm{Variance}(x^i_s)$, where $x^i$ denotes the collection of values of $x$ for patient $i$, and $x^i_s$ denotes the collection of values of $x$ for patient $i$ at the $s$-th time step. Then, the BV component of IV is computed as $BV_i(x) = \sqrt{IV_i(x)^2 - AV_i(x)^2}$. Usually, IV, AV and BV are expressed in percent terms, defining the so-called coefficients of individual (resp., analytical, biological) variation, that is, $CVT_i(x) = \frac{IV_i(x)}{\hat{x}^i}$, $CVA_i(x) = \frac{AV_i(x)}{\hat{x}^i}$ and $CVI_i(x) = \frac{BV_i(x)}{\hat{x}^i}$. The overall variations, finally, can be computed as the averages of the coefficients of variation across the population of patients. The value of CVT, for a given set of features $x = (x_1, \dots, x_d)$, can then be used to model the uncertainty about the observations obtained for any given patient $i$: indeed, any patient $i$, as a consequence of the uncertainty due to IV, can be represented by a $d$-dimensional Gaussian $N^i(x^i, \Sigma^i)$, where $x^i$ is a $d$-dimensional vector giving the characteristic representation of patient $i$, called the value at the homeostatic point, and $\Sigma^i$ is the diagonal covariance matrix given by $\Sigma^i_{j,j} = CVT(x_j)\, x^i_j$ (Fraser 2001). More generally, having observed a realization $\hat{x}^i$ of $N^i(x^i, \Sigma^i)$ for patient $i$, its distribution can be estimated as $N^i(\hat{x}^i, \hat{\Sigma}^i)$, where $\hat{\Sigma}^i_{j,j} = CVT(x_j)\, \hat{x}^i_j$.

²The term reference individual denotes an individual that, for some reason, can be considered representative of the population of interest (e.g., healthy patients).
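To make the estimation procedure concrete, the following sketch (our own illustration; the study dimensions and ground-truth CV values are hypothetical, not taken from EuBIVAS or the BVD) simulates a nested design with $n$ patients, $m$ time steps and $k = 2$ replicates per sample, and recovers the coefficients of variation from it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nested study: n patients, m time steps, k replicate
# measurements per sample; ground-truth CVI = 5%, CVA = 2%.
n, m, k = 50, 20, 2
homeostatic = 100.0
cvi_true, cva_true = 0.05, 0.02

# Biological value at each time step, then analytical noise on each replicate.
bio = homeostatic * (1 + cvi_true * rng.standard_normal((n, m)))
x = bio[..., None] * (1 + cva_true * rng.standard_normal((n, m, k)))

# AV: spread among replicates of the same sample, averaged per patient.
av = np.sqrt(x.var(axis=2, ddof=1).mean(axis=1))
# IV: total spread of all measurements of a patient.
iv = x.reshape(n, -1).std(axis=1, ddof=1)
# BV via the relation IV^2 = AV^2 + BV^2.
bv = np.sqrt(np.maximum(iv**2 - av**2, 0.0))

xhat = x.reshape(n, -1).mean(axis=1)   # estimate of the homeostatic point
cvt, cva, cvi = iv / xhat, av / xhat, bv / xhat
print(round(cva.mean(), 3), round(cvi.mean(), 3), round(cvt.mean(), 3))
```

Averaging the per-patient coefficients then yields the population-level CVA, CVI and CVT, which should be close to the simulated 2% and 5% (with CVT slightly above CVI, as it also absorbs the analytical component).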
Due to the complexity of designing studies to obtain reliable IV estimates, only a few compiled sources of IV data, for healthy patients, are available: the largest existing repositories in this sense are the data originating from the European Biological Variation Study (EuBIVAS) and the Biological Variation Database (BVD) (Aarsand et al. 2020; Sandberg, Carobene, and Aarsand 2022), both encompassing data about commonly used laboratory biomarkers. In the following sections, we will rely on data available from these sources in the definition of our experiments.
Individual Variation and Statistical Learning
One of the simplest yet most remarkable results in Statistical Learning Theory (SLT) is the error decomposition theorem (Shalev-Shwartz and Ben-David 2014) (also called the bias-variance tradeoff, or bias-complexity tradeoff), which states that the true risk $L_D(h)$ of a function $h$ from a family $H$ w.r.t. a distribution $D$ on the instance space $Z = X \times Y$ can be decomposed as:

$$L_D(h) = \epsilon_{Bayes} + \epsilon_{Bias} + \epsilon_{Est} \quad (1)$$

where $\epsilon_{Bayes} = \min_{f \in F} L_D(f)$ is the Bayes error, i.e., the minimum error achievable by any measurable function; $\epsilon_{Bias} = \min_{h' \in H} L_D(h') - \min_{f \in F} L_D(f)$ is the bias, i.e., the gap between the Bayes error and the minimum error achievable in class $H$; and $\epsilon_{Est} = L_D(h) - \min_{h' \in H} L_D(h')$ is the estimation error, i.e., the gap between the error achieved by $h$ and the minimum error achievable in $H$.
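Note that the decomposition is a telescoping identity: substituting the definitions of the three terms, the intermediate minima cancel and the equality holds for any $h \in H$:

```latex
L_D(h)
= \underbrace{\min_{f \in F} L_D(f)}_{\text{Bayes}}
+ \underbrace{\Big[\min_{h' \in H} L_D(h') - \min_{f \in F} L_D(f)\Big]}_{\text{Bias}}
+ \underbrace{\Big[L_D(h) - \min_{h' \in H} L_D(h')\Big]}_{\text{Est}}
```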
A striking consequence of IV for ML tasks regards a generalization of the error decomposition theorem, due to the impossibility of accessing the true distributional-valued representation of instances, but only a sample drawn from the respective distributions. To formalize this notion, as in the previous section, denote with $f_i = N(x_i, \Sigma_i)$ the distributional representation due to IV for instance $i$. Then, the learning task can be formalized through the definition of a random measure (Herlau, Schmidt, and Mørup 2016) $\eta$ defined over the Borel $\sigma$-algebra $(Z, B)$ on the instance space $Z = X \times Y$, which associates to each instance $(x, y)$ a probability measure $N(x, \Sigma) \times \delta_y$, where $\delta_y$ is the Dirac measure at $y \in Y$. A training set $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$ is then obtained by first sampling $m$ random measures $f_1, \dots, f_m$ from $\eta^m$, and then, for each $i$, by sampling a random element $(x_i, y_i) \sim f_i$. Then, the IV-induced generalization of the error decomposition theorem can be formulated as:

$$L_\eta(h) = \epsilon^\eta_{Bayes} + \epsilon^\eta_{Bias} + \epsilon^\eta_{Est} + \epsilon^\eta_{IV} \quad (2)$$

Indeed, the true error of $h$ w.r.t. $\eta$ can be expressed as $L_\eta(h) = \mathbb{E}_{F \sim \eta^m}\left[\frac{1}{m}\sum_{f_i \in F} \mathbb{E}_{(x_i, y_i) \sim f_i}\, l(h, (x_i, y_i))\right]$. Letting $D$ be the probability measure over $X \times Y$ obtained as the intensity measure (Kallenberg 2017) of $\eta$, and $L_D(h) = \mathbb{E}_{S \sim D^m} L_S(h)$ be the expected error of $h$ w.r.t. the sampling of a training set $S$ from the product measure $D^m$, then the above expression can be derived by setting $\epsilon^\eta_{Bayes} = \min_{f \in F} L_\eta(f)$, $\epsilon^\eta_{Bias} = \min_{h' \in H} L_\eta(h') - \min_{f \in F} L_\eta(f)$, $\epsilon^\eta_{Est} = L_D(h) - \min_{h' \in H} L_\eta(h')$, and $\epsilon^\eta_{IV} = \mathbb{E}_{F \sim \eta^m,\, S \sim D}\left[\frac{1}{m}\sum_i \mathbb{E}_{(x_i, y_i) \sim f_i}\left[l(h, (x_i, y_i)) - l(h, (x'_i, y'_i))\right]\right]$.
Thus, compared with Eq. (1), Eq. (2) includes an additional error term $\epsilon^\eta_{IV}$, which measures the gap in performance due to the inability to use the IV-induced distributional representation of the instances, relying instead on only a single instantiation of such distributions. This aspect is also reflected in the estimation error component, in which the reference $\min_{h' \in H} L_\eta(h')$ is compared not with the true error $L_\eta(h)$ but rather with the expected error over all possible instantiations, $L_D(h)$. In the following sections, we will show, through an experimental study, that the impact of IV can be significant and lead to an overestimation of any ML algorithm’s performance and robustness.
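The practical meaning of the IV term can be illustrated with a small simulation (our own toy construction, entirely separate from the experiment described below): a 1-nearest-neighbor classifier is perfect on the exact snapshots it memorized, yet loses accuracy when the very same patients are simply re-measured under IV.

```python
import numpy as np

rng = np.random.default_rng(7)

# Homeostatic points for 300 synthetic patients over 5 features; the label is
# fixed by the true point, but the classifier only ever sees noisy snapshots.
n, d, cvt = 300, 5, 0.10
mu = rng.uniform(1.0, 2.0, size=(n, d))
y = (mu.sum(axis=1) > 7.5).astype(int)
X_train = rng.normal(mu, cvt * mu)        # one snapshot per patient

def one_nn(X_query):
    # Predict with the label of the nearest memorized training snapshot.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y[d2.argmin(axis=1)]

acc_snapshot = (one_nn(X_train) == y).mean()   # evaluated on the same draws
X_redraw = rng.normal(mu, cvt * mu)            # same patients, new realization
acc_redraw = (one_nn(X_redraw) == y).mean()
print(acc_snapshot, acc_redraw)
```

The gap between the two accuracies is a Monte Carlo glimpse of the IV term: performance measured on the observed snapshots overestimates performance on fresh realizations of the same underlying distributions.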
Measuring the Impact of Individual Variation on Machine Learning Models

In order to study whether and how the performance of an ML model could be impacted by IV, we designed an experiment through which we evaluated several commonly adopted ML models in the task of COVID-19 diagnosis from routine laboratory blood exams, using a public benchmark dataset. Aside from its practical relevance (Cabitza et al. 2021), we selected this task for three additional reasons. First, blood exams are considered one of the most stable panels of exams (Coskun et al. 2020): this allows us to evaluate the impact of IV in a conservative scenario where the features of interest are affected by relatively low levels of variability. Second, validated data about IV for healthy patients who underwent blood exams are available in the specialized literature (Buoro et al. 2017a,b, 2018), and these exams have high predictive power for the task of COVID-19 diagnosis (Chen et al. 2021b). Third, the selected dataset was associated with a companion longitudinal study (authors a) that has been used to estimate IV data for the COVID-19 positive patients: we believe this to be particularly relevant since, even though IV data are available for healthy patients, no information of this kind is usually available for non-healthy patients, due to the complexity of designing studies for the