Everything is Varied: The Surprising Impact of Individual Variation on ML
Robustness in Medicine
Andrea Campagner1, Lorenzo Famiglini1, Anna Carobene3, Federico Cabitza1,2
1Dipartimento di Informatica, Sistemistica e Comunicazione, University of Milano-Bicocca, Milano, Italy
2IRCCS Istituto Ortopedico Galeazzi, Milano, Italy
3IRCCS Ospedale San Raffaele, Milano, Italy
Abstract
In medical settings, Individual Variation (IV) refers to variation that is due not to population differences or errors, but rather to within-subject variation, that is, the intrinsic and characteristic patterns of variation pertaining to a given instance or the measurement process. While taking into account IV has been deemed critical for proper analysis of medical data, this source of uncertainty and its impact on robustness have so far been neglected in Machine Learning (ML). To fill this gap, we look at how IV affects ML performance and generalization and how its impact can be mitigated. Specifically, we provide a methodological contribution to formalize the problem of IV in the statistical learning framework and, through an experiment based on one of the largest real-world laboratory medicine datasets for the problem of COVID-19 diagnosis, we show that: 1) common state-of-the-art ML models are severely impacted by the presence of IV in data; and 2) advanced learning strategies, based on data augmentation and data imprecisiation, and proper study designs can be effective at improving robustness to IV. Our findings demonstrate the critical relevance of correctly accounting for IV to enable safe deployment of ML in clinical settings.
Introduction
In recent years, the interest in the application of Machine Learning (ML) methods and systems to the development of decision support systems in clinical settings has been steadily increasing (Benjamens, Dhunnoo, and Meskó 2020). This interest has been mainly driven by the promising results obtained and reported by these systems in academic research for different tasks (Aggarwal et al. 2021; Yin, Ngiam, and Teo 2021).
Despite these promising results, the adoption of ML-based systems in real-world clinical settings has been lagging behind (Wilkinson et al. 2020), with these systems often failing to meet the expectations and requirements needed for safe deployment in clinical settings (Andaur Navarro et al. 2021; Futoma et al. 2020), a problem that has been termed the last mile of implementation (Coiera 2019). While the reasons behind the gaps in this “last mile” are numerous, among them we recall the inability of ML systems to reliably generalize to new contexts and settings (Beam, Manrai, and Ghassemi 2020; Christodoulou et al. 2019), as well as their lack of robustness and susceptibility to variation in data, leading to poorer performance in real settings (Li et al. 2019) and, ultimately, to what has been called the replication crisis of ML in medicine (Coiera et al. 2018).

Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
In the ML literature, the notion of variation has usually been associated with variance in the population data distribution (that is, as it relates either to the larger reference population, or to a sample taken from the latter), due to the presence of outliers or anomalies (Akoglu 2021), out-of-distribution instances (Adila and Kang 2022; Morteza and Li 2022) or concept/covariate shifts and drifts (Liu et al. 2021; Rabanser, Günnemann, and Lipton 2019). While these forms of variation are certainly relevant, they are not the only ones that can arise in real-world settings: indeed, another source of variation in data is the so-called individual variation (IV) (Fraser 2001), which is especially common in laboratory data and other physiological signals and biomarkers, and more generally in every phenomenon whose manifestations can exhibit time-varying patterns.
IV denotes variation that is due not to between-subject (population) differences or errors, but rather to the intrinsic and characteristic (that is, individual) patterns of variation pertaining to single instances, that is, to within-subject variation (Plebani, Padoan, and Lippi 2015); more specifically, it relates to two possible sources of variation: either the feature values for a given subject or patient, what is called biological variation (BV) (Plebani, Padoan, and Lippi 2015); or the measurement process and instrument itself, i.e., what is called analytical variation (AV). The presence of IV entails (Badrick 2021) that for each individual one can identify a “subject average” or central tendency (homeostatic point (Fraser 2001)) arising from such factors as personal characteristics of the individuals themselves (e.g., genetic characteristics, age, phenotypic elements such as diet and physical activity) or of the measurement instrument (e.g., instrument calibration), as well as a distribution of possible values, whose uncertainty is represented by the extent of the IV: crucially, only a snapshot (i.e., a sample) from this distribution can be accessed at any moment.
While the potential impact of IV on computer-supported diagnosis has been known for a while (for instance, in (Spodick and Bishop 1997) the authors reported that “computer interpretations of electrocardiograms recorded 1 minute apart were significantly (grossly) different in 4 of 10 cases”), only conjectures have so far been produced to estimate its extent. Nonetheless, IV has two strong implications for ML applications. First, ML models trained on data affected by IV, even highly accurate ones, can fail to be robust and properly generalize not only to new patients, but also to the same patients observed in slightly different conditions: for example, a healthy patient could indeed be classified as healthy with respect to the features actually observed for them, while they could have been classified as non-healthy for a slightly different set of feature values, which nevertheless would still be totally compatible with the distribution due to IV¹. Second, differently from distribution-related variation, collecting additional data samples, which has been considered a primary factor in the continued improvement of ML systems, can help only marginally in reducing the impact of IV, unless specific study designs are adopted that allow capturing multiple observations for each individual across time (Aarsand et al. 2018; Bartlett et al. 2015).
Despite these apparently relevant characteristics, the phenomenon of IV has largely been overlooked in the ML literature: indeed, while recent works have started to apply ML techniques to analyze IV data, for example to cluster patients based on their IV profiles (Carobene et al. 2021) or to provide Bayesian models for IV (Aarsand et al. 2021), to our knowledge no previous work has investigated the impact of IV on ML systems, nor possible techniques to improve robustness and manage this source of perturbations.
In this article, we attempt to bridge this gap in the specialized literature by addressing two main research problems. To this aim, the paper consists of two parts: in the first part we address the research question “can individual variation significantly affect the accuracy, and hence the robustness, of a machine learning model on a diagnostic task grounded in laboratory medicine data?” (H1). Due to the pervasiveness of individual variation, proving this hypothesis would suggest that most ML models could be seriously affected by lack of robustness on real-world and external data. To this aim, we apply a biologically-grounded, generative model to simulate the effects of IV on data, and we show how commonly used classes of ML models fail to be robust to it. The second part of the paper, on the other hand, builds on the rubble left by the first part, addressing the hypothesis of whether more advanced learning and regularization methods (grounded in either data augmentation (Van Dyk and Meng 2001) or data imprecisiation (Lienen and Hüllermeier 2021b)) achieve increased robustness in the face of the same perturbations (H2).
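To give a concrete sense of the first of these strategies, the sketch below (our own minimal construction, not the exact procedure evaluated in this paper; the feature values and CVT coefficients are purely hypothetical) performs IV-based data augmentation: each observed instance is treated as one realization of the patient's IV-induced Gaussian, and additional training copies are resampled from that estimated distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

def iv_augment(X, cvt, copies=5, rng=rng):
    """Return IV-perturbed copies of each row of X.

    Each observed instance x is treated as a draw from the patient's
    IV-induced Gaussian with per-feature standard deviation cvt[j] * x[j];
    we resample `copies` new realizations per instance.
    """
    sd = np.abs(X)[None, :, :] * np.asarray(cvt)[None, None, :]
    noise = rng.standard_normal((copies, *X.shape)) * sd
    return (X[None, :, :] + noise).reshape(-1, X.shape[1])

# Hypothetical blood-exam features (e.g., a cell count and platelets)
# with illustrative CVT values of 5% and 10%.
X = np.array([[4.5, 250.0],
              [6.1, 180.0],
              [5.2, 310.0]])
X_aug = iv_augment(X, cvt=[0.05, 0.10], copies=100)
print(X_aug.shape)  # (300, 2)
```

Since the copies are stacked copy-major, for supervised training the label vector is simply repeated to match, e.g. `np.tile(y, copies)`.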
Background and Methods
As discussed in the previous section, the aim of this article is to evaluate and address the potential impact of IV on ML models’ robustness. In this section, we first provide basic background on IV, its importance in clinical settings, and methods to compute it. Then, in the next sections, we describe two different experiments: in the first experiment, we evaluate how commonly used ML models fare when dealing with data affected by IV; in the second experiment, we evaluate the application of more advanced ML approaches to improve robustness to IV.

¹As we show in the following, this setting is a generalization of the usual one adopted in ML theory (Shalev-Shwartz and Ben-David 2014): not only do we assume that the best model could have less than perfect accuracy, but we also assume that any instance is represented as a distribution of vectors possibly lying on opposite sides of the decision boundary.
Individual Variation in Medical Data
IV is considered one of the most important sources of uncertainty in clinical data (Plebani, Padoan, and Lippi 2015), and recent research has highlighted the need to take IV properly into account in any use of medical data (Badrick 2021; Fröhlich et al. 2018). IV can be understood as encompassing three main components: pre-analytical variation, analytical variation and (within-subject) biological variation (Fraser 2001; Plebani, Padoan, and Lippi 2015).
Pre-analytical variation denotes uncertainty due to patients’ preparation (e.g., fasting, physical activity, use of medicaments) or sample management (including collection, transport, storage and treatment) (Ellervik and Vaught 2015); it is usually understood that pre-analytical variation can be controlled by means of careful laboratory practice (Fraser 2001). AV, by contrast, describes the ineliminable uncertainty inherent to every measurement technique, characterized by both a random component (i.e., variance, that is, the degree of agreement between consecutive measurements taken with the same instrument) and a systematic component (i.e., bias, that is, the difference in values reported by two different measurement instruments). Finally, BV describes the uncertainty arising from the fact that features or biomarkers can change through time, contributing to a variance in outcomes from the same individual that is independent of other forms of variation.
As already mentioned, IV can influence the interpretation and analysis of any clinical data: for this reason, quantifying IV, also in terms of its components, is of critical importance. However, collecting reliable data about IV is not an easy task (Carobene et al. 2018; Haeckel, Carobene, and Wosniok 2021). To this aim, standardized methodologies have recently been proposed (Aarsand et al. 2018; Bartlett et al. 2015): intuitively, IV can be estimated (Aarsand et al. 2021; Carobene et al. 2018; Røraas, Petersen, and Sandberg 2012) by means of controlled experimental studies that monitor reference individuals² (Carobene et al. 2016) by collecting multiple samples over time.
Formally speaking, let us assume that a given feature of interest $x$ has been monitored in $n$ patients for $m$ time steps. At each time step, $k \geq 2$ repeated measurements should be performed, so as to determine the AV component of IV. Then, the IV of feature $x$, for patient $i$, is estimated as $IV_i(x) = \mathrm{Variance}(x^i)$, while the AV component is defined as $AV_i(x) = \mathrm{Variance}(x^i_s)$, where $x^i$ denotes the collection of values of $x$ for patient $i$, and $x^i_s$ denotes the collection of values of $x$ for patient $i$ at the $s$-th time step. Then, the BV component of IV is computed as $BV_i(x) = \sqrt{IV_i(x)^2 - AV_i(x)^2}$. Usually, IV, AV and BV are expressed in percent terms, defining the so-called coefficients of individual (resp., analytical, biological) variation, that is, $CVT_i(x) = \frac{IV_i(x)}{\hat{x}^i}$, $CVA_i(x) = \frac{AV_i(x)}{\hat{x}^i}$ and $CVI_i(x) = \frac{BV_i(x)}{\hat{x}^i}$. The overall variations, finally, can be computed as the averages of the coefficients of variation across the population of patients. The value of CVT, for a given set of features $x = (x_1, \dots, x_d)$, can then be used to model the uncertainty about the observations obtained for any given patient $i$: indeed, any patient $i$, as a consequence of the uncertainty due to IV, can be represented by a $d$-dimensional Gaussian $N^i(x^i, \Sigma^i)$, where $x^i$ is a $d$-dimensional vector giving the characteristic representation of patient $i$, called the value at the homeostatic point, and $\Sigma^i$ is the diagonal covariance matrix given by $\Sigma^i_{j,j} = CVT(x_j)\, x^i_j$ (Fraser 2001). More generally, having observed a realization $\hat{x}^i$ of $N^i(x^i, \Sigma^i)$ for patient $i$, its distribution can be estimated as $N^i(\hat{x}^i, \hat{\Sigma}^i)$, where $\hat{\Sigma}^i_{j,j} = CVT(x_j)\, \hat{x}^i_j$.

²The term reference individual denotes an individual that, for some reason, can be considered representative of the population of interest (e.g., healthy patients).
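To make the estimation procedure concrete, the following sketch (our own illustration; the study dimensions and ground-truth CV values are hypothetical, not taken from EuBIVAS or the BVD) simulates a nested design with $n$ patients, $m$ time steps and $k = 2$ replicates per sample, and recovers the coefficients of variation from it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nested study: n patients, m time steps, k replicate
# measurements per sample; ground-truth CVI = 5%, CVA = 2%.
n, m, k = 50, 20, 2
homeostatic = 100.0
cvi_true, cva_true = 0.05, 0.02

# Biological value at each time step, then analytical noise on each replicate.
bio = homeostatic * (1 + cvi_true * rng.standard_normal((n, m)))
x = bio[..., None] * (1 + cva_true * rng.standard_normal((n, m, k)))

# AV: spread among replicates of the same sample, averaged per patient.
av = np.sqrt(x.var(axis=2, ddof=1).mean(axis=1))
# IV: total spread of all measurements of a patient.
iv = x.reshape(n, -1).std(axis=1, ddof=1)
# BV via the relation IV^2 = AV^2 + BV^2.
bv = np.sqrt(np.maximum(iv**2 - av**2, 0.0))

xhat = x.reshape(n, -1).mean(axis=1)   # estimate of the homeostatic point
cvt, cva, cvi = iv / xhat, av / xhat, bv / xhat
print(round(cva.mean(), 3), round(cvi.mean(), 3), round(cvt.mean(), 3))
```

Averaging the per-patient coefficients then yields the population-level CVA, CVI and CVT, which should be close to the simulated 2% and 5% (with CVT slightly above CVI, as it also absorbs the analytical component).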
Due to the complexity of designing studies to obtain reliable IV estimates, only a few compiled sources of IV data, for healthy patients, are available: the largest existing repositories in this sense are the data originating from the European Biological Variation Study (EuBIVAS) and the Biological Variation Database (BVD) (Aarsand et al. 2020; Sandberg, Carobene, and Aarsand 2022), both encompassing data about commonly used laboratory biomarkers. In the following sections, we will rely on data available from these sources in the definition of our experiments.
Individual Variation and Statistical Learning
One of the simplest yet most remarkable results in Statistical Learning Theory (SLT) is the error decomposition theorem (Shalev-Shwartz and Ben-David 2014) (also called the bias-variance tradeoff, or bias-complexity tradeoff), which states that the true risk $L_D(h)$ of a function $h$ from a family $H$ w.r.t. a distribution $D$ on the instance space $Z = X \times Y$ can be decomposed as:

$$L_D(h) = \epsilon_{Bayes} + \epsilon_{Bias} + \epsilon_{Est} \quad (1)$$

where $\epsilon_{Bayes} = \min_{f \in F} L_D(f)$ is the Bayes error, i.e., the minimum error achievable by any measurable function; $\epsilon_{Bias} = \min_{h' \in H} L_D(h') - \min_{f \in F} L_D(f)$ is the bias, i.e., the gap between the Bayes error and the minimum error achievable in class $H$; and $\epsilon_{Est} = L_D(h) - \min_{h' \in H} L_D(h')$ is the estimation error, i.e., the gap between the error achieved by $h$ and the minimum error achievable in $H$.
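Note that the decomposition is a telescoping identity: substituting the definitions of the three terms, the intermediate minima cancel and the equality holds for any $h \in H$:

```latex
L_D(h)
= \underbrace{\min_{f \in F} L_D(f)}_{\text{Bayes}}
+ \underbrace{\Big[\min_{h' \in H} L_D(h') - \min_{f \in F} L_D(f)\Big]}_{\text{Bias}}
+ \underbrace{\Big[L_D(h) - \min_{h' \in H} L_D(h')\Big]}_{\text{Est}}
```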
A striking consequence of IV for ML tasks regards a generalization of the error decomposition theorem, due to the impossibility of accessing the true distributional-valued representation of instances, but only a sample drawn from the respective distributions. To formalize this notion, as in the previous section, denote with $f_i = N(x_i, \Sigma_i)$ the distributional representation due to IV for instance $i$. Then, the learning task can be formalized through the definition of a random measure (Herlau, Schmidt, and Mørup 2016) $\eta$ defined over the Borel $\sigma$-algebra $(Z, B)$ on the instance space $Z = X \times Y$, which associates to each instance $(x, y)$ a probability measure $N(x, \Sigma) \times \delta_y$, where $\delta_y$ is the Dirac measure at $y \in Y$. A training set $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$ is then obtained by first sampling $m$ random measures $f_1, \dots, f_m$ from $\eta^m$, and then, for each $i$, by sampling a random element $(x_i, y_i) \sim f_i$. Then, the IV-induced generalization of the error decomposition theorem can be formulated as:

$$L_\eta(h) = \epsilon^\eta_{Bayes} + \epsilon^\eta_{Bias} + \epsilon^\eta_{Est} + \epsilon^\eta_{IV} \quad (2)$$

Indeed, the true error of $h$ w.r.t. $\eta$ can be expressed as $L_\eta(h) = \mathbb{E}_{F \sim \eta^m}\left[\frac{1}{m}\sum_{f_i \in F} \mathbb{E}_{(x_i, y_i) \sim f_i}\, l(h, (x_i, y_i))\right]$. Letting $D$ be the probability measure over $X \times Y$ obtained as the intensity measure (Kallenberg 2017) of $\eta$, and $L_D(h) = \mathbb{E}_{S \sim D^m} L_S(h)$ be the expected error of $h$ w.r.t. the sampling of a training set $S$ from the product measure $D^m$, then the above expression can be derived by setting $\epsilon^\eta_{Bayes} = \min_{f \in F} L_\eta(f)$, $\epsilon^\eta_{Bias} = \min_{h' \in H} L_\eta(h') - \min_{f \in F} L_\eta(f)$, $\epsilon^\eta_{Est} = L_D(h) - \min_{h' \in H} L_\eta(h')$, and $\epsilon^\eta_{IV} = \mathbb{E}_{F \sim \eta^m,\, S \sim D}\left[\frac{1}{m}\sum_i \mathbb{E}_{(x_i, y_i) \sim f_i}\left[l(h, (x_i, y_i)) - l(h, (x'_i, y'_i))\right]\right]$.
Thus, compared with Eq. (1), Eq. (2) includes an additional error term $\epsilon^\eta_{IV}$, which measures the gap in performance due to the inability to use the IV-induced distributional representation of the instances, relying instead on only a single instantiation of such distributions. This aspect is also reflected in the estimation error component, in which the reference $\min_{h' \in H} L_\eta(h')$ is compared not with the true error $L_\eta(h)$ but rather with the expected error over all possible instantiations, $L_D(h)$. In the following sections, we will show, through an experimental study, that the impact of IV can be significant and lead to an overestimation of any ML algorithm’s performance and robustness.
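The practical meaning of the IV term can be illustrated with a small simulation (our own toy construction, entirely separate from the experiment described below): a 1-nearest-neighbor classifier is perfect on the exact snapshots it memorized, yet loses accuracy when the very same patients are simply re-measured under IV.

```python
import numpy as np

rng = np.random.default_rng(7)

# Homeostatic points for 300 synthetic patients over 5 features; the label is
# fixed by the true point, but the classifier only ever sees noisy snapshots.
n, d, cvt = 300, 5, 0.10
mu = rng.uniform(1.0, 2.0, size=(n, d))
y = (mu.sum(axis=1) > 7.5).astype(int)
X_train = rng.normal(mu, cvt * mu)        # one snapshot per patient

def one_nn(X_query):
    # Predict with the label of the nearest memorized training snapshot.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y[d2.argmin(axis=1)]

acc_snapshot = (one_nn(X_train) == y).mean()   # evaluated on the same draws
X_redraw = rng.normal(mu, cvt * mu)            # same patients, new realization
acc_redraw = (one_nn(X_redraw) == y).mean()
print(acc_snapshot, acc_redraw)
```

The gap between the two accuracies is a Monte Carlo glimpse of the IV term: performance measured on the observed snapshots overestimates performance on fresh realizations of the same underlying distributions.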
Measuring the Impact of Individual Variation on Machine Learning Models

In order to study whether and how the performance of an ML model could be impacted by IV, we designed an experiment through which we evaluated several commonly adopted ML models in the task of COVID-19 diagnosis from routine laboratory blood exams, using a public benchmark dataset. Aside from its practical relevance (Cabitza et al. 2021), we selected this task for three additional reasons. First, blood exams are considered one of the most stable panels of exams (Coskun et al. 2020): this allows us to evaluate the impact of IV in a conservative scenario where the features of interest are affected by relatively low levels of variability. Second, validated data about IV for healthy patients who underwent blood exams are available in the specialized literature (Buoro et al. 2017a,b, 2018), and these exams have high predictive power for the task of COVID-19 diagnosis (Chen et al. 2021b). Third, the selected dataset was associated with a companion longitudinal study (authors a) that has been used to estimate IV data for the COVID-19 positive patients: we believe this to be particularly relevant since, even though IV data are available for healthy patients, no information of this kind is usually available for non-healthy patients, due to the complexity of designing studies for the