Trustworthy clinical AI solutions: a unified
review of uncertainty quantification in deep
learning models for medical image analysis
Benjamin Lambert1,3, Florence Forbes2, Alan Tucholka3, Senan Doyle3, Harmonie Dehaene3 and Michel Dojat1
1 Univ. Grenoble Alpes, Inserm, U1216, Grenoble Institut des Neurosciences, Grenoble, 38000, France
2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble, 38000, France
3 Pixyl Research and Development Laboratory, Grenoble, 38000, France
Abstract
The full acceptance of Deep Learning (DL) models in the clinical field remains low with respect to the quantity of high-performing solutions reported in the literature. In particular, end users are reluctant to rely on the raw predictions of DL models. Uncertainty quantification methods have been proposed as a potential answer to qualify the decisions provided by the DL black box and thus increase the interpretability and acceptability of the results by the final user. In this review, we propose an overview of the existing methods to quantify the uncertainty associated with DL predictions. We focus on applications to medical image analysis, which present specific challenges due to the high dimensionality of images and their variable quality, as well as constraints associated with real-life clinical routine. We then discuss the evaluation protocols used to validate the relevance of uncertainty estimates. Finally, we highlight the open challenges of uncertainty quantification in the medical field.
1 Introduction
In recent years, many Deep Learning (DL) medical applications have been proposed for the automatic analysis of various imaging modalities, including Magnetic Resonance Imaging (MRI), Computed Tomography (CT), Ultrasound (US) or histopathological images (see Puttagunta and Ravi (2021) for a review). To be accepted and routinely used by clinicians, however, these algorithms must
provide robust and trustworthy predictions. This is of particular importance in the context of clinical applications, where the automated prediction may have a direct impact on patient care. Yet, DL models are often considered and used as black boxes, due to the absence of clear decision rules, as well as to the lack of reliable confidence estimates associated with their predictions (Guo et al., 2017). Additionally, DL models have proved to be overconfident in their predictions on outlier data (Nguyen et al., 2015), and very sensitive to adversarial attacks (Ma et al., 2021), which suggests a global lack of robustness of this type of model. Due to these limitations, detecting failures or inconsistencies produced by DL models is complex, raising concerns regarding the reliability and safety of using these algorithms in clinical practice (Ford et al., 2016). To tackle this essential aspect, several research directions have emerged in order to mitigate the "black-box issue", including Explainable Artificial Intelligence (XAI) and Uncertainty Quantification (UQ). XAI methods (Arrieta et al., 2020) aim to explain the prediction of the DL model in a way that is understandable to humans. In the context of medical image analysis, examples of XAI approaches include the computation of saliency maps highlighting the image features identified as relevant by the DL model, or example-based explanations consisting in the presentation of cases similar to the one considered, e.g., medical images of patients with the same condition (van der Velden et al., 2022). However, concerns have been raised about the fidelity and intelligibility of the explanations provided by XAI methods, which may give the misleading impression of a better understanding of the black box (Adebayo et al., 2018; Rudin, 2019).
On the other hand, UQ methods (Abdar et al., 2021a) were developed to quantify the predictive uncertainty of a given DL model. Enhancing an automated prediction with an estimate of its confidence has numerous benefits. First, it allows the identification of uncertain samples that need human review. In a medical setting, this is particularly crucial to prevent silent errors that may lead to inaccurate diagnosis or treatment. Second, it enables the identification of the model's pitfalls. For example, unconfident predictions can indicate an incomplete training dataset. This gives insights regarding the knowledge captured by the model, and can be used to extend the training set with supplementary data, if needed. High uncertainty can also reveal anomalies within the input data, which is critical for Quality Control (QC). Overall, UQ increases trust in the algorithm and facilitates the interaction between the algorithm and the user. Moreover, UQ benefits from strong theoretical foundations and has emerged, from the clinical point of view, as one of the expected properties of a deployed AI algorithm (Tonekaboni et al., 2019). As a result, the medical-imaging community is becoming increasingly interested in incorporating UQ into image processing pipelines in order to highlight model failures or weaknesses. In this work, we propose a comprehensive overview of such UQ integration into medical image processing pipelines.
1.1 Research Outline
Several review articles focusing on uncertainty in DL can be found in the literature. In Abdar et al. (2021a), the authors propose a complete review of UQ methods, as well as their various concrete applications. Hüllermeier and Waegeman (2021) focus their article on the definition of the two main categories of uncertainty, namely aleatoric and epistemic uncertainties, in the context of machine learning applications. In Gawlikowski et al. (2021), insights about the various sources of uncertainty are presented. Reviews focusing on Bayesian DL (Jospin et al., 2022; Wang and Yeung, 2020) and prediction intervals (Kabir et al., 2018) have also been published. More recently, Zhou et al. (2022) present a review of the latest advances in epistemic uncertainty quantification in DL from the perspective of generalization error. While these various works propose a complete overview of UQ methods in DL from a general point of view, we have noticed a lack of reviews focusing on medical image processing applications, where being able to correctly identify the confidence of the model is crucial. Kurz et al. (2022) presented a first work in this direction, using a corpus of 22 papers. Their study, however, is restricted to medical image classification. With the present review, we propose to extend the latter by presenting a complete review of 130 peer-reviewed papers implementing UQ in supervised DL-based pipelines, for both medical image classification and segmentation. We also aim at providing an in-depth discussion of the evaluation procedures for UQ methods, as well as pointing out the challenges of the field and potential future directions. Our review differs from previously published ones through the following contributions:
• A review of UQ methods dedicated to DL-based medical image classification and segmentation.
• A focus on the proposed metrics for the evaluation of uncertainty estimates.
• A discussion of the current challenges and limitations of UQ for medical image analysis, and suggestions of future work directions.
1.2 Organization of this Review
The remainder of this report is divided into four sections. Section 2 introduces the key concepts addressed in this study, namely the application of DL models to medical image classification and segmentation (subsection 2.2), as well as the main notions of UQ (subsection 2.3). Section 3 presents the most popular UQ methods applied in the context of medical image analysis. Section 4 then focuses on the evaluation procedures that can be implemented to assess the usefulness of uncertainty estimates. Finally, Section 5 proposes a discussion of the current challenges and gaps in the literature in the field of UQ for DL medical image processing.
2 Framework
2.1 Problem setting
In this work, we focus on supervised learning approaches. In this classical setting, the goal of the DL algorithm is to learn a task T based on a training dataset composed of pairs of input images x and their associated ground truth y. This target represents a class in the context of classification (e.g., healthy, pathological), whereas it consists of a mask for segmentation tasks (e.g., the manual delineation of tumors). By observing multiple examples of pairs of images and their corresponding labels during training, the learning agent estimates the mapping function p(y|x) from the data.
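To make this setting explicit, the training dataset and the learned predictor can be written as follows (the notation below, in particular $\mathcal{D}$, $f_\theta$ and $\mathcal{L}$, is introduced here purely for illustration):
\[
\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}, \qquad
f_\theta : x \mapsto \hat{p}(y \mid x), \qquad
\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_\theta(x_i), y_i\big),
\]
where $f_\theta$ denotes the network with parameters $\theta$ and $\mathcal{L}$ is a loss function measuring the discrepancy between the prediction and the ground truth (see subsection 2.2).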
2.2 Deep Learning for medical image analysis
The common approach for supervised DL medical image processing is the training of a Convolutional Neural Network (CNN) using an annotated dataset (i.e., with ground truth). The building block of CNNs is the convolutional layer, which convolves the input data with kernels of learnable weights. This enables the extraction of features within the image, while being relatively insensitive to their position, scale and shape.
For medical image classification, popular convolutional architectures comprise Residual and Dense CNNs (Huang et al., 2017) or EfficientNets (Tan and Le, 2019). These architectures consist of a succession of convolutional layers that extract features from the image at different scales while reducing its size, and thus its spatial resolution. For medical image segmentation, popular choices include U-Net (Ronneberger et al., 2015) and its variants, such as Residual U-Net (Kerfoot et al., 2018), V-Net (Milletari et al., 2016), Attention U-Net (Oktay et al., 2018) or Dynamic U-Net (Isensee et al., 2021). These segmentation models are composed of two branches, an encoder and a decoder, forming the U shape. The encoder compresses the dimension of the input image, while the decoder decompresses the signal until it recovers its original size. Between the two modules, skip connections are usually added so that the features learned in the encoder part can be used to generate the segmentation in the decoder part. As medical images can be either 2-dimensional (e.g., 2D CT, Optical Coherence Tomography (OCT), microscopy or colonoscopy) or 3-dimensional (e.g., MRI, 3D CT, PET), CNNs can be implemented in 2D or 3D.
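To make this encoder-decoder structure concrete, the following minimal PyTorch sketch shows a toy 2D segmentation network with a single skip connection. It is an illustration only (the class name, layer sizes and depth are arbitrary) and does not correspond to any of the published architectures cited above.

```python
# Minimal 2D encoder-decoder with one skip connection (toy sketch, not a
# published architecture). Assumes PyTorch is installed.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        # Encoder: extract features, then halve the spatial resolution
        self.enc = nn.Sequential(nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # Decoder: recover the original resolution
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        # After the skip connection, encoder and decoder features are fused
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)  # one score per class and per pixel

    def forward(self, x):
        e = self.enc(x)                      # encoder features, full resolution
        b = self.bottleneck(self.down(e))    # compressed representation
        d = self.up(b)                       # back to full resolution
        d = self.dec(torch.cat([d, e], 1))   # skip connection: concatenate features
        return self.head(d)                  # per-pixel class scores (logits)

logits = TinyUNet()(torch.randn(1, 1, 64, 64))  # shape: (1, num_classes, 64, 64)
```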
During the supervised training stage, the CNN uses images from the training set to produce predictions, which are compared to the ground truth targets in order to estimate the error of the model. To do so, a loss function is introduced to estimate the discrepancy between predicted and true labels. Standard choices for both image classification and segmentation include the cross-entropy loss or the focal loss (Lin et al., 2020). For segmentation tasks, specific loss functions can also be used, such as the popular Dice loss (Milletari et al., 2016) and its variants (Generalized Dice loss (Fidon et al., 2017) or Tversky loss (Salehi et al., 2017)).
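As an illustration, a simplified soft Dice loss for binary segmentation could be sketched as follows; this minimal version (the function name is ours) assumes predicted foreground probabilities and a binary ground-truth mask, and is not taken from the cited implementations.

```python
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    """Simplified soft Dice loss for binary segmentation (illustrative sketch).

    probs:  predicted foreground probabilities, shape (batch, H, W), in [0, 1]
    target: binary ground-truth mask, same shape
    """
    intersection = (probs * target).sum(dim=(1, 2))
    denominator = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice.mean()  # 0 means perfect overlap; lower is better
```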
In the context of medical image classification, CNNs provide a categorical
probability distribution over the different observable classes, by applying a softmax function to the model's output. The final assigned class corresponds to the one having the highest probability. The same process is applied for medical image segmentation, except that the CNN predicts one class per pixel or voxel. UQ aims at complementing these predictions with uncertainty estimates, allowing a better interpretation of the results with respect to the model's confidence. In the following section, the main concepts of uncertainty are introduced.
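Before introducing these concepts, the following minimal sketch (our own illustration, not a method from the reviewed papers) shows how per-pixel logits are turned into class probabilities, a predicted segmentation, and a simple entropy-based uncertainty map:

```python
import torch
import torch.nn.functional as F

# logits: raw per-pixel scores of shape (batch, num_classes, H, W),
# e.g. the output of the toy segmentation network sketched earlier
logits = torch.randn(1, 2, 64, 64)

probs = F.softmax(logits, dim=1)      # categorical distribution per pixel
prediction = probs.argmax(dim=1)      # class with the highest probability
# Shannon entropy per pixel: a simple confidence-derived uncertainty estimate
entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
```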
2.3 The specific language of uncertainty
Predictive uncertainty, meaning the uncertainty associated with the prediction of a DL model, is typically divided into two parts: model (or epistemic) and data (or aleatoric) uncertainty.
Epistemic uncertainty describes the uncertainty arising from the lack of knowledge about the perfect predictor for the current input (Hüllermeier and Waegeman, 2021). In complex scenarios, there is often not a single model, but rather a multitude of models that can explain the observed data (Gal et al., 2016). Thus, uncertainty arises regarding the choice of the model parameters. Epistemic uncertainty is considered to be reducible, meaning that it can be reduced by using additional data. In practice, epistemic uncertainty is expected to be high for images far from the training data distribution (referred to as out-of-distribution (OOD) samples). Such a discrepancy between test and training datasets is frequent in medical image analysis, where there may be significant variations between images acquired at different hospitals or using different machines. Additionally, unexpected patterns can be encountered in test images, such as diseases not seen during training, or artifacts. Popular approaches to improve the generalizability of models to unseen domains include data augmentation (Chen et al., 2020; Ouyang et al., 2021; Zhang et al., 2020) or transfer learning (Ghafoorian et al., 2017).
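As a rough illustration of this intuition (a toy sketch, not one of the UQ methods reviewed in Section 3), the disagreement between several independently trained models can serve as a proxy for epistemic uncertainty; the function name below is ours:

```python
import torch

def ensemble_disagreement(models, x):
    """Toy proxy for epistemic uncertainty: variance of the predicted
    class probabilities across an ensemble of independently trained models.

    models: iterable of networks sharing the same output shape
    x:      input batch
    """
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=1) for m in models])
    # High variance across members signals inputs on which models trained on
    # the same data disagree, as expected for out-of-distribution samples.
    return probs.var(dim=0).mean(dim=1)
```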
Aleatoric uncertainty describes intrinsic noise and random effects within the data (Hüllermeier and Waegeman, 2021). It is not intrinsic to the model, but is rather a property of the underlying generative distribution of the data. In the context of classification or segmentation, aleatoric uncertainty increases when the number of classes is high and when these classes are fine-grained (Malinin, 2019). Aleatoric uncertainty is considered to be irreducible, meaning that it cannot be reduced with more data. In practice, the only way to diminish aleatoric uncertainty would be to increase the precision of the measurement system in order to reduce the noise that corrupts the dataset (Gal et al., 2016). Finally, aleatoric uncertainty can be further split into two categories: homoscedastic uncertainty, which is identical for each sample of the dataset, and heteroscedastic uncertainty, which depends on the query input.
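A commonly used way to model heteroscedastic aleatoric uncertainty in regression, recalled here purely for illustration (the notation is ours and is not specific to the works reviewed below), is to let the network predict both a mean $\hat{y}_\theta(x)$ and an input-dependent variance $\hat{\sigma}^2_\theta(x)$, trained with the Gaussian negative log-likelihood
\[
\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}
\frac{\lVert y_i - \hat{y}_\theta(x_i) \rVert^2}{2\,\hat{\sigma}^2_\theta(x_i)}
+ \frac{1}{2}\log \hat{\sigma}^2_\theta(x_i),
\]
where the first term down-weights the error on inputs predicted to be noisy, and the second term penalizes predicting an arbitrarily large variance everywhere.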
Lastly, closely linked to this notion of data uncertainty, the notion of label uncertainty was introduced for segmentation tasks. It has been observed that inter-rater variability in the manual delineation of medical images is important (Becker et al., 2019; Joskowicz et al., 2019). This has a direct impact on the model's overall uncertainty, as the same object of interest may be delineated differently depending on the rater.