Trustworthy clinical AI solutions: a unified
review of uncertainty quantification in deep
learning models for medical image analysis
Benjamin Lambert1,3, Florence Forbes2, Alan Tucholka3, Senan Doyle3, Harmonie Dehaene3 and Michel Dojat1
1 Univ. Grenoble Alpes, Inserm, U1216, Grenoble Institut des Neurosciences, Grenoble, 38000, France
2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble, 38000, France
3 Pixyl Research and Development Laboratory, Grenoble, 38000, France
Abstract
The full acceptance of Deep Learning (DL) models in the clinical field remains low with respect to the quantity of high-performing solutions reported in the literature. In particular, end users are reluctant to rely on the raw predictions of DL models. Uncertainty quantification methods have been proposed as a potential answer to qualify the decisions provided by the DL black box and thus increase the interpretability and acceptability of the results by the final user. In this review, we propose an overview of the existing methods to quantify the uncertainty associated with DL predictions. We focus on applications to medical image analysis, which present specific challenges due to the high dimensionality of images and their variable quality, as well as constraints associated with real-life clinical routine. We then discuss the evaluation protocols used to validate the relevance of uncertainty estimates. Finally, we highlight the open challenges of uncertainty quantification in the medical field.
1 Introduction
In recent years, many Deep Learning (DL) medical applications have been proposed for the automatic analysis of various imaging modalities, including Magnetic Resonance Imaging (MRI), Computed Tomography (CT), Ultrasound (US) or histopathological images (see Puttagunta and Ravi (2021) for a review). To be accepted and routinely used by clinicians, however, these algorithms must
provide robust and trustworthy predictions. This is of particular importance in the context of clinical applications, where the automated prediction may have a direct impact on patient care. Yet, DL models are often considered and used as black boxes, due to the absence of clear decision rules, as well as to the lack of reliable confidence estimates associated with their predictions (Guo et al., 2017). Additionally, DL models have proved to be overconfident in their predictions on outlier data (Nguyen et al., 2015), and very sensitive to adversarial attacks (Ma et al., 2021), which suggests a global lack of robustness of this type of model. Due to these limitations, detecting failures or inconsistencies produced by DL models is complex, raising concerns regarding the reliability and safety of using these algorithms in clinical practice (Ford et al., 2016). To tackle this essential aspect, several research directions have emerged in order to mitigate the "black-box issue", including Explainable Artificial Intelligence (XAI) and Uncertainty Quantification (UQ). XAI methods (Arrieta et al., 2020) aim to explain the prediction of the DL model in a way that is understandable to humans. In the context of medical image analysis, examples of XAI approaches include the computation of saliency maps highlighting the image features identified as relevant by the DL model, or example-based explanations consisting in the presentation of cases similar to the one considered, e.g., medical images of patients with the same condition (van der Velden et al., 2022). However, concerns have been raised about the fidelity and intelligibility of the explanations provided by XAI methods, which may give the misleading impression of a better understanding of the black box (Adebayo et al., 2018; Rudin, 2019).
On the other hand, UQ methods (Abdar et al., 2021a) were developed to quantify the predictive uncertainty of a given DL model. Enhancing an automated prediction with an estimate of its confidence has numerous benefits. First, it allows the identification of uncertain samples that need human review. In a medical setting, this is particularly crucial to prevent silent errors that may lead to inaccurate diagnosis or treatment. Second, it enables the identification of the model's pitfalls. For example, unconfident predictions can indicate an incomplete training dataset. This gives insights regarding the knowledge captured by the model, and can be used to extend the training set with supplementary data, if needed. High uncertainty can also reveal anomalies within the input data, which is critical for Quality Control (QC). Overall, UQ increases trust in the algorithm and facilitates the interaction between the algorithm and the user. Moreover, UQ benefits from strong theoretical foundations and has emerged, from the clinical point of view, as one of the expected properties of a deployed AI algorithm (Tonekaboni et al., 2019). As a result, the medical-imaging community is becoming increasingly interested in incorporating UQ into image processing pipelines in order to highlight model failures or weaknesses. In this work, we propose a comprehensive overview of such UQ integration into medical image processing pipelines.
1.1 Research Outline
Several review articles focusing on uncertainty in DL can be found in the literature. In Abdar et al. (2021a), the authors propose a complete review of UQ methods, as well as their various concrete applications. Hüllermeier and Waegeman (2021) focus their article on the definition of the two main categories of uncertainty, namely aleatoric and epistemic uncertainties, in the context of machine learning applications. In Gawlikowski et al. (2021), insights about the various sources of uncertainty are presented. Reviews focusing on Bayesian DL (Jospin et al., 2022; Wang and Yeung, 2020) and prediction intervals (Kabir et al., 2018) have also been published. More recently, Zhou et al. (2022) present a review of the latest advances in epistemic uncertainty quantification in DL from the perspective of generalization error. While these various works propose a complete overview of UQ methods in DL from a general point of view, we have noticed a lack of reviews focusing on medical image processing applications, where being able to correctly identify the confidence of the model is crucial. Kurz et al. (2022) presented a first work in this direction, using a corpus of 22 papers. Their study, however, is restricted to medical image classification. With the present review, we propose to extend the latter by presenting a complete review of 130 peer-reviewed papers implementing UQ in supervised DL-based pipelines, for both medical image classification and segmentation. We also aim at providing an in-depth discussion of the evaluation procedures for UQ methods, as well as pointing out the challenges of the field and potential future directions. Our review differs from previously published ones through the following contributions:
• A review of UQ methods dedicated to DL-based medical image classification and segmentation.
• A focus on the proposed metrics for the evaluation of uncertainty estimates.
• A discussion of the current challenges and limitations of UQ for medical image analysis, and suggestions of future work directions.
1.2 Organization of this Review
The remainder of this report is divided into four sections. Section 2 introduces the key concepts addressed in this study, namely the application of DL models to medical image classification and segmentation (subsection 2.2), as well as the main notions of UQ (subsection 2.3). Section 3 presents the most popular UQ methods applied in the context of medical image analysis. Section 4 then focuses on the evaluation procedures that can be implemented to assess the usefulness of uncertainty estimates. Finally, Section 5 proposes a discussion of the current challenges and gaps in the literature in the field of UQ for DL medical image processing.
2 Framework
2.1 Problem setting
In this work, we focus on supervised learning approaches. In this classical setting, the goal of the DL algorithm is to learn a task T based on a training dataset composed of pairs of input images x and their associated ground truth y. This target represents a class in the context of classification (e.g., healthy, pathological), whereas it consists of a mask for segmentation tasks (e.g., the manual delineation of tumors). By observing multiple examples of pairs of images and their corresponding labels during training, the learning agent estimates the mapping function p(y|x) from the data.
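To make this setting explicit, the training dataset and the learned predictor can be written as follows (the notation below, in particular $\mathcal{D}$, $f_\theta$ and $\mathcal{L}$, is introduced here purely for illustration):
\[
\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}, \qquad
f_\theta : x \mapsto \hat{p}(y \mid x), \qquad
\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_\theta(x_i), y_i\big),
\]
where $f_\theta$ denotes the network with parameters $\theta$ and $\mathcal{L}$ is a loss function measuring the discrepancy between the prediction and the ground truth (see subsection 2.2).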
2.2 Deep Learning for medical image analysis
The common approach for supervised DL medical image processing is the training of a Convolutional Neural Network (CNN) using an annotated dataset (i.e., with ground truth). The building block of CNNs is the convolutional layer, which convolves the input data with kernels of learnable weights. This enables the extraction of features within the image, while being relatively insensitive to their position, scale and shape.
For medical image classification, popular convolutional architectures comprise Residual and Dense CNNs (Huang et al., 2017) or EfficientNets (Tan and Le, 2019). These architectures consist of a succession of convolutional layers that extract features from the image at different scales while reducing its size, and thus its spatial resolution. For medical image segmentation, popular choices include U-Net (Ronneberger et al., 2015) and its variants, such as Residual U-Net (Kerfoot et al., 2018), V-Net (Milletari et al., 2016), Attention U-Net (Oktay et al., 2018) or Dynamic U-Net (Isensee et al., 2021). These segmentation models are composed of two branches, an encoder and a decoder, forming the U shape. The encoder compresses the dimension of the input image, while the decoder decompresses the signal until it recovers its original size. Between the two modules, skip connections are usually added so that the features learned in the encoder part can be used to generate the segmentation in the decoder part. As medical images can be either 2-dimensional (e.g., 2D CT, Optical Coherence Tomography (OCT), microscopy or colonoscopy) or 3-dimensional (e.g., MRI, 3D CT, PET), CNNs can be implemented in 2D or 3D.
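To make this encoder-decoder structure concrete, the following minimal PyTorch sketch shows a toy 2D segmentation network with a single skip connection. It is an illustration only (the class name, layer sizes and depth are arbitrary) and does not correspond to any of the published architectures cited above.

```python
# Minimal 2D encoder-decoder with one skip connection (toy sketch, not a
# published architecture). Assumes PyTorch is installed.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        # Encoder: extract features, then halve the spatial resolution
        self.enc = nn.Sequential(nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # Decoder: recover the original resolution
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        # After the skip connection, encoder and decoder features are fused
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)  # one score per class and per pixel

    def forward(self, x):
        e = self.enc(x)                      # encoder features, full resolution
        b = self.bottleneck(self.down(e))    # compressed representation
        d = self.up(b)                       # back to full resolution
        d = self.dec(torch.cat([d, e], 1))   # skip connection: concatenate features
        return self.head(d)                  # per-pixel class scores (logits)

logits = TinyUNet()(torch.randn(1, 1, 64, 64))  # shape: (1, num_classes, 64, 64)
```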
During the supervised training stage, the CNN uses images from the training set to produce predictions, which are compared to the ground truth targets in order to estimate the error of the model. To do so, a loss function is introduced to estimate the discrepancy between predicted and true labels. Standard choices for both image classification and segmentation include the cross-entropy loss or the focal loss (Lin et al., 2020). For segmentation tasks, specific loss functions can also be used, such as the popular Dice loss (Milletari et al., 2016) and its variants (Generalized Dice loss (Fidon et al., 2017) or Tversky loss (Salehi et al., 2017)).
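As an illustration, a simplified soft Dice loss for binary segmentation could be sketched as follows; this minimal version (the function name is ours) assumes predicted foreground probabilities and a binary ground-truth mask, and is not taken from the cited implementations.

```python
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    """Simplified soft Dice loss for binary segmentation (illustrative sketch).

    probs:  predicted foreground probabilities, shape (batch, H, W), in [0, 1]
    target: binary ground-truth mask, same shape
    """
    intersection = (probs * target).sum(dim=(1, 2))
    denominator = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice.mean()  # 0 means perfect overlap; lower is better
```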
In the context of medical image classification, CNNs provide a categorical
probability distribution over the different observable classes, by applying a softmax function to the model's output. The final assigned class corresponds to the one having the highest probability. The same process is applied for medical image segmentation, except that the CNN predicts one class per pixel or voxel. UQ aims at complementing these predictions with uncertainty estimates, allowing a better interpretation of the results with respect to the model's confidence. In the following section, the main concepts of uncertainty are introduced.
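Before introducing these concepts, the following minimal sketch (our own illustration, not a method from the reviewed papers) shows how per-pixel logits are turned into class probabilities, a predicted segmentation, and a simple entropy-based uncertainty map:

```python
import torch
import torch.nn.functional as F

# logits: raw per-pixel scores of shape (batch, num_classes, H, W),
# e.g. the output of the toy segmentation network sketched earlier
logits = torch.randn(1, 2, 64, 64)

probs = F.softmax(logits, dim=1)      # categorical distribution per pixel
prediction = probs.argmax(dim=1)      # class with the highest probability
# Shannon entropy per pixel: a simple confidence-derived uncertainty estimate
entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
```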
2.3 The specific language of uncertainty
Predictive uncertainty, meaning the uncertainty associated with the prediction of a DL model, is typically divided into two parts: model (or epistemic) and data (or aleatoric) uncertainty.
Epistemic uncertainty describes the uncertainty arising from the lack of knowledge about the perfect predictor for the current input (Hüllermeier and Waegeman, 2021). In complex scenarios, there is often not a single model, but rather a multitude of models that can explain the observed data (Gal et al., 2016). Thus, uncertainty arises regarding the choice of the model parameters. Epistemic uncertainty is considered to be reducible, meaning that it can be reduced by using additional data. In practice, epistemic uncertainty is expected to be high for images far from the training data distribution (referred to as out-of-distribution (OOD) samples). Such a discrepancy between test and training datasets is frequent in medical image analysis, where there may be significant variations between images acquired at different hospitals or using different machines. Additionally, unexpected patterns can be encountered in test images, such as diseases not seen during training, or artifacts. Popular approaches to improve the generalizability of models to unseen domains include data augmentation (Chen et al., 2020; Ouyang et al., 2021; Zhang et al., 2020) or transfer learning (Ghafoorian et al., 2017).
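As a rough illustration of this intuition (a toy sketch, not one of the UQ methods reviewed in Section 3), the disagreement between several independently trained models can serve as a proxy for epistemic uncertainty; the function name below is ours:

```python
import torch

def ensemble_disagreement(models, x):
    """Toy proxy for epistemic uncertainty: variance of the predicted
    class probabilities across an ensemble of independently trained models.

    models: iterable of networks sharing the same output shape
    x:      input batch
    """
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=1) for m in models])
    # High variance across members signals inputs on which models trained on
    # the same data disagree, as expected for out-of-distribution samples.
    return probs.var(dim=0).mean(dim=1)
```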
Aleatoric uncertainty describes intrinsic noise and random effects within the data (Hüllermeier and Waegeman, 2021). It is not intrinsic to the model, but is rather a property of the underlying generative distribution of the data. In the context of classification or segmentation, aleatoric uncertainty increases when the number of classes is high and when these classes are fine-grained (Malinin, 2019). Aleatoric uncertainty is considered to be irreducible, meaning that it cannot be reduced with more data. In practice, the only way to diminish aleatoric uncertainty would be to increase the precision of the measurement system in order to reduce the noise that corrupts the dataset (Gal et al., 2016). Finally, aleatoric uncertainty can be further split into two categories: homoscedastic uncertainty, which is identical for each sample of the dataset, and heteroscedastic uncertainty, which depends on the query input.
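A commonly used way to model heteroscedastic aleatoric uncertainty in regression, recalled here purely for illustration (the notation is ours and is not specific to the works reviewed below), is to let the network predict both a mean $\hat{y}_\theta(x)$ and an input-dependent variance $\hat{\sigma}^2_\theta(x)$, trained with the Gaussian negative log-likelihood
\[
\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}
\frac{\lVert y_i - \hat{y}_\theta(x_i) \rVert^2}{2\,\hat{\sigma}^2_\theta(x_i)}
+ \frac{1}{2}\log \hat{\sigma}^2_\theta(x_i),
\]
where the first term down-weights the error on inputs predicted to be noisy, and the second term penalizes predicting an arbitrarily large variance everywhere.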
Lastly, closely linked to this notion of data uncertainty, the notion of label uncertainty was introduced for segmentation tasks. It has been observed that inter-rater variability in the manual delineation of medical images is important (Becker et al., 2019; Joskowicz et al., 2019). This has a direct impact on the model's overall uncertainty, as the same object of interest may be delineated differently depending on the rater.