HOW PRECISE ARE PERFORMANCE ESTIMATES FOR TYPICAL MEDICAL IMAGE
SEGMENTATION TASKS?
Rosana El Jurdi, Olivier Colliot
Sorbonne Université, Institut du Cerveau - Paris Brain Institute - ICM, CNRS, Inria, Inserm, AP-HP,
Hôpital de la Pitié Salpêtrière, F-75013, Paris, France
ABSTRACT
An important issue in medical image processing is to be able
to estimate not only the performance of algorithms but also
the precision of these performance estimates. Reporting
precision typically amounts to reporting the standard error of
the mean (SEM) or, equivalently, confidence intervals. However,
this is rarely done in medical image segmentation studies.
In this paper, we aim to estimate the typical confidence
that can be expected in such studies. To that end, we
first perform experiments for Dice metric estimation using
a standard deep learning model (U-net) and a classical task
from the Medical Segmentation Decathlon. We extensively
study precision estimation using both a Gaussian assumption
and bootstrapping (which does not require any assumption on
the distribution). We then perform simulations for other test
set sizes and performance spreads. Overall, our work shows
that small test sets lead to wide confidence intervals (e.g. about 8
points of Dice for 20 samples with σ ≈ 10).
Index Terms: Segmentation, Performance, Validation,
Statistical analysis, Confidence interval, Standard error.
1. INTRODUCTION
In medical imaging, sample sizes are commonly on the order of
dozens of subjects, at best hundreds or thousands.
In 3D medical image segmentation, the size of the set
used to evaluate the performance may be even smaller than
for other medical imaging tasks as obtaining the ground truth
requires voxel-wise annotation by trained raters.
Intuitively, the precision of the performance estimate
depends on two factors: the variability of the performance
across the test set (the more variable, the less precise)
and the size of the test set (smaller sets lead to lower
precision and therefore wider confidence intervals). However,
papers usually report the average performance for different
metrics (e.g. average Dice) but not the precision with which
this average performance is estimated. (Throughout the paper,
precision means how precise the estimates of the performance
are. It has nothing to do with the performance metric Precision,
also known as Positive Predictive Value.) Such precision can be
provided in the form of confidence intervals or, equivalently,
the standard error of the mean (SEM), which are not often
reported. What is more often reported is the empirical standard
deviation over different folds of a cross-validation. While this
may qualitatively characterize the variability of the learning
procedure when the training and testing sets change, it should
never be used to compute the SEM, since here n would be the
number of folds or splits, which is arbitrary and can be made
as large as one wants, thereby making the confidence interval
arbitrarily narrow. It is not even an unbiased estimate of the
standard deviation of the performance metric [1].
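
As an illustration of the two factors above, the following minimal Python sketch computes the SEM and a 95% confidence interval from per-subject scores on an independent test set, under a Gaussian assumption. The function names `sem_and_ci` and `ci_width`, the constant z = 1.96, and the example scores are assumptions introduced here for illustration; they are not taken from the paper's experiments.

```python
import math
import statistics


def sem_and_ci(scores, z=1.96):
    """Return (mean, SEM, (lower, upper)) for per-subject scores.

    Assumes the scores come from an independent test set and that the
    sample mean is approximately Gaussian (z = 1.96 for a 95% interval).
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1)
    sem = sd / math.sqrt(n)        # n is the number of test subjects
    return mean, sem, (mean - z * sem, mean + z * sem)


def ci_width(sigma, n, z=1.96):
    """Full width of the Gaussian confidence interval for a given spread."""
    return 2 * z * sigma / math.sqrt(n)


# Hypothetical Dice scores (in points) for five test subjects.
mean, sem, (lower, upper) = sem_and_ci([80.0, 90.0, 70.0, 85.0, 75.0])
print(f"mean={mean:.1f} SEM={sem:.2f} CI=[{lower:.1f}, {upper:.1f}]")
print(f"width for sigma=10, n=20: {ci_width(10, 20):.2f}")
```

With σ = 10 and n = 20, `ci_width` gives a width of roughly 8.8 Dice points. For such small n, a Student t quantile instead of 1.96 would give a slightly wider interval; the sketch keeps the plain Gaussian form for simplicity.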
Quantifying the precision of performance estimates
thus requires an independent test set, on which
confidence intervals or SEM are reported. Since this is not
typically done in medical image segmentation papers, one
may ask the following questions. What precision can be
expected for a typical sample size? How trustworthy are the
average performance estimates (for instance Dice coefficients)
reported in medical image segmentation papers?
Surprisingly, this question has received little attention in
medical imaging. For a different task, namely image
classification, large sample sizes are necessary for a
precise estimation of the accuracy (typically 10,000 samples
to achieve a 1%-wide confidence interval given an accuracy
of about 90% to 95%) [2, 3]. However, to the best of our
knowledge, the corresponding requirement is not widely known
for segmentation. We hypothesize that the test size needed to
achieve a given precision is lower than for classification due
to the continuous nature of performance measures [4].
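
The order of magnitude quoted for classification can be checked with a short sketch. It assumes the normal (Wald) approximation to the binomial proportion, which is an assumption made here and may differ from the interval used in [2, 3]; the function name `accuracy_ci_width` is hypothetical.

```python
import math


def accuracy_ci_width(p, n, z=1.96):
    """Full width of the normal-approximation CI for an accuracy p.

    p is the accuracy as a fraction in [0, 1], n the number of test
    samples, and z = 1.96 corresponds to a 95% interval.
    """
    return 2 * z * math.sqrt(p * (1 - p) / n)


for p in (0.90, 0.95):
    width = accuracy_ci_width(p, 10_000)
    print(f"accuracy={p:.2f}, n=10000: width={100 * width:.2f} points")
```

Under this approximation, 10,000 samples give a width of about 1.2 points at 90% accuracy and about 0.9 points at 95%, that is, close to the 1%-wide interval mentioned above.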
Our objective is to study the precision that can be expected
in 3D medical image segmentation for typical test set sizes.
We first conduct experiments using a standard deep learning
network applied to a classical segmentation task from the
Segmentation Decathlon Challenge [5] in order to estimate the
confidence intervals obtained for various test set sizes.
We then perform simulations for other sizes and spreads. We
emphasize that the aim of the present paper is not to propose a
new segmentation methodology. Instead, the main aims are
to provide information regarding the confidence intervals that
can typically be expected in medical image segmentation and
to raise the community's awareness of this important issue.
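
To make the idea of simulating other sizes and spreads concrete, the sketch below draws synthetic Dice scores and computes a percentile bootstrap confidence interval for their mean, which requires no distributional assumption. Everything in it is an illustrative assumption rather than the paper's protocol: the helper names `simulate_dice` and `bootstrap_ci`, the clipped Gaussian used to generate scores, the mean of 80 and spread of 10 Dice points, and the 2,000 bootstrap resamples.

```python
import random
import statistics


def simulate_dice(n, mu=80.0, sigma=10.0, seed=0):
    """Draw n synthetic Dice scores (in points), clipped to [0, 100].

    The clipped Gaussian is a hypothetical stand-in for real scores.
    """
    rng = random.Random(seed)
    return [min(100.0, max(0.0, rng.gauss(mu, sigma))) for _ in range(n)]


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the test set with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    lower = means[round((alpha / 2) * n_boot)]
    upper = means[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


for n in (20, 50, 200):
    scores = simulate_dice(n)
    lower, upper = bootstrap_ci(scores)
    print(f"n={n}: mean={statistics.fmean(scores):.1f} "
          f"CI=[{lower:.1f}, {upper:.1f}] width={upper - lower:.1f}")
```

Running the loop for several values of n shows the interval narrowing as the test set grows, which is the qualitative behaviour discussed above.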
arXiv:2210.14677v3 [cs.CV] 24 May 2023