Performance Deterioration of Deep Learning Models after
Clinical Deployment: A Case Study with Auto-segmentation for
Definitive Prostate Cancer Radiotherapy
Biling Wang1,3,^, Michael Dohopolski1,2,^, Ti Bai1,2, Junjie Wu1,2, Raquibul Hannan1,2, Neil Desai1,2,
Aurelie Garant1,2, Daniel Yang1,2, Dan Nguyen1,2, Mu-Han Lin1,2, Robert Timmerman1,2, Xinlei
Wang3,4,*, Steve Jiang1,2,*
1Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical
Center, Dallas, Texas, USA
2Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, Texas,
USA
3Department of Statistical Science, Southern Methodist University, Dallas, Texas, USA
4Department of Mathematics, University of Texas at Arlington, Dallas, Texas, USA
*Co-corresponding authors. Email: xinlei.wang@uta.edu, steve.jiang@utsouthwestern.edu
^ Co-first authors.
Abstract
Background Deep learning (DL)-based artificial intelligence (AI) has made significant strides in
the medical domain. There is growing concern that, over time, AI models may lose their
generalizability, especially with new patient populations or shifting clinical workflows. We therefore
evaluated the temporal performance of our DL-based prostate radiotherapy auto-segmentation
model, seeking to correlate its efficacy with changes in the clinical landscape.
Methods We retrospectively simulated the clinical implementation of our DL model to investigate
temporal performance patterns. Our study involved 1328 prostate cancer patients who underwent
definitive radiotherapy between January 2006 and August 2022 at the University of Texas
Southwestern Medical Center (UTSW). We trained a U-Net-based auto-segmentation model on
data obtained between 2006 and 2011 and tested it on data obtained between 2012 and 2022,
simulating the model’s clinical deployment starting in 2012. We measured the model’s performance
using the Dice similarity coefficient (DSC) and visualized trends in contour quality using
exponentially weighted moving average (EMA) curves. Additionally, we performed the Wilcoxon
rank-sum test to analyze differences in DSC distributions across distinct periods and multiple linear
regression to investigate the impact of various clinical factors.
Findings During the initial deployment of the model from 2012 to 2014, it exhibited peak
performance for all three organs, i.e., prostate, rectum, and bladder. However, after 2015, there
was a pronounced decline in the EMA DSC for the prostate and rectum, while the bladder contour
quality remained relatively stable. Key factors that impacted the prostate contour quality included
physician contouring styles, the use of different hydrogel spacers, CT scan slice thickness, MRI-
guided contouring, and the use of intravenous (IV) contrast. Rectum contour quality was notably
influenced by slice thickness, physician contouring styles, and the use of different hydrogel
spacers. Bladder contour quality was primarily affected by the use of IV contrast.
Interpretation Our study highlights the inherent challenges of maintaining consistent AI model
performance in a rapidly evolving field like clinical medicine. Temporal changes in clinical practices,
personnel shifts, and the introduction of new techniques can erode a model's effectiveness over
time. Although our prostate radiotherapy auto-segmentation model initially showed promising
results, its performance declined with the evolution of clinical practices. Nevertheless, by
integrating updated data and recalibrating the model to mirror contemporary clinical practices, we
can revitalize and sustain its performance, ensuring its continued relevance in clinical
environments.
Funding This study is supported by NIH grants R01CA237269, R01CA254377, and
R01CA258987.
Keywords: Deep Learning; Segmentation; Model Performance Deterioration; Radiotherapy
Research in context
Evidence before this study
In 2021, we searched PubMed, IEEE Xplore, and Web of Science, without language restrictions, using
keywords such as "artificial intelligence," "deep learning," "machine learning," "model performance
deterioration," "performance change," "performance decrease," "performance shift," "calibration drift,"
and "medicine." Numerous studies were identified that discuss or address issues of deep learning model
generalizability; however, most were confined to the spatial domain. For
example, models trained and validated with data from a single institution often underperform when
applied elsewhere. The literature in the medical field tends to overlook the potential for performance
degradation of models post-deployment, even when they are used within the same institution. Only
three papers marginally acknowledged performance deterioration in their machine learning models, but
they did not investigate the causes. Two papers highlighted the importance of ongoing monitoring and
updating of AI algorithms in healthcare but fell short of examining how and why model performance may
change over time. No dedicated study had analyzed the reasons for performance changes in deep
learning models after their initial deployment.
Added value of this study
Our study is pioneering in its demonstration that a model's performance can decline significantly over
time. We have identified specific variables that contribute to shifts in data distribution, leading to a
decrease in performance. Changes in clinical practices, staffing alterations, and the introduction of novel
medical techniques can all contribute to diminishing a model's accuracy. By incorporating insights from
post-deployment data, we have biennially updated our model and observed consistent enhancements
in performance after each update. Recognizing this caveat is critical for the successful clinical
deployment of deep learning models.
Implications of all the available evidence
The implications of our findings extend beyond the particular model and clinical task investigated in this
study; they are relevant to various models used in the medical field. Our research underlines the
necessity of establishing protocols for constant monitoring and optimization of these models to maintain
their effectiveness and value in patient care across multiple medical settings.
1. Introduction
Over the last ten years, artificial intelligence (AI), driven by deep learning (DL) techniques, has achieved
remarkable progress, particularly in areas such as computer vision (CV) and natural language
processing (NLP), leading to transformative developments across numerous applications. This surge
has led to significant enthusiasm in the medical realm, and DL-related medical publications have been
growing exponentially since 2015.1 However, despite the promising prospects of DL in the medical field,
its practical deployment remains constrained.2
This lack of clinical translation is multifactorial. First, the interpretability of many DL models remains a
challenge;3, 4 therefore, clinicians are rightly skeptical when assessing whether a model can be
appropriately applied to patient care.5-9 This skepticism is not unfounded, as issues of generalizability
persist in many DL models.10-20 For instance, a model trained and validated on data from one institution
may fail when implemented at another.21 Another critical issue, often overlooked in medical literature, is
the potential for a model’s performance to degrade after initial deployment.22-26 The decline in model
performance post-clinical deployment can often be attributed to data drift, such as variations in imaging
acquisition protocols over time within the institution, and evolving practice patterns as new faculty join.27
In one of the first clinically-oriented studies evaluating a model's performance, Davis et al.22 observed
a temporal decline in their model’s ability to predict acute kidney injury. While they attributed this decline
to calibration drift, they did not explore the underlying factors in detail. Similarly, Nestor et al.25
noted temporal performance changes when predicting mortality and prolonged length of stay. Clearly,
there is a pressing need for further research to explore how and why a DL model's performance may
deteriorate over time. In this study, we have observed a temporal decrease in the accuracy of our
automated prostate segmentation model. Furthermore, we investigated the potential impact of evolving
clinical workflows on this observed decline in model performance. We found that by refreshing the model
with recent data, we were able to enhance its accuracy.
2. Methods and Materials
We retrospectively simulated the clinical implementation of our DL model to investigate temporal
performance patterns. Our study involved 1328 prostate cancer patients who underwent definitive
external beam radiotherapy (EBRT) between January 2006 and August 2022 at the University of Texas
Southwestern Medical Center (UTSW). We trained a U-Net-based auto-segmentation model on data
obtained between 2006 and 2011 and tested it on data obtained between 2012 and 2022, simulating
the model’s clinical deployment starting in 2012. We measured the model’s performance using the Dice
similarity coefficient (DSC) and visualized trends in contour quality using exponentially weighted
moving average (EMA) curves. Additionally, we performed the Wilcoxon rank-sum test to analyze
differences in DSC distributions across distinct periods and multiple linear regression to investigate the
impact of various clinical factors.
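To make the evaluation pipeline concrete, the sketch below shows, under our own assumptions about data layout, how a per-patient DSC can be computed from binary masks, smoothed with an exponentially weighted moving average, and compared across periods with the Wilcoxon rank-sum test. The function names, the smoothing factor, and the placeholder DSC values are illustrative and not taken from the paper.

```python
import numpy as np
from scipy.stats import ranksums  # Wilcoxon rank-sum test

def dice_similarity(pred: np.ndarray, ref: np.ndarray) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for two binary segmentation masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom else 1.0

def ema(values, alpha=0.1):
    """Exponentially weighted moving average of a DSC series ordered by treatment date."""
    smoothed, current = [], values[0]
    for v in values:
        current = alpha * v + (1.0 - alpha) * current
        smoothed.append(current)
    return np.array(smoothed)

# Example: compare DSC distributions between two deployment periods (placeholder values).
dsc_2012_2014 = np.random.default_rng(0).uniform(0.80, 0.92, 50)
dsc_2015_2019 = np.random.default_rng(1).uniform(0.70, 0.88, 120)
stat, p_value = ranksums(dsc_2012_2014, dsc_2015_2019)
print(f"Wilcoxon rank-sum statistic={stat:.2f}, p={p_value:.4f}")
```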
Dataset
In this single-institutional study approved by our institutional review board, we identified 1480 patients
at UTSW diagnosed with prostate cancer and treated with definitive EBRT from January 2006 to August
2022. EBRT treatment regimens included conventional, moderately hypofractionated, or ultra-
hypofractionated radiotherapy, the latter also known as stereotactic body radiotherapy (SBRT). All
patients had delineated contours on the radiotherapy planning computed tomography (CT) scan for the
prostate, rectum, or bladder. We excluded prostate contours that incorporated the seminal vesicles. To
be included, patients were required to have at least a prostate, bladder, or rectum contour. Patients were
also excluded if significant imaging artifacts were observed. Our final cohort comprised 1,328 patients (Figure 1).
Within the final cohort, 982 had well-defined prostate contours, 1269 had available rectum contours,
and 1277 had available bladder contours. One hundred sixty-three (163) patients were treated
between 2006 and 2011, 203 between 2012 and 2014, 602 between 2015 and 2019, and 360
between 2020 and 2022.
Figure 1. Data selection flow chart
We extracted variables including the treating physician, CT scan slice thickness, type of hydrogel
spacer, either non-contrast (Type I) or contrast-enhancing (Type II), use of intravenous (IV) contrast at
the time of CT simulation (determined by evaluating whether IV contrast was present in the bladder),
and the use of magnetic resonance imaging-guided (MRI-guided) contouring. Detailed information on
these variables is provided in Section 3.2.
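As a hedged illustration of how these clinical factors could enter the multiple linear regression, the snippet below fits DSC against categorical and binary covariates with statsmodels on a synthetic data frame; the column names, synthetic values, and effect sizes are hypothetical and do not reflect the actual analysis code or results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 200  # synthetic sample size, for illustration only

# Hypothetical per-patient table; column names and values are illustrative only.
df = pd.DataFrame({
    "physician":       rng.choice(["A", "B", "C"], n),
    "slice_thickness": rng.choice([1.0, 2.0, 3.0], n),            # mm
    "spacer":          rng.choice(["none", "type_I", "type_II"], n),
    "iv_contrast":     rng.integers(0, 2, n),
    "mri_guided":      rng.integers(0, 2, n),
})
# Synthetic DSC with mild dependence on the covariates, plus noise.
df["dsc"] = (0.85
             - 0.03 * (df["spacer"] == "type_II")
             - 0.02 * df["iv_contrast"]
             + rng.normal(0, 0.02, n))

# Multiple linear regression of DSC on the extracted clinical factors.
model = smf.ols(
    "dsc ~ C(physician) + slice_thickness + C(spacer) + iv_contrast + mri_guided",
    data=df,
).fit()
print(model.summary())
```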
Model Training
We utilized a 3D U-Net-based auto-segmentation model to contour the prostate, rectum, and bladder
on CT images intended for radiotherapy planning. Our implementation was based on the open-source
MONAI U-Net.28 The model was trained with the Adam optimizer using its default hyperparameters
(β₁ = 0.9 and β₂ = 0.999) and the Dice loss function. The learning rate was initialized and then reduced
twice at fixed iterations during training following a step schedule. We set the batch size to one. The
model was trained and validated using data from 163 patients treated before 2012.
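As a minimal sketch of this setup, assuming single-channel CT input and four output channels (background plus prostate, rectum, and bladder), the following uses the open-source MONAI U-Net with the Adam optimizer and Dice loss. The channel/stride configuration, learning rate, and training-step function are placeholders rather than the paper's exact recipe.

```python
import torch
from monai.networks.nets import UNet
from monai.losses import DiceLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 3D U-Net: 1 CT input channel, 4 output channels (background, prostate, rectum, bladder).
# The channel/stride configuration below is illustrative, not the paper's exact architecture.
model = UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=4,
    channels=(16, 32, 64, 128, 256),
    strides=(2, 2, 2, 2),
    num_res_units=2,
).to(device)

loss_fn = DiceLoss(to_onehot_y=True, softmax=True)           # Dice loss on softmax outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # Adam defaults: betas=(0.9, 0.999)

def train_step(image: torch.Tensor, label: torch.Tensor) -> float:
    """One optimization step with batch size one, as described in the paper."""
    model.train()
    optimizer.zero_grad()
    logits = model(image.to(device))            # image: (1, 1, D, H, W)
    loss = loss_fn(logits, label.to(device))    # label: (1, 1, D, H, W) with integer class ids
    loss.backward()
    optimizer.step()
    return loss.item()
```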
Longitudinal performance evaluation