Performance Deterioration of Deep Learning Models after
Clinical Deployment: A Case Study with Auto-segmentation for
Definitive Prostate Cancer Radiotherapy
Biling Wang1,3,^, Michael Dohopolski1,2,^, Ti Bai1,2, Junjie Wu1,2, Raquibul Hannan1,2, Neil Desai1,2,
Aurelie Garant1,2, Daniel Yang1,2, Dan Nguyen1,2, Mu-Han Lin1,2, Robert Timmerman1,2, Xinlei
Wang3,4,*, Steve Jiang1,2,*
1Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical
Center, Dallas, Texas, USA
2Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, Texas,
USA
3Department of Statistical Science, Southern Methodist University, Dallas, Texas, USA
4Department of Mathematics, University of Texas at Arlington, Dallas, Texas, USA
*Co-corresponding authors. Email: xinlei.wang@uta.edu, steve.jiang@utsouthwestern.edu
^ Co-first authors.
Abstract
Background Deep learning (DL)-based artificial intelligence (AI) has made significant strides in
the medical domain. There is growing concern that, over time, AI models may lose their
generalizability, especially with new patient populations or shifting clinical workflows. We therefore
evaluated the temporal performance of our DL-based prostate radiotherapy auto-segmentation
model, seeking to correlate its efficacy with changes in the clinical landscape.
Methods We retrospectively simulated the clinical implementation of our DL model to investigate
temporal performance patterns. Our study involved 1328 prostate cancer patients who underwent
definitive radiotherapy between January 2006 and August 2022 at the University of Texas
Southwestern Medical Center (UTSW). We trained a U-Net-based auto-segmentation model on
data obtained between 2006 and 2011 and tested it on data obtained between 2012 and 2022,
simulating the model’s clinical deployment starting in 2012. We measured the model’s performance
using the Dice similarity coefficient (DSC) and visualized trends in contour quality using
exponentially weighted moving average (EMA) curves. Additionally, we performed the Wilcoxon
rank-sum test to analyze differences in DSC distributions across distinct periods and multiple linear
regression to investigate the impact of various clinical factors.
Findings During the initial deployment of the model from 2012 to 2014, it exhibited peak
performance for all three organs, i.e., prostate, rectum, and bladder. However, after 2015, there
was a pronounced decline in the EMA DSC for the prostate and rectum, while the bladder contour
quality remained relatively stable. Key factors that impacted the prostate contour quality included
physician contouring styles, the use of different hydrogel spacers, CT scan slice thickness, MRI-
guided contouring, and the use of intravenous (IV) contrast. Rectum contour quality was notably
influenced by slice thickness, physician contouring styles, and the use of different hydrogel
spacers. Bladder contour quality was primarily affected by the use of IV contrast.
Interpretation Our study highlights the inherent challenges of maintaining consistent AI model
performance in a rapidly evolving field like clinical medicine. Temporal changes in clinical practices,
personnel shifts, and the introduction of new techniques can erode a model's effectiveness over
time. Although our prostate radiotherapy auto-segmentation model initially showed promising
results, its performance declined with the evolution of clinical practices. Nevertheless, by
integrating updated data and recalibrating the model to mirror contemporary clinical practices, we
can revitalize and sustain its performance, ensuring its continued relevance in clinical
environments.
Funding This study is supported by NIH grants R01CA237269, R01CA254377, and
R01CA258987.
Keywords: Deep Learning; Segmentation; Model Performance Deterioration; Radiotherapy
Research in context
Evidence before this study
In 2021, we searched PubMed, IEEE Xplore, and Web of Science, without language restrictions, using
keywords such as "artificial intelligence," "deep learning," "machine learning," "model performance
deterioration," "performance change," "performance decrease," "performance shift," "calibration drift,"
and "medicine." Numerous studies were identified that discuss or address issues of deep learning model
generalizability; however, most were confined to the spatial domain. For
example, models trained and validated with data from a single institution often underperform when
applied elsewhere. The literature in the medical field tends to overlook the potential for performance
degradation of models post-deployment, even when they are used within the same institution. Only
three papers marginally acknowledged performance deterioration in their machine learning models, but
they did not investigate the causes. Two papers highlighted the importance of ongoing monitoring and
updating of AI algorithms in healthcare but fell short of examining how and why model performance may
change over time. No dedicated study had analyzed the reasons for performance changes in deep
learning models after their initial deployment.
Added value of this study
Our study is pioneering in its demonstration that a model's performance can decline significantly over
time. We have identified specific variables that contribute to shifts in data distribution, leading to a
decrease in performance. Changes in clinical practices, staffing alterations, and the introduction of novel
medical techniques can all contribute to diminishing a model's accuracy. By incorporating insights from
post-deployment data, we have biennially updated our model and observed consistent enhancements
in performance after each update. Recognizing this caveat is critical for the successful clinical
deployment of deep learning models.
Implications of all the available evidence
The implications of our findings extend beyond the particular model and clinical task investigated in this
study; they are relevant to various models used in the medical field. Our research underlines the
necessity of establishing protocols for constant monitoring and optimization of these models to maintain
their effectiveness and value in patient care across multiple medical settings.
1. Introduction
Over the last ten years, artificial intelligence (AI), driven by deep learning (DL) techniques, has achieved
remarkable progress, particularly in areas such as computer vision (CV) and natural language
processing (NLP), leading to transformative developments across numerous applications. This surge
has led to significant enthusiasm in the medical realm, and DL-related medical publications have been
growing exponentially since 2015.1 However, despite the promising prospects of DL in the medical field,
its practical deployment remains constrained.2
This lack of clinical translation is multifactorial. First, the interpretability of many DL models remains a
challenge;3, 4 therefore, clinicians are rightly skeptical when assessing whether a model can be
appropriately applied to patient care.5-9 This skepticism is not unfounded, as issues of generalizability
persist in many DL models.10-20 For instance, a model trained and validated on data from one institution
may fail when implemented at another.21 Another critical issue, often overlooked in medical literature, is
the potential for a model’s performance to degrade after initial deployment.22-26 The decline in model
performance post-clinical deployment can often be attributed to data drift, such as variations in imaging
acquisition protocols over time within the institution, and evolving practice patterns as new faculty join.27
In one of the first clinically-oriented studies evaluating a model's performance, Davis et al.22 observed
a temporal decline in their model’s ability to predict acute kidney injury. While they attributed this decline
to calibration drift, they did not explore the underlying factors in detail. Similarly, Nestor et al.25
noted temporal performance changes when predicting mortality and prolonged length of stay. Clearly,
there is a pressing need for further research to explore how and why a DL model's performance may
deteriorate over time. In this study, we have observed a temporal decrease in the accuracy of our
automated prostate segmentation model. Furthermore, we investigated the potential impact of evolving
clinical workflows on this observed decline in model performance. We found that by refreshing the model
with recent data, we were able to enhance its accuracy.
2. Methods and Materials
We retrospectively simulated the clinical implementation of our DL model to investigate temporal
performance patterns. Our study involved 1328 prostate cancer patients who underwent definitive
external beam radiotherapy (EBRT) between January 2006 and August 2022 at the University of Texas
Southwestern Medical Center (UTSW). We trained a U-Net-based auto-segmentation model on data
obtained between 2006 and 2011 and tested it on data obtained between 2012 and 2022, simulating
the model’s clinical deployment starting in 2012. We measured the model’s performance using the Dice
similarity coefficient (DSC) and visualized trends in contour quality using exponentially weighted
moving average (EMA) curves. Additionally, we performed the Wilcoxon rank-sum test to analyze
differences in DSC distributions across distinct periods and multiple linear regression to investigate the
impact of various clinical factors.
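To make the evaluation pipeline concrete, the sketch below shows, under our own assumptions about data layout, how a per-patient DSC can be computed from binary masks, smoothed with an exponentially weighted moving average, and compared across periods with the Wilcoxon rank-sum test. The function names, the smoothing factor, and the placeholder DSC values are illustrative and not taken from the paper.

```python
import numpy as np
from scipy.stats import ranksums  # Wilcoxon rank-sum test

def dice_similarity(pred: np.ndarray, ref: np.ndarray) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for two binary segmentation masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom else 1.0

def ema(values, alpha=0.1):
    """Exponentially weighted moving average of a DSC series ordered by treatment date."""
    smoothed, current = [], values[0]
    for v in values:
        current = alpha * v + (1.0 - alpha) * current
        smoothed.append(current)
    return np.array(smoothed)

# Example: compare DSC distributions between two deployment periods (placeholder values).
dsc_2012_2014 = np.random.default_rng(0).uniform(0.80, 0.92, 50)
dsc_2015_2019 = np.random.default_rng(1).uniform(0.70, 0.88, 120)
stat, p_value = ranksums(dsc_2012_2014, dsc_2015_2019)
print(f"Wilcoxon rank-sum statistic={stat:.2f}, p={p_value:.4f}")
```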
Dataset
In this single-institutional study approved by our institutional review board, we identified 1480 patients
at UTSW diagnosed with prostate cancer and treated with definitive EBRT from January 2006 to August
2022. EBRT treatment regimens included conventional, moderately hypofractionated, or ultra-
hypofractionated radiotherapy, the latter also known as stereotactic body radiotherapy (SBRT). All
patients had delineated contours on the radiotherapy planning computed tomography (CT) scan for the
prostate, rectum, or bladder. We excluded prostate contours that incorporated the seminal vesicles. To
be included, patients were required to have at least a prostate, bladder, or rectum contour. Patients were
also excluded if significant imaging artifacts were observed. Our final cohort comprised 1,328 patients (Figure 1).
Within the final cohort, 982 had well-defined prostate contours, 1269 had available rectum contours,
and 1277 had available bladder contours. One hundred sixty-three (163) patients were treated
between 2006 and 2011, 203 between 2012 and 2014, 602 between 2015 and 2019, and 360
between 2020 and 2022.
Figure 1. Data selection flow chart
We extracted variables including the treating physician, CT scan slice thickness, type of hydrogel
spacer, either non-contrast (Type I) or contrast-enhancing (Type II), use of intravenous (IV) contrast at
the time of CT simulation (determined by evaluating whether IV contrast was present in the bladder),
and the use of magnetic resonance imaging-guided (MRI-guided) contouring. Detailed information on
these variables is provided in Section 3.2.
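As a hedged illustration of how these clinical factors could enter the multiple linear regression, the snippet below fits DSC against categorical and binary covariates with statsmodels on a synthetic data frame; the column names, synthetic values, and effect sizes are hypothetical and do not reflect the actual analysis code or results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 200  # synthetic sample size, for illustration only

# Hypothetical per-patient table; column names and values are illustrative only.
df = pd.DataFrame({
    "physician":       rng.choice(["A", "B", "C"], n),
    "slice_thickness": rng.choice([1.0, 2.0, 3.0], n),            # mm
    "spacer":          rng.choice(["none", "type_I", "type_II"], n),
    "iv_contrast":     rng.integers(0, 2, n),
    "mri_guided":      rng.integers(0, 2, n),
})
# Synthetic DSC with mild dependence on the covariates, plus noise.
df["dsc"] = (0.85
             - 0.03 * (df["spacer"] == "type_II")
             - 0.02 * df["iv_contrast"]
             + rng.normal(0, 0.02, n))

# Multiple linear regression of DSC on the extracted clinical factors.
model = smf.ols(
    "dsc ~ C(physician) + slice_thickness + C(spacer) + iv_contrast + mri_guided",
    data=df,
).fit()
print(model.summary())
```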
Model Training
We utilized a 3D U-Net-based auto-segmentation model to contour the prostate, rectum, and bladder
on CT images intended for radiotherapy planning. Our implementation was based on the open-source
MONAI U-Net.28 The model was trained with the Adam optimizer using its default hyperparameters
(β₁ = 0.9 and β₂ = 0.999) and the Dice loss function. The learning rate was initialized and then reduced
twice at fixed iterations during training following a step schedule. We set the batch size to one. The
model was trained and validated using data from 163 patients treated before 2012.
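As a minimal sketch of this setup, assuming single-channel CT input and four output channels (background plus prostate, rectum, and bladder), the following uses the open-source MONAI U-Net with the Adam optimizer and Dice loss. The channel/stride configuration, learning rate, and training-step function are placeholders rather than the paper's exact recipe.

```python
import torch
from monai.networks.nets import UNet
from monai.losses import DiceLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 3D U-Net: 1 CT input channel, 4 output channels (background, prostate, rectum, bladder).
# The channel/stride configuration below is illustrative, not the paper's exact architecture.
model = UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=4,
    channels=(16, 32, 64, 128, 256),
    strides=(2, 2, 2, 2),
    num_res_units=2,
).to(device)

loss_fn = DiceLoss(to_onehot_y=True, softmax=True)           # Dice loss on softmax outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # Adam defaults: betas=(0.9, 0.999)

def train_step(image: torch.Tensor, label: torch.Tensor) -> float:
    """One optimization step with batch size one, as described in the paper."""
    model.train()
    optimizer.zero_grad()
    logits = model(image.to(device))            # image: (1, 1, D, H, W)
    loss = loss_fn(logits, label.to(device))    # label: (1, 1, D, H, W) with integer class ids
    loss.backward()
    optimizer.step()
    return loss.item()
```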
Longitudinal performance evaluation