Performance Deterioration of Deep Learning Models after
Clinical Deployment: A Case Study with Auto-segmentation for
Definitive Prostate Cancer Radiotherapy
Biling Wang1,3,^, Michael Dohopolski1,2,^, Ti Bai1,2, Junjie Wu1,2, Raquibul Hannan1,2, Neil Desai1,2,
Aurelie Garant1,2, Daniel Yang1,2, Dan Nguyen1,2, Mu-Han Lin1,2, Robert Timmerman1,2, Xinlei
Wang3,4,*, Steve Jiang1,2,*
1Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical
Center, Dallas, Texas, USA
2Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, Texas,
USA
3Department of Statistical Science, Southern Methodist University, Dallas, Texas, USA
4Department of Mathematics, University of Texas at Arlington, Dallas, Texas, USA
*Co-corresponding authors. Email: xinlei.wang@uta.edu, steve.jiang@utsouthwestern.edu
^ Co-first authors.
Abstract
Background Deep learning (DL)-based artificial intelligence (AI) has made significant strides in
the medical domain. There is growing concern that, over time, AI models may lose their
generalizability, particularly when faced with new patient populations or shifting clinical workflows.
We therefore evaluated the temporal performance of our DL-based prostate radiotherapy
auto-segmentation model, seeking to correlate changes in its efficacy with changes in the clinical
landscape.
Methods We retrospectively simulated the clinical implementation of our DL model to investigate
temporal performance patterns. Our study involved 1328 prostate cancer patients who underwent
definitive radiotherapy between January 2006 and August 2022 at the University of Texas
Southwestern Medical Center (UTSW). We trained a U-Net-based auto-segmentation model on
data obtained between 2006 and 2011 and tested it on data obtained between 2012 and 2022,
simulating the model's clinical deployment starting in 2012. We measured the model's performance
using the Dice similarity coefficient (DSC) and visualized trends in contour quality with
exponentially weighted moving average (EMA) curves. Additionally, we performed Wilcoxon
rank-sum tests to compare DSC distributions across distinct periods and used multiple linear
regression to investigate the impact of various clinical factors.
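For concreteness, the two quantities underpinning this monitoring analysis, the per-patient DSC and its EMA trend, can be sketched in Python as below; the function names, the smoothing factor alpha, and the sample values are illustrative assumptions, not the study's exact implementation.

import numpy as np
from scipy.stats import ranksums

def dice_similarity_coefficient(pred, ref):
    # DSC = 2 * |A intersect B| / (|A| + |B|) for two binary segmentation masks.
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom > 0 else 1.0

def ema_curve(values, alpha=0.1):
    # Exponentially weighted moving average of a chronologically ordered series;
    # alpha is an assumed smoothing factor, not the value used in the study.
    ema = [float(values[0])]
    for v in values[1:]:
        ema.append(alpha * float(v) + (1.0 - alpha) * ema[-1])
    return np.array(ema)

# Illustrative per-patient DSC scores, ordered by treatment date.
dsc_scores = np.array([0.88, 0.91, 0.85, 0.79, 0.82])
trend = ema_curve(dsc_scores, alpha=0.1)

# Compare DSC distributions between two time periods (Wilcoxon rank-sum test).
stat, p_value = ranksums(dsc_scores[:2], dsc_scores[2:])

In this formulation, a smaller alpha yields a smoother, slower-reacting trend curve, which is what makes EMA curves well suited to surfacing gradual performance drift rather than case-to-case noise.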
Findings During its initial deployment period from 2012 to 2014, the model exhibited peak
performance for all three organs: prostate, rectum, and bladder. After 2015, however, the EMA
DSC declined markedly for the prostate and rectum, while bladder contour quality remained
relatively stable. Key factors affecting prostate contour quality included physician contouring
styles, the use of various hydrogel spacers, CT scan slice thickness, MRI-guided contouring, and
the use of intravenous (IV) contrast. Rectum contour quality was notably influenced by slice
thickness, physician contouring styles, and the use of various hydrogel spacers. Bladder contour
quality was primarily affected by the use of IV contrast.
Interpretation Our study highlights the inherent challenges of maintaining consistent AI model
performance in a rapidly evolving field like clinical medicine. Temporal changes in clinical practices,
personnel shifts, and the introduction of new techniques can erode a model's effectiveness over
time. Although our prostate radiotherapy auto-segmentation model initially showed promising
results, its performance declined as clinical practices evolved. Nevertheless, by
integrating updated data and recalibrating the model to mirror contemporary clinical practices, we
can revitalize and sustain its performance, ensuring its continued relevance in clinical
environments.