Interpreting County Level COVID-19 Infection and Feature Sensitivity using Deep Learning Time Series Models

Md Khairul Islam 1, Di Zhu 1, Yingzheng Liu 1, Andrej Erkelens 2, Nick Daniello 2, Judy Fox 1,2
1Computer Science Department, University of Virginia
2School of Data Science, University of Virginia
Charlottesville, USA
Email : {mi3se, yqx8es, yl4dt, wsw3fa, njd9e, cwk9mp}@virginia.edu
Abstract—Interpretable machine learning plays a key role in healthcare because it is challenging to understand feature importance in deep learning model predictions. We propose a novel framework that uses deep learning to study feature sensitivity for model predictions. This work combines sensitivity analysis with heterogeneous time-series deep learning model prediction, so that the interpretation of spatio-temporal features reflects what the model has actually learned. We forecast county-level COVID-19 infection using the Temporal Fusion Transformer (TFT). We then apply a sensitivity analysis that extends the Morris method to measure how sensitive the outputs are to perturbations of our static and dynamic input features. The significance of the work is grounded in a real-world COVID-19 infection prediction task with highly non-stationary, finely granular, and heterogeneous data. 1) Our model captures the detailed daily changes of temporal and spatial model behaviors and achieves high prediction performance compared to a PyTorch baseline. 2) By analyzing the Morris sensitivity indices and attention patterns, we decipher the meaning of feature importance with observational population and dynamic model changes. 3) We have collected 2.5 years of socioeconomic and health features for 3142 US counties, including observed cases and deaths, a number of static features (age distribution, health disparity, and industry), and dynamic features (vaccination, disease spread, transmissible cases, and social distancing). Using the proposed framework, we conduct extensive experiments and show that our model can learn complex interactions and predict daily infection at the county level. The central idea is to model disease infection at the county level with a hybrid measurement of prediction accuracy and descriptive accuracy based on the Morris index, which sheds light on individual feature interpretation via sensitivity analysis.
Index Terms—Interpretability, County Level COVID-19, Time
Series Deep Learning, TFT, Sensitivity Analysis, Morris Method.
I. INTRODUCTION
Interpretation of machine learning models has recently [1] led to numerous research applications of AI for social impact. This includes direct analysis of model components with causal inference and uncertainty estimation, or studying sensitivity to input perturbations. Typically a simpler model is easier to interpret but can result in lower predictive accuracy. One natural question that arises is how to interpret complex deep learning models, which may describe the data better.
One major challenge of interpretability is the gap between model prediction accuracy and descriptive accuracy in real-world problems. The latter can be illustrated by a quantifiable measurement and explanation of individual feature importance with regard to its relevance to the model's forecast.
To our knowledge, no prior studies have evaluated individual feature importance at the county level using deep learning and the Morris method. We have been closely monitoring the scientific literature and identifying reports describing the community-level impact of COVID-19. A number of factors contribute to COVID-19 cases and deaths, including a very diverse set of socioeconomic and geography-specific features. A more granular real-time analysis that considers important county-level factors is lacking and urgently needed. Furthermore, non-stationary time series (whose distribution drifts over time) [2], time series with extreme events [3], and unknown events such as COVID variants are particularly challenging to model and interpret.
To effectively study county-level input features, we design a novel method that computes the Morris index and generalizes it to multidimensional spatial and temporal variables. Using a self-attention-based Temporal Fusion Transformer (TFT) model [4], we can capture a complex mix and full range of static and dynamic covariates, known inputs, and other exogenous time series parameters. We perform individual feature importance evaluations to identify the most influential features for prediction and the sensitivity of infected cases. The results show that the model achieves high prediction performance and learns temporal patterns. More significantly, our scaled Morris index provides a sensitivity measurement for individual features that can help policymakers develop effective control strategies in response to the rapidly evolving pandemic. We have made our code available on GitHub 1. In summary, we have made the following contributions:
• Introduce individual feature sensitivity to forecasting outputs with an extended Morris method for multidimensional spatial and temporal data.
• Model heterogeneous time-series prediction and analyze attention weights for insights on feature importance.
• Stratify county-level population characteristics (age and industry segments) from socioeconomic and health data.
1 https://github.com/Data-ScienceHub/gpce-sensitivity
arXiv:2210.03258v1 [cs.LG] 6 Oct 2022
The rest of the paper is structured as follows. Section II discusses the data collection and feature descriptions. Section III presents the background on the TFT model architecture and the Morris method. Section IV describes the data preprocessing and experimental setup. Section V analyzes the temporal patterns and feature importance insights from the TFT. Section VI discusses the sensitivity analysis with the Morris method. Section VII discusses related work, and Section VIII presents concluding remarks and possible future work.
II. INPUT DATA AND FEATURES
We collected data for 3142 US counties from multiple sources, including the CDC (Centers for Disease Control and Prevention), USA Facts [5], and Unacast [6]. The dynamic features span 02-29-2020 to 05-17-2022, except for vaccination, for which the earliest data available from the CDC [7] starts on 12-14-2020, when the US initiated a nationwide COVID-19 vaccination campaign. In total we select 9 observed features, static and dynamic, to predict cases and deaths. Fig. 1 summarizes the feature groups with the influencing factors they capture and the county characteristics they represent.
Fig. 1: The data: the feature groups and influencing factors.
Table I lists the features used by our model and their respective sources with descriptions. In particular, the age distribution, health disparities, disease spread, social distancing, and transmissible cases features are collected from the outputs of the COVID-19 Pandemic Vulnerability Index (PVI) dashboard [8], maintained by the National Institutes of Health (NIH).
III. BACKGROUND AND THEORETICAL FOUNDATION
A. Temporal Fusion Transformer
We used the TFT model [4] to predict daily COVID-19 cases and deaths at the county level. In this work, we dive deeper into the prediction of daily COVID-19 cases and combine it with a sensitivity analysis of individual features. Fig. 2 shows a high-level overview of the work. The Gated Residual Network (GRN) is the building block of TFT and enables more efficient use of the model architecture. TFT takes static metadata, time-varying past inputs, and time-varying known future inputs. The model inputs are passed through a Variable Selection Network (VSN) to select the most salient features and filter out noise.
Fig. 2: A time series forecasting model. Each sliding window
consists of time-sequential data that is split into two parts, the
past, and the future.
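The windowing described in the caption of Fig. 2 can be illustrated with a minimal sketch. The function name `sliding_windows`, the window lengths, and the toy case counts are illustrative assumptions and are not taken from the paper's code.

```python
from typing import List, Sequence, Tuple

Window = Tuple[List[float], List[float]]


def sliding_windows(series: Sequence[float], past_len: int, future_len: int) -> List[Window]:
    """Split a series into (past, future) pairs, advancing one time step per window."""
    total = past_len + future_len
    windows: List[Window] = []
    for start in range(len(series) - total + 1):
        past = list(series[start:start + past_len])            # encoder side
        future = list(series[start + past_len:start + total])  # decoder side
        windows.append((past, future))
    return windows


# Hypothetical daily case counts for one county.
daily_cases = [3, 5, 8, 13, 21, 34, 55]
windows = sliding_windows(daily_cases, past_len=3, future_len=2)
for past, future in windows:
    print(past, "->", future)
```

Each window pairs a block of past observations with the block of future values to be forecast; the window lengths used in the experiments are not stated in this section.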
Significant data points are learned by leveraging local context with an LSTM-based sequence-to-sequence layer. Past inputs are fed into the encoder, whereas known future inputs are fed into the decoder. Their outputs go through a static enrichment layer, which enhances temporal features with static metadata. Following static enrichment, TFT adds a novel interpretable multi-head self-attention mechanism to better learn the different temporal patterns. This allows TFT to learn long-range dependencies that can be challenging for Recurrent Neural Network (RNN) based models. Following the self-attention layer, additional gating layers are added to facilitate training.
B. Sensitivity Analysis and The Morris Method
Sensitivity analysis is the study of the input-output relationship in a computational model [11]. It can identify the importance of each model parameter in determining the outputs. Ref. [12] proposed gradient-based attribution, which approximates the neural network function $f(X)$ around a given input $X$ by the linear part of the Taylor expansion as

$$f(X + \Delta X) \approx f(X) + \nabla_X f(X)^T \cdot \Delta X \qquad (1)$$

and analyzed the network sensitivity by looking at how small changes $\Delta X$ at the input correlate with changes at the output. The gradient $\nabla_{x_i} f = \frac{\partial f(X)}{\partial x_i}$ gives the linear approximation of this change for a change in the $i$-th input token $x_i \in \mathbb{R}$, and the attribution of how much input token $x_i$ affects the network output $f(X)$ can be approximated by the L2 norm of $\nabla_{x_i} f$.
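As a small numerical illustration of Eq. (1), the sketch below estimates the gradient of a toy function by central differences and compares the first-order estimate with the exact output. The function `toy_model`, the step size, and the input values are hypothetical; for a scalar input the L2 norm of the gradient reduces to its absolute value.

```python
from typing import Callable, List

Model = Callable[[List[float]], float]


def numerical_gradient(f: Model, x: List[float], eps: float = 1e-6) -> List[float]:
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def toy_model(x: List[float]) -> float:
    """Hypothetical stand-in for a network output: f(x) = 3*x0 + x1^2."""
    return 3.0 * x[0] + x[1] ** 2


x0 = [1.0, 2.0]
delta = [0.01, -0.02]

grad = numerical_gradient(toy_model, x0)
# First-order Taylor estimate: f(X) + grad . delta
linear_estimate = toy_model(x0) + sum(g * d for g, d in zip(grad, delta))
exact = toy_model([a + d for a, d in zip(x0, delta)])
# Attribution per scalar input: absolute value of its partial derivative.
attributions = [abs(g) for g in grad]

print(grad, linear_estimate, exact, attributions)
```

The gap between `linear_estimate` and `exact` comes from the quadratic term that the linear approximation drops.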
TABLE I: Short description of input feature groups (and target features). Refer to references and Appendix for full details.
| Feature | Description | Data Source | Input Type |
|---|---|---|---|
| Cases | Daily COVID-19 cases | USA Facts [5] | Target |
| Deaths | Daily COVID-19 deaths | USA Facts [5] | Target |
| Age Distribution | Percentage of population aged 65 or older | SVI [9] | Static |
| Health Disparities | Uninsured population percent and socioeconomic status | SVI [9] | Static |
| Industry | Percentage of population in different industry sectors (only used in Section VI) | Census Bureau [10] | Static |
| Vaccination | Percentage of population fully vaccinated | CDC [7] | Observed |
| Disease Spread | Fraction of total cases from the last 14 days (one incubation period) | USA Facts [5] | Observed |
| Transmissible Cases | Population size divided by cases from the last 14 days | USA Facts [5] | Observed |
| Social Distancing | Change in distance travelled relative to baseline (previous year), based on cell phone mobility data | Unacast [6] | Observed |
| SinWeekly | sin(day of the week / 7) | Date | Known Future |
| CosWeekly | cos(day of the week / 7) | Date | Known Future |
| Linear Space | Unique index for each county | USA Facts [5] | Known Future |
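The two known-future weekly features in Table I can be computed directly from a date. The sketch below follows the table's expression literally; the function name `weekly_features`, the Monday-is-zero day numbering, and the optional `scale` argument are assumptions, since the table does not state which day convention or angular scaling the authors used.

```python
import math
from datetime import date
from typing import Tuple


def weekly_features(day: date, scale: float = 1.0) -> Tuple[float, float]:
    """Return (SinWeekly, CosWeekly) = (sin(scale * dow / 7), cos(scale * dow / 7)).

    dow is date.weekday(), i.e. Monday = 0 ... Sunday = 6 (an assumed convention).
    scale = 1.0 reproduces the expression in Table I as written; passing
    2 * math.pi would give a full-cycle encoding, which the table does not state.
    """
    dow = day.weekday()
    angle = scale * dow / 7
    return math.sin(angle), math.cos(angle)


monday = weekly_features(date(2020, 3, 2))    # a Monday
thursday = weekly_features(date(2020, 3, 5))  # a Thursday
print(monday, thursday)
```

Because these values depend only on the calendar, they are available for every day in the forecast horizon, which is why they can be used as known future inputs.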
C. Problem Statement
We use deep learning to study feature sensitivity for model predictions of COVID-19 infection at the county level. To this end, we adopted the Morris method [13], a reliable and efficient sensitivity analysis method that defines the sensitivity of a model input as the ratio of the change in an output variable to the change in an input feature. More precisely, given a model $Y = f(X)$, the sensitivity (or the elementary effect) of a model input feature $x_i$ can be defined as

$$EE_i(X) = \frac{y(x_1, x_2, \ldots, x_i + \Delta, \ldots, x_k) - y(X)}{\Delta} \qquad (2)$$

where $X$ is a scaled vector of $k$ parameters and $\Delta$ is the change to an input feature. Since elementary effects may cancel each other out, the mean of the absolute values in the distribution of $EE_i(X)$, denoted by $\mu^*$ (called the Morris Index), is recommended because it provides the true importance of features [14].
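A short sketch of Eq. (2) shows why the absolute value matters. The toy model, the sample points, and the step `delta` are hypothetical; the model is chosen so that the elementary effects of the first input have alternating signs and cancel in a plain mean.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def elementary_effect(model: Model, x: Sequence[float], i: int, delta: float) -> float:
    """Eq. (2): change in output per unit change of input i."""
    perturbed = list(x)
    perturbed[i] += delta
    return (model(perturbed) - model(x)) / delta


def morris_mu_star(model: Model, samples: List[List[float]], i: int, delta: float) -> float:
    """Mean of the absolute elementary effects of input i over the sample points."""
    effects = [abs(elementary_effect(model, x, i, delta)) for x in samples]
    return sum(effects) / len(effects)


def toy_model(x: Sequence[float]) -> float:
    """Hypothetical model: the effect of x[0] equals x[1], so its sign varies."""
    return x[0] * x[1]


samples = [[0.2, 1.0], [0.5, -1.0], [0.8, 1.0], [0.1, -1.0]]
delta = 0.1

signed = [elementary_effect(toy_model, x, 0, delta) for x in samples]
plain_mean = sum(signed) / len(signed)                  # effects cancel out
mu_star = morris_mu_star(toy_model, samples, 0, delta)  # cancellation avoided

print(plain_mean, mu_star)
```

Here the signed mean is close to zero even though the first input clearly changes the output, while the mean of absolute effects is close to one.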
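Algorithm 1: Novel Morris Index Calculation for Spatio-temporal Data
Input: $X = \{x_1, x_2, \ldots, x_k\}$, target feature $x_i \in X$ with dimension $[C, T]$, model $y$, change $\Delta$
// $X$ is a set of $k$ input features, $\Delta$ is the change to $x_i$
$Y' \leftarrow y(x_1, x_2, \ldots, x_i + \Delta, \ldots, x_k)$
$Y \leftarrow y(X)$
$G \leftarrow 0$, $t \leftarrow 0$, $c \leftarrow 0$
while $t < T$ do    /* Temporal: loop through 640 days */
    while $c < C$ do    /* Spatial: loop through 3142 US counties */
        $G \leftarrow G + |Y'[c][t] - Y[c][t]|$    /* Total change */
        $c \leftarrow c + 1$
    $t \leftarrow t + 1$
    $c \leftarrow 0$
// Calculate the normalized Morris Index $\hat{\mu}$
$\hat{\mu} \leftarrow G / (C \cdot T \cdot \Delta)$
return $\hat{\mu}$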
The original Morris method was proposed to screen static input factors (or static features). It does not directly apply to time series datasets (or dynamic features), where both the temporal and the spatial variation of a feature can have an important overall influence on the COVID-19 case predictions made by TFT. Hence, we design and implement a revised Morris method, Algorithm 1, to handle the spatio-temporal COVID-19 data sequences. The algorithm calculates a normalized Morris Index $\hat{\mu}$ by dividing the total change to the output $G$ by the total number of counties $C$, the total number of daily timestamps $T$, and the change to the input $\Delta$. In this study, $C$ is the total number of counties and $T$ is the total number of daily timestamps between 2-29-2020 and 11-29-2021.
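The normalization in Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal sketch and not the authors' implementation: the dictionary-of-grids input layout, the function name `normalized_morris_index`, and the linear `toy_model` standing in for a trained TFT are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Grid = List[List[float]]  # indexed as grid[county][day], shape [C][T]
Features = Dict[str, Grid]


def normalized_morris_index(model: Callable[[Features], Grid], features: Features,
                            target: str, delta: float) -> float:
    """Total absolute output change divided by C * T * delta."""
    perturbed = dict(features)
    perturbed[target] = [[v + delta for v in row] for row in features[target]]

    y_base = model(features)
    y_pert = model(perturbed)

    num_counties = len(y_base)
    num_days = len(y_base[0])
    total_change = 0.0
    for t in range(num_days):          # temporal loop
        for c in range(num_counties):  # spatial loop
            total_change += abs(y_pert[c][t] - y_base[c][t])
    return total_change / (num_counties * num_days * delta)


def toy_model(features: Features) -> Grid:
    """Hypothetical stand-in for a trained model: 2*vaccination - 0.5*mobility."""
    return [[2.0 * v - 0.5 * m for v, m in zip(v_row, m_row)]
            for v_row, m_row in zip(features["vaccination"], features["mobility"])]


features = {
    "vaccination": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # 2 counties x 3 days
    "mobility": [[1.0, 0.9, 0.8], [0.7, 0.6, 0.5]],
}
mu_vaccination = normalized_morris_index(toy_model, features, "vaccination", 0.01)
mu_mobility = normalized_morris_index(toy_model, features, "mobility", 0.01)
print(mu_vaccination, mu_mobility)
```

For this linear toy model the index recovers the magnitude of each coefficient, which makes the effect of dividing by the number of counties, the number of days, and the perturbation size easy to check by hand.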
IV. EXPERIMENTAL SETUP
A. Computational Resources
We implement our TFT model with both TensorFlow [4] and PyTorch [15]. We then conducted a performance evaluation of the model training on Google Colab and HPC clusters, including the GPU nodes in Table II. The model training time is about 30 hours. Each training epoch takes on average 50 minutes on a GPU node with at least 32GB of RAM. Each Morris run with a trained model, including the additional feature analysis, takes around 35 minutes.
TABLE II: Runtime environment and hardware specification.
| Driver | CUDA | Processor | NVIDIA GPU |
|---|---|---|---|
| 470.82.01 | 11.4 | Intel Xeon | A100-SXM4-40GB, Tesla P100-PCIE, Tesla V100-SXM2, Tesla K80 |
B. Evaluation Metrics
Our forecasting models are evaluated using the following
metrics. Mean Squared Error (MSE) is used as the loss
function following prior works on COVID-19 forecasting [2],
[16]. Other metrics include Mean Absolute Error (MAE),
Root Mean Square Error (RMSE), Symmetric Mean Absolute
Percentage Error (SMAPE), and Normalized Nash-Sutcliffe
Efficiency (NNSE) [17].
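For reference, the metrics listed above can be written out as a minimal sketch. The function names and sample values are illustrative. SMAPE has several variants in the literature, and the one used here (mean of absolute error over the average magnitude, in percent) is an assumption; NNSE is computed as 1 / (2 - NSE).

```python
import math
from typing import Sequence


def mae(obs: Sequence[float], pred: Sequence[float]) -> float:
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def smape(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Assumed variant: 100/n * sum(|p - o| / ((|o| + |p|) / 2)), with 0/0 treated as 0."""
    total = 0.0
    for o, p in zip(obs, pred):
        denom = (abs(o) + abs(p)) / 2
        total += abs(p - o) / denom if denom else 0.0
    return 100.0 * total / len(obs)


def nnse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """NNSE = 1 / (2 - NSE); assumes the observed series is not constant."""
    mean_obs = sum(obs) / len(obs)
    error_ss = sum((o - p) ** 2 for o, p in zip(obs, pred))
    variance_ss = sum((o - mean_obs) ** 2 for o in obs)
    nse = 1.0 - error_ss / variance_ss
    return 1.0 / (2.0 - nse)


observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 41.0]
mean_only = [25.0] * 4  # always predicts the mean of the observations

print(mae(observed, predicted), rmse(observed, predicted),
      smape(observed, predicted), nnse(observed, predicted))
print(nnse(observed, observed), nnse(observed, mean_only))
```

A predictor that always returns the observed mean has NSE = 0 and therefore NNSE = 0.5, while a perfect predictor gives NNSE = 1.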
These metrics have been widely used in evaluating regres-
sion model performance [18]. The benefit of using NNSE is its
robustness to error variance. NNSE is 1 for a perfect model.
A model with an error variance equal to that of the observed
time series will give NNSE = 0.5 (NSE=0). When the error