Analyzing historical diagnosis code data from NIH
N3C and RECOVER Programs using deep learning
to determine risk factors for Long Covid
Saurav Sengupta∗, Johanna Loomba†, Suchetha Sharma‡, Donald E. Brown‡
Lorna Thorpe§, Melissa A. Haendel§, Christopher G. Chute§, Stephanie Hong§
on behalf of the N3C and RECOVER consortia
∗Department of Engineering Systems and Environment, University of Virginia
†integrated Translational Health Research Institute of Virginia (iTHRIV), University of Virginia
‡School of Data Science, University of Virginia
§NIH
{ss4yd, jjl4d, ss4jg, deb}@virginia.edu, lorna.thorpe@nyulangone.org, melissa@tislab.org, chute@jhu.edu, shong59@jh.edu
Abstract—Post-acute sequelae of SARS-CoV-2 infection (PASC), or Long COVID, is an emerging medical condition observed in many patients with a positive COVID-19 diagnosis. Historical Electronic Health Record (EHR) data, such as diagnosis codes, lab results, and clinical notes, have been analyzed using deep learning and used to predict future clinical events. In this paper, we propose an interpretable deep learning approach to analyze historical diagnosis code data from the National COVID Cohort Collaborative (N3C)¹ to find the risk factors contributing to developing Long COVID. Using our deep learning approach, we are able to predict whether a patient is suffering from Long COVID from a temporally ordered list of diagnosis codes recorded up to 45 days after each patient's first positive COVID-19 test or diagnosis, with an accuracy of 70.48%. We then examine the trained model using Gradient-weighted Class Activation Mapping (GradCAM) to assign each input diagnosis a score. The highest-scoring diagnoses were deemed the most important for making the correct prediction for a patient. We also propose a way to summarize these top diagnoses for each patient in our cohort and examine their temporal trends to determine which codes contribute towards a positive Long COVID diagnosis.
Index Terms—COVID-19, EHR, deep learning, GradCAM
I. INTRODUCTION
Some people infected with SARS-CoV-2, the virus that causes COVID-19, have demonstrated a wide range of health problems that can last long after the initial infection, a condition that has been termed Long COVID. According to the World Health Organization (WHO), approximately 10-20% of people infected with COVID-19 experience a variety of health conditions in the mid to long term after they recover from the initial illness. According to the NIH REsearching COVID to Enhance Recovery (RECOVER)² program, which seeks to understand, treat, and prevent PASC, this condition generally refers to ongoing health effects, new or existing symptoms, and other health problems that occur after the acute phase of SARS-CoV-2 infection (i.e., present four or more weeks after the acute infection). It has therefore become necessary to identify risk factors in a patient's medical history that can lead to them experiencing Long COVID.

¹ https://ncats.nih.gov/n3c
² For more information on RECOVER, visit https://recovercovid.org/
The N3C repository contains records of patients with the newly introduced ICD-10³ code U09.9 ("Post COVID-19 condition"), which is being used to refer to patients diagnosed with Long COVID [1]. The N3C repository also contains conditions, measurements, and other medical records for these patients. We focus our analysis on all the medical conditions recorded in the form of ICD-10 codes for these patients up to 45 days after the first COVID-19 diagnosis.

³ International Classification of Diseases, 10th Revision
Previous efforts have focused on feature creation based on comorbidities, demographics, medications, and healthcare utilization derived from EHR data to develop machine learning models that can predict whether a patient will develop Long COVID [2]. In this work, we incorporate all conditions, rather than limiting features to a pre-defined list of comorbidities, to build a deep learning model that captures a more complete picture of a patient's medical history and finds risk factors associated with Long COVID. We test different architectures for analyzing longitudinal data using all diagnosis codes present in a patient's medical history arranged temporally, and then use interpretability methods to identify conditions associated with the risk of developing Long COVID.
II. RELATED WORK
Many previous works have analyzed temporal EHR data in different settings, such as predicting clinical events and risk stratification [3]. They have focused on Long Short-Term Memory (LSTM) networks, a form of recurrent neural network (RNN), to model longitudinal data [4]. RNNs are a class of neural network with looping connections between nodes such that temporal information persists [5], [6]. This makes them very useful for analyzing time series data and applications where sequences have to be analyzed, such as speech recognition and language modeling. However, RNNs, by their nature, cannot remember long-term dependencies in a sequence [7]. LSTMs are a special kind of RNN architected to remember long-term dependencies [5]. An LSTM unit consists of a cell and three gates: input, output, and forget. The cell remembers information at each time step, and the gates control the flow of information, that is, whether to pass on or discard information to the next time step.
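To make this concrete, below is a minimal sketch (PyTorch, with hypothetical dimensions; it is not the exact model from [8] or from this paper) of running an LSTM over a batch of embedded diagnosis-code sequences. The 200-dimensional input matches the SNOMED embeddings and the length-1000 sequences described later.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; embed_dim matches the 200-dimensional SNOMED
# embeddings and seq_len the K = 1000 code limit described in Section III.
batch_size, seq_len, embed_dim, hidden_dim = 32, 1000, 200, 128

# One embedding vector per diagnosis code, in temporal order.
x = torch.randn(batch_size, seq_len, embed_dim)

lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

# outputs holds the hidden state at every time step; (h_n, c_n) are the final
# hidden and cell states. Internally, each step applies the input, forget and
# output gates to decide what to write into, keep in, and read out of the cell.
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)  # torch.Size([32, 1000, 128])
print(h_n.shape)      # torch.Size([1, 32, 128])
```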
Zhang et al. [8] use an LSTM-based model to generate representations for a cohort of Parkinson's disease patients. The input is a temporally ordered list of features $\{x_1, x_2, \ldots, x_{N_p}\}$ at different times, extracted from the patient's EHR. A set of features is selected as prediction targets. At each time step $t_i$, the input is passed through two LSTM layers, and the hidden state output of the final LSTM layer is used for calculating the loss functions.
An LSTM network is unidirectional, in that it only preserves information from the past, because at any step it has only seen that part of the sequence. A bidirectional LSTM (BiLSTM), on the other hand, processes the input in both directions: a forward pass reads the sequence from past to future, and a backward pass reads it from future to past. Therefore, at any given time step, we are able to use information from both the past and the future [6]. BiLSTMs have been successfully used to analyze patients' neuropsychological test scale data, genetic data, and tomographic data at the first, sixth, and twelfth months to predict Alzheimer's Disease [9].
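Continuing the sketch above, making the LSTM bidirectional is a one-flag change; the per-step outputs of the two directions are concatenated, doubling the output width.

```python
# Same sketch as before, but bidirectional: the forward direction reads the
# sequence past-to-future, the backward direction future-to-past.
bilstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim,
                 batch_first=True, bidirectional=True)
bi_outputs, _ = bilstm(x)
print(bi_outputs.shape)  # torch.Size([32, 1000, 256]) -- 2 * hidden_dim
```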
Attention has been used in addition to an LSTM network in [10]. It works by extracting the hidden states from an LSTM network and training an additional attention layer that calculates an attention score $\alpha_t$ used to weight the inputs. We can later extract these attention weights during inference to understand which parts of the input were given higher weight during classification.
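Below is a minimal sketch of one common formulation of such an attention layer over LSTM hidden states (the exact scoring function used in [10] may differ):

```python
import torch.nn.functional as F

class Attention(nn.Module):
    """Scores each LSTM hidden state, softmaxes the scores over time to get
    weights alpha_t, and returns the weighted sum plus the weights."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states):              # (batch, seq_len, hidden_dim)
        scores = self.scorer(hidden_states)        # (batch, seq_len, 1)
        alpha = F.softmax(scores, dim=1)           # attention weights alpha_t
        context = (alpha * hidden_states).sum(1)   # (batch, hidden_dim)
        return context, alpha.squeeze(-1)

attn = Attention(hidden_dim)
lstm_out, _ = lstm(x)             # reuse the LSTM sketch from above
context, alpha = attn(lstm_out)   # alpha is inspectable at inference time
```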
2D Convolutional Neural Networks (CNNs) have primarily been used for computer vision applications, where multiple filters are trained to detect different input image features. 1D CNNs have been shown to work on time series problems like longitudinal EHR data [11]. A 1D convolution works over the temporal dimension with different filter sizes, where the different filters learn different temporal patterns. This process produces feature vectors that are then passed through a non-linearity such as a Rectified Linear Unit (ReLU) or Tanh.
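A sketch of this, reusing the embedded batch x from the LSTM example above (the filter count, filter width, and two-class head are arbitrary illustrative choices):

```python
# Conv1d expects (batch, channels, time), so the embedding axis becomes the
# channel axis; kernel_size=5 means each filter spans 5 consecutive codes.
conv = nn.Conv1d(in_channels=embed_dim, out_channels=64, kernel_size=5)
feats = torch.relu(conv(x.transpose(1, 2)))   # (32, 64, 996) feature vectors
pooled = feats.max(dim=2).values              # global max-pool over time
logits = nn.Linear(64, 2)(pooled)             # e.g. Long COVID vs. control
```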
Gradient-weighted Class Activation Mapping (GradCAM) has been used to examine 1D CNN-based models that analyze protein sequences and to find regions in the input sequences that help the model make the correct prediction [12]. GradCAM is generally used in computer vision to generate localization maps for a given concept (class) in the input image [13]. These maps are made by computing the gradient of the predicted class score with respect to the activations of the final convolutional layer, pooling these gradients channel-wise, and weighting the corresponding activation channels by the pooled gradients; the weighted activations can then be inspected to find which parts of the input helped in the classification.
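A sketch of that computation for the 1D case, applied to the toy CNN above (this is the generic GradCAM recipe, not necessarily the paper's exact implementation):

```python
def grad_cam_1d(conv_acts, class_score):
    """conv_acts: (batch, channels, time) activations of the final conv layer,
    still attached to the autograd graph; class_score: scalar logit of the
    predicted class. Returns one importance score per input position."""
    grads, = torch.autograd.grad(class_score, conv_acts, retain_graph=True)
    weights = grads.mean(dim=2, keepdim=True)           # pool gradients channel-wise
    cam = torch.relu((weights * conv_acts).sum(dim=1))  # weight and sum channels
    return cam                                          # (batch, time)

pred_class = logits[0].argmax()
cam = grad_cam_1d(feats, logits[0, pred_class])  # per-code scores for patient 0
```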
Deep learning in general has thus been shown to provide value in analyzing clinical EHR data in a variety of areas. In the following sections, we define our methodology for analyzing such data in the N3C cohort.
III. METHODOLOGY
A. Dataset
The N3C data transfer to NCATS is performed under Johns Hopkins University Reliance Protocol #IRB00249128 or individual site agreements with NIH. The N3C repository contains N = 14,026,265 patients, of which 5,409,269 are COVID-19 positive [14]. COVID-19 cases are defined as per CDC guidance [15]. We construct our Long COVID positive cohort from patients with an existing U09.9 code or a Long COVID clinic visit. Controls were constructed by choosing, for each Long COVID patient, 5 random patients from the same site and within 90 days of the Long COVID patient. In the end, we have 49,950 total patients, of which 7,511 are Long COVID patients and 38,649 are Control patients.
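As an illustration only, the sketch below implements one plausible reading of this matching rule. The DataFrame column names are hypothetical, and interpreting "within 90 days" as within 90 days of the Long COVID patient's COVID index date is our assumption, not a detail stated above.

```python
import pandas as pd

def sample_controls(cases: pd.DataFrame, candidates: pd.DataFrame,
                    n: int = 5, seed: int = 0) -> pd.DataFrame:
    """For each case, sample up to n candidate patients from the same site
    whose index_date falls within 90 days of the case's index_date."""
    picks = []
    for case in cases.itertuples():
        pool = candidates[
            (candidates["site"] == case.site)
            & ((candidates["index_date"] - case.index_date).abs().dt.days <= 90)
        ]
        picks.append(pool.sample(min(n, len(pool)), random_state=seed))
    return pd.concat(picks)
```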
B. Data pre-processing
The N3C EHR repository contains all historical medical diagnosis codes, stored using the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) vocabulary, for all patients. SNOMED-CT is a clinical terminology widely used by healthcare providers for documentation and reporting within health systems [16]. Therefore, for each patient, we have a list of these diagnosis codes along with the date when each code was recorded. Since our goal is to find risk factors that can predispose a patient to suffering from Long COVID, we focus on all conditions, excluding the Long COVID diagnosis itself, in the patient's diagnostic history up to 45 days after the first COVID-19 diagnosis or positive test, which we use as the acute phase cut-off. We arrange all these conditions in an ordered list from the earliest to the latest. We also ensure that we insert only one record into the ordered list for any condition that was repeatedly recorded in a single day, which can occur when a patient has multiple tests or diagnoses on the same day. At the end of this process, each patient $p_i$ has an ordered list of diagnosis codes $[d_1, d_2, \ldots, d_K]$, where $K = 1000$. We select $K = 1000$ as the upper limit on the length of the list of diagnosis codes because we found that 99% of our patients had fewer than 1,000 diagnosis codes in their medical history. We pad all shorter inputs to this uniform length using padding tokens ($[PAD]$). Conditions for which we do not have prior embeddings are replaced with $[UNK]$ tokens.
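A minimal sketch of this sequence construction; the per-patient data layout is our assumption, while the token names and the K = 1000 limit follow the text:

```python
K, PAD, UNK = 1000, "[PAD]", "[UNK]"

def build_sequence(records, known_codes):
    """records: (date, code) pairs for one patient, already restricted to the
    window up to 45 days post the first COVID diagnosis; known_codes: the set
    of codes for which pre-trained embeddings exist."""
    seen, ordered = set(), []
    for date, code in sorted(records, key=lambda r: r[0]):  # earliest first
        if (date, code) in seen:          # keep one record per code per day
            continue
        seen.add((date, code))
        ordered.append(code if code in known_codes else UNK)
    ordered = ordered[:K]                 # truncate the ~1% of longer histories
    return ordered + [PAD] * (K - len(ordered))
```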
C. Pre-trained SNOMED-CT embeddings
Prior work focuses on learning embedded vector representations to make medical concepts analyzable via mathematical models and subsequently building models for analysis [3]. To analyze the temporal patterns in an ordered list of concept codes using deep learning, we first have to transform the codes into equivalent vector representations that capture semantic meaning and similarities between different diagnoses. We used 200-dimensional SNOMED embeddings trained using SNOMED2Vec, a graph-based representation learning method.
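A sketch of how such pre-trained vectors might be loaded into an embedding layer, with extra zero rows for the [PAD] and [UNK] tokens defined above. The snomed_vectors dict and the example SNOMED code are hypothetical stand-ins for the actual SNOMED2Vec output.

```python
import numpy as np
import torch
import torch.nn as nn

embed_dim = 200
# Hypothetical stand-in for the pre-trained 200-d SNOMED2Vec vectors.
snomed_vectors = {"22298006": np.random.rand(embed_dim).astype(np.float32)}

vocab = [PAD, UNK] + sorted(snomed_vectors)          # code -> row index
code2idx = {code: i for i, code in enumerate(vocab)}

weights = np.zeros((len(vocab), embed_dim), dtype=np.float32)
for code, vec in snomed_vectors.items():
    weights[code2idx[code]] = vec                    # [PAD]/[UNK] rows stay zero

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(weights), freeze=True, padding_idx=code2idx[PAD])

# Map one padded code sequence (e.g. from build_sequence) to its input tensor.
seq = ["22298006"] + [PAD] * 999
ids = torch.tensor([[code2idx.get(c, code2idx[UNK]) for c in seq]])
x = embedding(ids)                                   # shape (1, 1000, 200)
```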