to be analyzed, as in speech recognition, language modeling, etc.
However, RNNs, by their nature, cannot remember long-term
dependencies in a sequence [7]. LSTMs are a special kind of
RNN architected to remember long-term dependencies [5].
An LSTM unit consists of a cell and three gates: input,
output, and forget. The cell remembers information at each
time step, and the gates control the flow of information, that
is, whether information is passed on to the next time step or
discarded.
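For completeness, one standard formulation of the LSTM update (following [5]) is:
\begin{align*}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), &
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), &
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &
h_t &= o_t \odot \tanh(c_t),
\end{align*}
where $\sigma$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, $c_t$ is the cell state, and $h_t$ is the hidden state.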
Zhang et al. [8] use an LSTM-based model to generate representations
for a cohort of Parkinson's disease patients. The
input is a temporally ordered list of features $\{x_1, x_2, \ldots, x_{N_p}\}$
at different times, extracted from the patient's EHR. A set of
features is selected as prediction targets. At each time step $t_i$,
the input is passed through two LSTM layers, and the hidden
state output of the final LSTM layer is used to calculate
the loss functions.
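A minimal PyTorch sketch of this kind of setup is shown below; it is not the authors' implementation, and the feature, hidden, and target dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatientLSTM(nn.Module):
    """Two stacked LSTM layers over a patient's temporally ordered
    feature vectors; the final layer's hidden states feed a linear
    head whose outputs are used in the loss. Sizes are assumptions."""
    def __init__(self, n_features=200, hidden=128, n_targets=10):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)  # predicts the selected target features

    def forward(self, x):            # x: (batch, time, n_features)
        out, _ = self.lstm(x)        # out: hidden states of the final LSTM layer
        return self.head(out)        # per-time-step predictions for the loss

# preds = PatientLSTM()(torch.randn(4, 50, 200))  # shape (4, 50, 10)
```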
An LSTM network is unidirectional: at each step it preserves
information only from the past, because that is the only part
of the sequence it has seen. A bidirectional LSTM (BiLSTM), on the
other hand, processes the input in both directions, from past
to future and from future to past.
Therefore, at any given time step, the model can use information
from both the past and the future [6]. A BiLSTM has been
successfully used to analyze patients' neuropsychological
test scale data, genetic data, and tomographic data at the first,
sixth, and twelfth months to predict Alzheimer's disease [9].
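In PyTorch, for example, bidirectionality is a single flag; the sketch below uses illustrative sizes.

```python
import torch.nn as nn

# A bidirectional LSTM: the same sequence is processed forwards and
# backwards, and the two directions' hidden states are concatenated,
# so the per-step output size doubles to 2 * hidden_size.
bilstm = nn.LSTM(input_size=200, hidden_size=128,
                 num_layers=1, batch_first=True, bidirectional=True)
# out, _ = bilstm(x)   # x: (batch, time, 200) -> out: (batch, time, 256)
```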
Attention has been used in addition to an LSTM network in
[10]. It works by extracting the hidden states from an LSTM
network and training an additional attention layer to compute
an attention score $\alpha_t$ that weights the inputs. These
attention weights can later be extracted during inference to
understand which parts of the input were given higher weight
during classification.
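A generic sketch of such an attention layer is given below; it is one common formulation, not necessarily the exact layer of [10].

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Attention over LSTM hidden states: a learned layer scores each
    time step, a softmax turns the scores into weights alpha_t, and the
    weighted sum of hidden states forms a context vector. alpha can be
    read out at inference to see which inputs the model emphasized."""
    def __init__(self, hidden=128):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, h):                                        # h: (batch, time, hidden)
        alpha = torch.softmax(self.score(h).squeeze(-1), dim=1)  # (batch, time)
        context = (alpha.unsqueeze(-1) * h).sum(dim=1)           # (batch, hidden)
        return context, alpha                                    # keep alpha for inspection
```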
2D Convolutional Neural Networks (CNNs) have been primarily
used for computer vision applications, where multiple
filters are trained to detect different input image features. 1D
CNNs have been shown to work on time series problems
such as longitudinal EHR data [11]. A 1D convolution operates over
the temporal dimension with different filter sizes, where
different filters learn different temporal patterns. This process
produces feature vectors, which are then passed through a non-linearity
such as a Rectified Linear Unit (ReLU) or tanh.
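The sketch below illustrates this pattern; the filter counts and kernel sizes are illustrative assumptions, not the configuration used in [11].

```python
import torch
import torch.nn as nn

class TemporalCNN(nn.Module):
    """1D convolutions over the temporal axis of an embedded code
    sequence; each kernel size captures temporal patterns of a
    different span, followed by a ReLU non-linearity."""
    def __init__(self, emb_dim=200, n_filters=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):                # x: (batch, time, emb_dim)
        x = x.transpose(1, 2)            # Conv1d expects (batch, channels, time)
        feats = [torch.relu(conv(x)) for conv in self.convs]
        return torch.cat(feats, dim=1)   # concatenated temporal feature maps
```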
Gradient-weighted Class Activation Mapping (GradCAM)
has been used to examine 1D CNN-based models that analyze
protein sequences, finding regions in the input sequences that
help the model make the correct prediction [12]. GradCAM
is generally used in computer vision to generate localization
maps for a given concept (class) in an input image [13]. These
maps are produced by computing the gradients of the predicted
class score with respect to the activation maps of the final
convolutional layer, pooling these gradients channel-wise to obtain
per-channel weights, and weighting the corresponding activation
channels with those weights; the resulting map can then be
inspected to find which parts of the input contributed to the
classification.
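A compact sketch of this recipe for a 1D CNN follows; it assumes the final convolutional layer's activations have been captured (e.g., via a forward hook) so that gradients can flow to them.

```python
import torch

def grad_cam_1d(activations, class_score):
    """Grad-CAM for a 1D CNN, following the recipe in [13]:
    gradients of the predicted class score w.r.t. the final conv
    layer's activation maps are pooled channel-wise into weights,
    the activations are weighted by them, and a ReLU keeps positive
    evidence. activations: (batch, channels, time); class_score: scalar."""
    grads = torch.autograd.grad(class_score, activations, retain_graph=True)[0]
    weights = grads.mean(dim=2, keepdim=True)             # channel-wise pooled gradients
    cam = torch.relu((weights * activations).sum(dim=1))  # (batch, time)
    return cam / (cam.max(dim=1, keepdim=True).values + 1e-8)  # per-sample normalization
```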
Therefore, deep learning in general has been shown to
provide value in analyzing clinical EHR data across a variety
of areas. In the following sections, we define our methodology
for such analysis on the N3C cohort.
III. METHODOLOGY
A. Dataset
The N3C data transfer to NCATS is performed under a
Johns Hopkins University Reliance Protocol # IRB00249128
or individual site agreements with NIH. The N3C repository
contains N = 14,026,265 patients, of which 5,409,269 are
COVID-19 positive [14]. COVID cases are defined as per CDC
guidance [15]. We construct our Long COVID positive cohort
from patients with an existing U09.9 code or a long COVID
clinic visit. Controls were constructed by choosing 5 random
patients from the same site and within 90 days of the long
COVID patient. In the end, we have 49,950 total patients, of
which 7,511 are Long COVID patients and 38,649 are Control
patients.
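An illustrative pandas sketch of this 1-to-5 control sampling is given below; the column names and the interpretation of the 90-day window around the index date are our assumptions, not the exact N3C pipeline.

```python
import pandas as pd

def sample_controls(cases: pd.DataFrame, candidates: pd.DataFrame,
                    n_per_case: int = 5) -> pd.DataFrame:
    """For each long COVID case, draw up to 5 random candidates from
    the same site whose index date falls within 90 days of the case's."""
    controls = []
    for _, case in cases.iterrows():
        pool = candidates[
            (candidates["site_id"] == case["site_id"])
            & ((candidates["index_date"] - case["index_date"]).abs()
               <= pd.Timedelta(days=90))
        ]
        controls.append(pool.sample(min(n_per_case, len(pool)), random_state=0))
    return pd.concat(controls, ignore_index=True)
```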
B. Data pre-processing
The N3C EHR repository contains, for all patients, all historical
medical diagnosis codes stored using the Systematized Nomenclature
of Medicine – Clinical Terms (SNOMED-CT) vocabulary.
SNOMED-CT is a clinical terminology widely
used by healthcare providers for documentation and reporting
within health systems [16]. Therefore, for each patient, we
have a list of these diagnosis codes along with the date when
the code was recorded. Since our goal is to find risk factors
that can predispose a patient to suffer from Long COVID,
we focus on all conditions, excluding the Long COVID
diagnosis itself, in the patient's diagnostic history up to 45 days
after the first COVID diagnosis or positive test, which we use
as the acute-phase cutoff. We arrange all these conditions
in an ordered list from the earliest to the latest. We also
insert only one record in the ordered list for any condition
recorded repeatedly on a single day, which can occur when a
patient has multiple tests or diagnoses on the same day.
At the end of this process, each
patient $p_i$ has an ordered list of diagnosis codes $[d_1, d_2, \ldots, d_K]$,
where $K = 1000$. We select $K = 1000$ as the upper limit on
the length of the diagnosis code list because we found that 99%
of our patients had fewer than 1,000 diagnosis codes in
their medical history. We pad all inputs shorter than 1,000
codes to uniform length using padding tokens ([PAD]). For
conditions for which we do not have prior embeddings, we
substitute [UNK] tokens.
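A minimal sketch of this truncation, unknown-token, and padding step follows; the token strings match the text above, while the function and vocabulary representation are our assumptions.

```python
PAD, UNK, K = "[PAD]", "[UNK]", 1000

def prepare_codes(codes, vocab):
    """Truncate a patient's ordered SNOMED code list to K entries,
    map codes lacking pre-trained embeddings to [UNK], and pad the
    result with [PAD] tokens to a uniform length of K."""
    codes = [c if c in vocab else UNK for c in codes[:K]]
    return codes + [PAD] * (K - len(codes))
```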
C. Pre-trained SNOMED-CT embeddings
Prior work focuses on learning embedded vector representations
that make medical concepts analyzable via mathematical
models, and subsequently on building models for analysis
[3]. To analyze the temporal patterns in an ordered list of
concept codes using deep learning, we first have to transform
them into their equivalent vector representations that also
capture semantic meaning and similarities between different
diagnoses. We used 200-dimensional SNOMED embeddings
trained using SNOMED2Vec, a graph-based representation