to be analyzed, as in speech recognition, language modeling, etc.
However, RNNs, by their nature, cannot remember long-term
dependencies in a sequence [7]. LSTMs are a special kind of
RNN architected to remember long-term dependencies [5].
An LSTM unit consists of a cell and three gates: input,
output, and forget. The cell remembers information at each
time step, and the gates control the flow of information, that
is, whether information is passed on to the next time step or
discarded.
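For completeness, one standard formulation of the LSTM update (following [5]) is:
\begin{align*}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), &
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), &
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &
h_t &= o_t \odot \tanh(c_t),
\end{align*}
where $\sigma$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, $c_t$ is the cell state, and $h_t$ is the hidden state.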
Zhang et al. [8] use an LSTM-based model to generate representations
for a cohort of Parkinson's disease patients. The
input is a temporally ordered list of features $\{x_1, x_2, \ldots, x_{N_p}\}$
at different times, extracted from the patient's EHR. A set of
features is selected as prediction targets. At each time step $t_i$,
the input is passed through two LSTM layers, and the hidden
state output of the final LSTM layer is used to calculate
the loss functions.
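A minimal PyTorch sketch of this kind of setup is shown below; it is not the authors' implementation, and the feature, hidden, and target dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatientLSTM(nn.Module):
    """Two stacked LSTM layers over a patient's temporally ordered
    feature vectors; the final layer's hidden states feed a linear
    head whose outputs are used in the loss. Sizes are assumptions."""
    def __init__(self, n_features=200, hidden=128, n_targets=10):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)  # predicts the selected target features

    def forward(self, x):            # x: (batch, time, n_features)
        out, _ = self.lstm(x)        # out: hidden states of the final LSTM layer
        return self.head(out)        # per-time-step predictions for the loss

# preds = PatientLSTM()(torch.randn(4, 50, 200))  # shape (4, 50, 10)
```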
An LSTM network is unidirectional: at each step it preserves
information only from the past, because that is the only part
of the sequence it has seen. A bidirectional LSTM (BiLSTM), on the
other hand, processes the input in both directions, from past
to future and from future to past.
Therefore, at any given time step, the model can use information
from both the past and the future [6]. A BiLSTM has been
successfully used to analyze patients' neuropsychological
test scale data, genetic data, and tomographic data at the first,
sixth, and twelfth months to predict Alzheimer's disease [9].
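In PyTorch, for example, bidirectionality is a single flag; the sketch below uses illustrative sizes.

```python
import torch.nn as nn

# A bidirectional LSTM: the same sequence is processed forwards and
# backwards, and the two directions' hidden states are concatenated,
# so the per-step output size doubles to 2 * hidden_size.
bilstm = nn.LSTM(input_size=200, hidden_size=128,
                 num_layers=1, batch_first=True, bidirectional=True)
# out, _ = bilstm(x)   # x: (batch, time, 200) -> out: (batch, time, 256)
```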
Attention has been used in addition to an LSTM network in
[10]. It works by extracting the hidden states from an LSTM
network and training an additional attention layer to compute
an attention score $\alpha_t$ that weights the inputs. These
attention weights can later be extracted during inference to
understand which parts of the input were given higher weight
during classification.
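A generic sketch of such an attention layer is given below; it is one common formulation, not necessarily the exact layer of [10].

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Attention over LSTM hidden states: a learned layer scores each
    time step, a softmax turns the scores into weights alpha_t, and the
    weighted sum of hidden states forms a context vector. alpha can be
    read out at inference to see which inputs the model emphasized."""
    def __init__(self, hidden=128):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, h):                                        # h: (batch, time, hidden)
        alpha = torch.softmax(self.score(h).squeeze(-1), dim=1)  # (batch, time)
        context = (alpha.unsqueeze(-1) * h).sum(dim=1)           # (batch, hidden)
        return context, alpha                                    # keep alpha for inspection
```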
2D Convolutional Neural Networks (CNNs) have been primarily
used for computer vision applications, where multiple
filters are trained to detect different input image features. 1D
CNNs have been shown to work on time series problems
such as longitudinal EHR data [11]. A 1D convolution operates over
the temporal dimension with different filter sizes, where
different filters learn different temporal patterns. This process
produces feature vectors, which are then passed through a non-linearity
such as a Rectified Linear Unit (ReLU) or tanh.
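The sketch below illustrates this pattern; the filter counts and kernel sizes are illustrative assumptions, not the configuration used in [11].

```python
import torch
import torch.nn as nn

class TemporalCNN(nn.Module):
    """1D convolutions over the temporal axis of an embedded code
    sequence; each kernel size captures temporal patterns of a
    different span, followed by a ReLU non-linearity."""
    def __init__(self, emb_dim=200, n_filters=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):                # x: (batch, time, emb_dim)
        x = x.transpose(1, 2)            # Conv1d expects (batch, channels, time)
        feats = [torch.relu(conv(x)) for conv in self.convs]
        return torch.cat(feats, dim=1)   # concatenated temporal feature maps
```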
Gradient-weighted Class Activation Mapping (GradCAM)
has been used to examine 1D CNN-based models that analyze
protein sequences, finding regions in the input sequences that
help the model make the correct prediction [12]. GradCAM
is generally used in computer vision to generate localization
maps for a given concept (class) in an input image [13]. These
maps are produced by computing the gradients of the predicted
class score with respect to the activation maps of the final
convolutional layer, pooling these gradients channel-wise to obtain
per-channel weights, and weighting the corresponding activation
channels with those weights; the resulting map can then be
inspected to find which parts of the input contributed to the
classification.
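A compact sketch of this recipe for a 1D CNN follows; it assumes the final convolutional layer's activations have been captured (e.g., via a forward hook) so that gradients can flow to them.

```python
import torch

def grad_cam_1d(activations, class_score):
    """Grad-CAM for a 1D CNN, following the recipe in [13]:
    gradients of the predicted class score w.r.t. the final conv
    layer's activation maps are pooled channel-wise into weights,
    the activations are weighted by them, and a ReLU keeps positive
    evidence. activations: (batch, channels, time); class_score: scalar."""
    grads = torch.autograd.grad(class_score, activations, retain_graph=True)[0]
    weights = grads.mean(dim=2, keepdim=True)             # channel-wise pooled gradients
    cam = torch.relu((weights * activations).sum(dim=1))  # (batch, time)
    return cam / (cam.max(dim=1, keepdim=True).values + 1e-8)  # per-sample normalization
```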
Therefore, deep learning in general has been shown to
provide value in analyzing clinical EHR data across a variety
of areas. In the following sections, we define our methodology
for such analysis on the N3C cohort.
III. METHODOLOGY
A. Dataset
The N3C data transfer to NCATS is performed under a
Johns Hopkins University Reliance Protocol # IRB00249128
or individual site agreements with NIH. The N3C repository
contains N = 14,026,265 patients, of which 5,409,269 are
COVID-19 positive [14]. COVID cases are defined as per CDC
guidance [15]. We construct our Long COVID positive cohort
from patients with an existing U09.9 code or a long COVID
clinic visit. Controls were constructed by choosing 5 random
patients from the same site and within 90 days of the long
COVID patient. In the end, we have 49,950 total patients, of
which 7,511 are Long COVID patients and 38,649 are Control
patients.
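An illustrative pandas sketch of this 1-to-5 control sampling is given below; the column names and the interpretation of the 90-day window around the index date are our assumptions, not the exact N3C pipeline.

```python
import pandas as pd

def sample_controls(cases: pd.DataFrame, candidates: pd.DataFrame,
                    n_per_case: int = 5) -> pd.DataFrame:
    """For each long COVID case, draw up to 5 random candidates from
    the same site whose index date falls within 90 days of the case's."""
    controls = []
    for _, case in cases.iterrows():
        pool = candidates[
            (candidates["site_id"] == case["site_id"])
            & ((candidates["index_date"] - case["index_date"]).abs()
               <= pd.Timedelta(days=90))
        ]
        controls.append(pool.sample(min(n_per_case, len(pool)), random_state=0))
    return pd.concat(controls, ignore_index=True)
```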
B. Data pre-processing
The N3C EHR repository contains, for all patients, all historical
medical diagnosis codes stored using the Systematized Nomenclature
of Medicine – Clinical Terms (SNOMED-CT) vocabulary.
SNOMED-CT is a clinical terminology widely
used by healthcare providers for documentation and reporting
within health systems [16]. Therefore, for each patient, we
have a list of these diagnosis codes along with the date when
the code was recorded. Since our goal is to find risk factors
that can predispose a patient to suffer from Long COVID,
we focus on all conditions, excluding the Long COVID
diagnosis itself, in the patient's diagnostic history up to 45 days
after the first COVID diagnosis or positive test, which we use
as the acute-phase cutoff. We arrange all these conditions
in an ordered list from the earliest to the latest. We also
insert only one record in the ordered list for any condition
recorded repeatedly on a single day, which can occur when a
patient has multiple tests or diagnoses on the same day.
At the end of this process, each
patient $p_i$ has an ordered list of diagnosis codes $[d_1, d_2, \ldots, d_K]$,
where $K = 1000$. We select $K = 1000$ as the upper limit on
the length of the diagnosis code list because we found that 99%
of our patients had fewer than 1,000 diagnosis codes in
their medical history. We pad all inputs shorter than 1,000
codes to uniform length using padding tokens ([PAD]). For
conditions for which we do not have prior embeddings, we
substitute [UNK] tokens.
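A minimal sketch of this truncation, unknown-token, and padding step follows; the token strings match the text above, while the function and vocabulary representation are our assumptions.

```python
PAD, UNK, K = "[PAD]", "[UNK]", 1000

def prepare_codes(codes, vocab):
    """Truncate a patient's ordered SNOMED code list to K entries,
    map codes lacking pre-trained embeddings to [UNK], and pad the
    result with [PAD] tokens to a uniform length of K."""
    codes = [c if c in vocab else UNK for c in codes[:K]]
    return codes + [PAD] * (K - len(codes))
```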
C. Pre-trained SNOMED-CT embeddings
Prior work focuses on learning embedded vector representations
that make medical concepts analyzable via mathematical
models, and subsequently on building models for analysis
[3]. To analyze the temporal patterns in an ordered list of
concept codes using deep learning, we first have to transform
them into their equivalent vector representations that also
capture semantic meaning and similarities between different
diagnoses. We used 200-dimensional SNOMED embeddings
trained using SNOMED2Vec, a graph-based representation