Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data

Lijia Wang ∗
School of Data Science, City University of Hong Kong
Y. X. Rachel Wang ∗ †
School of Mathematics and Statistics, University of Sydney
Jingyi Jessica Li
Department of Statistics, University of California, Los Angeles
Xin Tong
Department of Data Sciences and Operations, University of Southern California
October 2, 2023
Abstract
COVID-19 has a spectrum of disease severity, ranging from asymptomatic to re-
quiring hospitalization. Understanding the mechanisms driving disease severity is
crucial for developing effective treatments and reducing mortality rates. One way to
gain such understanding is using a multi-class classification framework, in which pa-
tients’ biological features are used to predict patients’ severity classes. In this severity
classification problem, it is beneficial to prioritize the identification of more severe
classes and control the “under-classification” errors, in which patients are misclassi-
fied into less severe categories. The Neyman-Pearson (NP) classification paradigm has
been developed to prioritize the designated type of error. However, current NP pro-
cedures are either for binary classification or do not provide high probability controls
on the prioritized errors in multi-class classification. Here, we propose a hierarchical
NP (H-NP) framework and an umbrella algorithm that generally adapts to popular
classification methods and controls the under-classification errors with high proba-
bility. On an integrated collection of single-cell RNA-seq (scRNA-seq) datasets for
864 patients, we explore ways of featurization and demonstrate the efficacy of the
H-NP algorithm in controlling the under-classification errors regardless of featuriza-
tion. Beyond COVID-19 severity classification, the H-NP algorithm generally applies
to multi-class classification problems, where classes have a priority order.
∗ Equal contribution
† Correspondence should be addressed to Y. X. Rachel Wang (rachel.wang@sydney.edu.au)
arXiv:2210.02197v2 [cs.LG] 29 Sep 2023
1 Introduction
The COVID-19 pandemic has infected over 767 million people and caused 6.94 million
deaths (27 June 2023) [World Health Organization, 2023], prompting collective efforts from
statistics and other communities to address data-driven challenges. Many statistical works
have modeled epidemic dynamics [Betensky and Feng, 2020, Quick et al., 2021], forecasted
the case growth rates and outbreak locations [Brooks et al., 2020, Tang et al., 2021, Mc-
Donald et al., 2021], and analyzed and predicted the mortality rates [James et al., 2021,
Kramlinger et al., 2022]. Classification problems, such as diagnosis (positive/negative) [Wu
et al., 2020, Li et al., 2020, Zhang et al., 2021] and severity prediction [Yan et al., 2020, Sun
et al., 2020, Zhao et al., 2020, Ortiz et al., 2022], have been tackled by machine learning ap-
proaches (e.g., logistic regression, support vector machine (SVM), random forest, boosting,
and neural networks; see Alballa and Al-Turaiki [2021] for a review).
In the existing COVID-19 classification works, the commonly used data types are CT
images, routine blood tests, and other clinical data including age, blood pressure and medi-
cal history [Meraihi et al., 2022]. In comparison, multiomics data are harder to acquire but
can provide better insights into the molecular features driving patient responses [Overmyer
et al., 2021]. Recently, the increasing availability of single-cell RNA-seq (scRNA-seq) data
offers the opportunity to understand transcriptional responses to COVID-19 severity at the
cellular level [Wilk et al., 2020, Stephenson et al., 2021, Ren et al., 2021].
More generally, genome-wide gene expression measurements have been routinely used in
classification settings to characterize and distinguish disease subtypes, both in bulk-sample
[Aibar et al., 2015] and, more recently, single-cell level [Arvaniti and Claassen, 2017, Hu
et al., 2019]. While such genome-wide data can be costly, they provide a comprehensive view
of the transcriptome and can unveil significant gene expression patterns for diseases with
complex pathophysiology, where multiple genes and pathways are involved. Furthermore,
as the patient-level measurements continue to grow in dimension and complexity (e.g., from
a single bulk sample to thousands-to-millions of cells per patient), a supervised learning
setting enables us to better establish the connection between patient-level features and
their associated disease states, paving the way towards personalized treatment.
In this study, we focus on patient severity classification using an integrated collec-
tion of multi-patient scRNA-seq datasets. Based on the WHO guidelines [World Health
Organization, 2020], COVID-19 patients have at least three severity categories: healthy,
mild/moderate, and severe. The classical classification paradigm aims at minimizing the
overall classification error. However, prioritizing the identification of more severe patients
may provide important insights into the biological mechanisms underlying disease progres-
sion and severity, and facilitate the discovery of potential biomarkers for clinical diagnosis
and therapeutic intervention. Consequently, it is important to prioritize the control of
“under-classification” errors, in which patients are misclassified into less severe categories.
Motivated by the gap in existing classification algorithms for severity classification (Sec-
tion 1.1), we propose a hierarchical Neyman-Pearson (H-NP) classification framework that
prioritizes the under-classification error control in the following sense. Suppose there are $I$
classes with class labels $[I] = \{1, 2, \ldots, I\}$ ordered in decreasing severity. For $i \in [I-1]$,
the $i$-th under-classification error is the probability of misclassifying an individual in class
$i$ into any class $j$ with $j > i$. We develop an H-NP umbrella algorithm that controls the
$i$-th under-classification error below a user-specified level $\alpha_i \in (0,1)$ with high probability
while minimizing a weighted sum of the remaining classification errors. Similar in spirit
to the NP umbrella algorithm for binary classification in Tong et al. [2018], the H-NP
umbrella algorithm adapts to popular scoring-type multi-class classification methods (e.g.,
logistic regression, random forest, and SVM). To our knowledge, the algorithm is the first
to achieve asymmetric error control with high probability in multi-class classification.
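
The definition of the under-classification errors can be illustrated empirically. The following is a minimal sketch, not the H-NP algorithm itself: it only computes, from true and predicted labels, the fraction of class-i individuals assigned to a less severe class j > i. The function name `under_classification_errors` and the toy labels are hypothetical and chosen for illustration; classes are coded 1, ..., I in decreasing severity, as in the text.

```python
def under_classification_errors(y_true, y_pred, num_classes):
    """Empirical i-th under-classification error for i = 1, ..., I-1.

    Classes are coded 1..I in decreasing severity, so predicting a
    label larger than the true label is an under-classification.
    """
    errors = {}
    for i in range(1, num_classes):
        preds_for_class_i = [p for t, p in zip(y_true, y_pred) if t == i]
        if not preds_for_class_i:
            errors[i] = float("nan")  # class i absent from the sample
            continue
        n_under = sum(p > i for p in preds_for_class_i)
        errors[i] = n_under / len(preds_for_class_i)
    return errors


# Toy example with I = 3 (1 = severe, 2 = mild/moderate, 3 = healthy).
y_true = [1, 1, 1, 1, 2, 2, 3]
y_pred = [1, 2, 3, 1, 2, 3, 1]
errors = under_classification_errors(y_true, y_pred, num_classes=3)
print(errors)  # {1: 0.5, 2: 0.5}
```

The last individual (true class 3, predicted class 1) is classified as more severe than the truth, so it does not count toward any under-classification error.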
Another contribution of this study is the exploration of appropriate ways to featurize
multi-patient scRNA-seq data. Following the workflow in Lin et al. [2022a], we integrate
20 publicly available scRNA-seq datasets to form a sample of 864 patients with three
levels of severity. For each patient, scRNA-seq data were collected from peripheral blood
mononuclear cells (PBMCs) and processed into a sparse expression matrix, which consists
of tens of thousands of genes in rows and thousands of cells in columns. We propose four
ways of extracting a feature vector from each of these 864 matrices. Then we evaluate the
performance of each featurization approach in combination with multiple classification methods
under both the classical and H-NP classification paradigms. We note that our H-NP
umbrella algorithm is applicable to other featurizations of scRNA-seq data, other forms of
patient data, and more general disease classification problems with a severity ordering.
Below we review the NP paradigm and featurization of multi-patient scRNA-seq data
as the background of our work.
1.1 Neyman-Pearson paradigm and multi-class classification
Classical binary classification focuses on minimizing the overall classification error, i.e., a
weighted sum of type I and II errors, where the weights are the marginal probabilities of
the two classes. However, the class priorities are not reflected by the class weights in many
applications, especially disease severity classification, where the severe class is the minor
class and has a smaller weight (e.g., HIV [Meyer and Pauker, 1987] and cancer [Dettling
and Bühlmann, 2003]). One class of methods that addresses this error asymmetry is
cost-sensitive learning [Elkan, 2001, Margineantu, 2002], which assigns different costs to type I
and type II errors. However, such weights may not be easy to choose in practice, especially
in a multi-class setting; nor do these methods provide high probability controls on the
prioritized errors. The NP classification paradigm [Cannon et al., 2002, Scott and Nowak,
2005, Rigollet and Tong, 2011] was developed as an alternative framework to enforce class
priorities: it finds a classifier that controls the population type I error (the prioritized
error, e.g., misclassifying diseased patients as healthy) under a user-specified level $\alpha$ while
minimizing the type II error (the error with less priority, e.g., misdiagnosing healthy people
as sick). Practically, using an order statistics approach, Tong et al. [2018] proposed an
NP umbrella algorithm that adapts all scoring-type classification methods (e.g., logistic
regression) to the NP paradigm for classifier construction. The resulting classifier has the
population type I error under $\alpha$ with high probability. Besides disease severity classification,
the NP classification paradigm has found diverse applications, including social media text
classification [Xia et al., 2021] and crisis risk control [Feng et al., 2021]. Nevertheless, the
original NP paradigm is for binary classification only.
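
As a rough illustration of how an order-statistics rule can give high-probability control in the binary case, the sketch below picks a score threshold from held-out scores of the prioritized class (coded here as class 0). It assumes those scores are i.i.d. with a continuous distribution and that an observation is assigned to class 1 when its score exceeds the threshold. Under these assumptions, using the k-th smallest of n held-out scores as the threshold gives a type I error that exceeds $\alpha$ with probability at most $\sum_{j=k}^{n} \binom{n}{j} (1-\alpha)^j \alpha^{n-j}$. The tolerance `delta` and the names `violation_bound` and `np_threshold` are assumptions made for this example, not notation from the text above.

```python
from math import comb


def violation_bound(k, n, alpha):
    """Upper bound on P(type I error > alpha) when thresholding at the
    k-th smallest of n i.i.d. continuous class-0 scores."""
    return sum(
        comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
        for j in range(k, n + 1)
    )


def np_threshold(class0_scores, alpha, delta):
    """Return (k, threshold) for the smallest order k whose violation
    bound is at most delta, or None if no order qualifies."""
    scores = sorted(class0_scores)
    n = len(scores)
    for k in range(1, n + 1):
        # The bound decreases in k, so the first qualifying k is the
        # least conservative admissible threshold.
        if violation_bound(k, n, alpha) <= delta:
            return k, scores[k - 1]
    return None  # too few held-out class-0 scores for this (alpha, delta)
```

With `alpha = delta = 0.05`, this rule needs at least 59 held-out class-0 scores: `np_threshold(list(range(58)), 0.05, 0.05)` returns `None`, while 59 scores yield the largest score as the threshold. Choosing the smallest admissible k keeps the threshold as low as the bound allows, which limits the loss in type II error.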
Although several works aimed to control prioritized errors in multi-class classification
[Landgrebe and Duin, 2005, Xiong et al., 2006, Tian and Feng, 2021], they did not provide
high probability control. That is, if they are applied to severe disease classification, there
is a non-trivial chance that their under-classification errors exceed the desired levels.
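
To make the notion of high-probability control concrete, one way to write the requirement is sketched below. The symbols $R_{i\star}$ (the $i$-th under-classification error of a classifier $\phi$), $\hat{\phi}$ (a classifier constructed from a random training sample), and $\delta_i$ (a user-chosen tolerance) are notation assumed here for illustration; the outer probability is taken over the randomness of the training sample.

```latex
% R_{i\star}, \hat{\phi}, and \delta_i are notation assumed for this sketch.
\[
  R_{i\star}(\phi) = \mathbb{P}\bigl(\phi(X) \in \{i+1, \ldots, I\} \mid Y = i\bigr),
  \qquad
  \mathbb{P}\bigl(R_{i\star}(\hat{\phi}) > \alpha_i\bigr) \le \delta_i,
  \quad i \in [I-1].
\]
```

A procedure without such a guarantee may produce a classifier whose under-classification error exceeds $\alpha_i$ on a non-negligible fraction of training samples.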
1.2 ScRNA-seq data featurization
In multi-patient scRNA-seq data, every patient has a gene-by-cell expression matrix; genes
are matched across patients, but cells are not. For learning tasks with patients as instances,
featurization is a necessary step to ensure that all patients have features in the same space.
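
The need for featurization can be seen in a toy sketch. Two patients share the same genes (rows) but have different numbers of cells (columns), so their matrices cannot be compared entry by entry. Averaging each gene over cells is used here purely as an illustrative assumption, not as one of the featurizations studied in this work; the helper `gene_mean_features` and the toy matrices are hypothetical.

```python
def gene_mean_features(expr):
    """Map a gene-by-cell matrix (list of gene rows) to one value per gene."""
    return [sum(row) / len(row) for row in expr]


# Rows are genes (matched across patients); columns are cells (not matched).
patient_a = [[0.0, 2.0, 4.0], [1.0, 1.0, 1.0]]  # 2 genes x 3 cells
patient_b = [[3.0, 5.0], [0.0, 2.0]]            # 2 genes x 2 cells

features = [gene_mean_features(p) for p in (patient_a, patient_b)]
print(features)  # [[2.0, 1.0], [4.0, 1.0]]
```

Both patients now have a feature vector whose length equals the number of genes, regardless of how many cells were sequenced.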
A common featurization approach is to assign every patient’s cells into cell types, which are
comparable across patients, by clustering [Stanley et al., 2020, Ganio et al., 2020] and/or
manual annotation [Han et al., 2019]. Then, each patient’s gene-by-cell expression matrix