Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data

Lijia Wang ∗

School of Data Science, City University of Hong Kong

Y. X. Rachel Wang ∗ †

School of Mathematics and Statistics, University of Sydney

Jingyi Jessica Li

Department of Statistics, University of California, Los Angeles

Xin Tong

Department of Data Sciences and Operations, University of Southern California

October 2, 2023

Abstract

COVID-19 has a spectrum of disease severity, ranging from asymptomatic to requiring hospitalization. Understanding the mechanisms driving disease severity is crucial for developing effective treatments and reducing mortality rates. One way to gain such understanding is using a multi-class classification framework, in which patients’ biological features are used to predict patients’ severity classes. In this severity classification problem, it is beneficial to prioritize the identification of more severe classes and control the “under-classification” errors, in which patients are misclassified into less severe categories. The Neyman-Pearson (NP) classification paradigm has been developed to prioritize the designated type of error. However, current NP procedures are either for binary classification or do not provide high probability controls on the prioritized errors in multi-class classification. Here, we propose a hierarchical NP (H-NP) framework and an umbrella algorithm that generally adapts to popular classification methods and controls the under-classification errors with high probability. On an integrated collection of single-cell RNA-seq (scRNA-seq) datasets for 864 patients, we explore ways of featurization and demonstrate the efficacy of the H-NP algorithm in controlling the under-classification errors regardless of featurization. Beyond COVID-19 severity classification, the H-NP algorithm generally applies to multi-class classification problems, where classes have a priority order.

∗Equal contribution

†Correspondence should be addressed to Y.X. Rachel Wang (rachel.wang@sydney.edu.au)

arXiv:2210.02197v2 [cs.LG] 29 Sep 2023

1 Introduction

The COVID-19 pandemic has infected over 767 million people and caused 6.94 million deaths (27 June 2023) [World Health Organization, 2023], prompting collective efforts from statistics and other communities to address data-driven challenges. Many statistical works have modeled epidemic dynamics [Betensky and Feng, 2020, Quick et al., 2021], forecasted the case growth rates and outbreak locations [Brooks et al., 2020, Tang et al., 2021, McDonald et al., 2021], and analyzed and predicted the mortality rates [James et al., 2021, Kramlinger et al., 2022]. Classification problems, such as diagnosis (positive/negative) [Wu et al., 2020, Li et al., 2020, Zhang et al., 2021] and severity prediction [Yan et al., 2020, Sun et al., 2020, Zhao et al., 2020, Ortiz et al., 2022], have been tackled by machine learning approaches (e.g., logistic regression, support vector machine (SVM), random forest, boosting, and neural networks; see Alballa and Al-Turaiki [2021] for a review).

In the existing COVID-19 classification works, the commonly used data types are CT images, routine blood tests, and other clinical data including age, blood pressure and medical history [Meraihi et al., 2022]. In comparison, multiomics data are harder to acquire but can provide better insights into the molecular features driving patient responses [Overmyer et al., 2021]. Recently, the increasing availability of single-cell RNA-seq (scRNA-seq) data offers the opportunity to understand transcriptional responses to COVID-19 severity at the cellular level [Wilk et al., 2020, Stephenson et al., 2021, Ren et al., 2021].

More generally, genome-wide gene expression measurements have been routinely used in classification settings to characterize and distinguish disease subtypes, at both the bulk-sample level [Aibar et al., 2015] and, more recently, the single-cell level [Arvaniti and Claassen, 2017, Hu et al., 2019]. While such genome-wide data can be costly, they provide a comprehensive view of the transcriptome and can unveil significant gene expression patterns for diseases with complex pathophysiology, where multiple genes and pathways are involved. Furthermore, as the patient-level measurements continue to grow in dimension and complexity (e.g., from a single bulk sample to thousands-to-millions of cells per patient), a supervised learning setting enables us to better establish the connection between patient-level features and their associated disease states, paving the way towards personalized treatment.

In this study, we focus on patient severity classification using an integrated collection of multi-patient scRNA-seq datasets. Based on the WHO guidelines [World Health Organization, 2020], COVID-19 patients have at least three severity categories: healthy, mild/moderate, and severe. The classical classification paradigm aims at minimizing the overall classification error. However, prioritizing the identification of more severe patients may provide important insights into the biological mechanisms underlying disease progression and severity, and facilitate the discovery of potential biomarkers for clinical diagnosis and therapeutic intervention. Consequently, it is important to prioritize the control of “under-classification” errors, in which patients are misclassified into less severe categories.

Motivated by the gap in existing classification algorithms for severity classification (Section 1.1), we propose a hierarchical Neyman-Pearson (H-NP) classification framework that prioritizes the under-classification error control in the following sense. Suppose there are I classes with class labels [I] = {1, 2, ..., I} ordered in decreasing severity. For i ∈ [I − 1], the i-th under-classification error is the probability of misclassifying an individual in class i into any class j with j > i. We develop an H-NP umbrella algorithm that controls the i-th under-classification error below a user-specified level α_i ∈ (0, 1) with high probability while minimizing a weighted sum of the remaining classification errors. Similar in spirit to the NP umbrella algorithm for binary classification in Tong et al. [2018], the H-NP umbrella algorithm adapts to popular scoring-type multi-class classification methods (e.g., logistic regression, random forest, and SVM). To our knowledge, the algorithm is the first to achieve asymmetric error control with high probability in multi-class classification.
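
The under-classification errors defined above can be estimated on labeled data. The following Python sketch is illustrative only: the function name `under_classification_errors` and the toy labels are assumptions made for this example, not part of the paper. The empirical rates it computes are sample estimates, not the population errors that the H-NP algorithm controls with high probability.

```python
from typing import Dict, Sequence


def under_classification_errors(
    y_true: Sequence[int], y_pred: Sequence[int], num_classes: int
) -> Dict[int, float]:
    """Empirical i-th under-classification error for i = 1, ..., I - 1.

    Classes are labeled 1..I in decreasing severity, so predicting any
    j > i for a class-i individual places them in a less severe category.
    (Hypothetical helper, written for illustration.)
    """
    errors: Dict[int, float] = {}
    for i in range(1, num_classes):
        # Predictions made for individuals whose true class is i.
        preds_i = [p for t, p in zip(y_true, y_pred) if t == i]
        if not preds_i:
            errors[i] = float("nan")  # class i is absent from the sample
            continue
        # Fraction of class-i individuals assigned to a less severe class.
        errors[i] = sum(p > i for p in preds_i) / len(preds_i)
    return errors


# Toy labels (made up): 1 = severe, 2 = mild/moderate, 3 = healthy.
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3]
y_pred = [1, 2, 3, 1, 2, 3, 1, 3, 2]
errs = under_classification_errors(y_true, y_pred, num_classes=3)
print(errs)
```

Note that the least severe class I has no under-classification error, and that predicting a more severe class than the truth (for example, 1 for a class-2 patient) is not counted here; such errors belong to the remaining classification errors.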

Another contribution of this study is the exploration of appropriate ways to featurize multi-patient scRNA-seq data. Following the workflow in Lin et al. [2022a], we integrate 20 publicly available scRNA-seq datasets to form a sample of 864 patients with three levels of severity. For each patient, scRNA-seq data were collected from peripheral blood mononuclear cells (PBMCs) and processed into a sparse expression matrix, which consists of tens of thousands of genes in rows and thousands of cells in columns. We propose four ways of extracting a feature vector from each of these 864 matrices. Then we evaluate the performance of each featurization method in combination with multiple classification methods under both the classical and H-NP classification paradigms. We note that our H-NP umbrella algorithm is applicable to other featurizations of scRNA-seq data, other forms of patient data, and more general disease classification problems with a severity ordering.

Below we review the NP paradigm and featurization of multi-patient scRNA-seq data as the background of our work.

1.1 Neyman-Pearson paradigm and multi-class classification

Classical binary classification focuses on minimizing the overall classification error, i.e., a weighted sum of type I and II errors, where the weights are the marginal probabilities of the two classes. However, the class priorities are not reflected by the class weights in many applications, especially disease severity classification, where the severe class is the minor class and has a smaller weight (e.g., HIV [Meyer and Pauker, 1987] and cancer [Dettling and Bühlmann, 2003]). One class of methods that addresses this error asymmetry is cost-sensitive learning [Elkan, 2001, Margineantu, 2002], which assigns different costs to type I and type II errors. However, such weights may not be easy to choose in practice, especially in a multi-class setting; nor do these methods provide high probability controls on the prioritized errors. The NP classification paradigm [Cannon et al., 2002, Scott and Nowak, 2005, Rigollet and Tong, 2011] was developed as an alternative framework to enforce class priorities: it finds a classifier that controls the population type I error (the prioritized error, e.g., misclassifying diseased patients as healthy) under a user-specified level α while minimizing the type II error (the error with less priority, e.g., misdiagnosing healthy people as sick). Practically, using an order statistics approach, Tong et al. [2018] proposed an NP umbrella algorithm that adapts all scoring-type classification methods (e.g., logistic regression) to the NP paradigm for classifier construction. The resulting classifier has the population type I error under α with high probability. Besides disease severity classification, the NP classification paradigm has found diverse applications, including social media text classification [Xia et al., 2021] and crisis risk control [Feng et al., 2021]. Nevertheless, the original NP paradigm is for binary classification only.
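
To illustrate the order statistics idea, the sketch below picks a score threshold from held-out scores of the prioritized class. It assumes the binomial-tail bound on the violation probability that is associated with the NP umbrella algorithm of Tong et al. [2018]. The function names, the tolerance name `delta`, and the convention that class 0 is the prioritized class with larger scores favoring class 1 are all assumptions made for this example, not the authors' implementation.

```python
from math import comb
from typing import Sequence, Tuple


def violation_bound(k: int, n: int, alpha: float) -> float:
    """Assumed upper bound on P(type I error > alpha) when the threshold is
    the k-th smallest of n held-out class-0 scores:
        sum_{j=k}^{n} C(n, j) * (1 - alpha)^j * alpha^(n - j).
    Suitable for moderate n only (very large n can overflow float conversion).
    """
    return sum(
        comb(n, j) * (1 - alpha) ** j * alpha ** (n - j) for j in range(k, n + 1)
    )


def np_threshold(
    class0_scores: Sequence[float], alpha: float, delta: float
) -> Tuple[int, float]:
    """Return (k, threshold), where k is the smallest rank whose bound is
    at most delta. The classifier then predicts class 1 when score > threshold.
    """
    scores = sorted(class0_scores)
    n = len(scores)
    for k in range(1, n + 1):
        if violation_bound(k, n, alpha) <= delta:
            return k, scores[k - 1]
    # Even the largest score fails, i.e. (1 - alpha)^n > delta.
    raise ValueError("too few class-0 scores for the requested alpha and delta")


# Toy held-out class-0 scores (made up): 0.00, 0.01, ..., 0.99.
scores = [i / 100 for i in range(100)]
k, threshold = np_threshold(scores, alpha=0.05, delta=0.05)
print(k, threshold)
```

Choosing the smallest admissible rank keeps the threshold as low as the bound allows, which limits the type II error among the thresholds that satisfy the high probability constraint. The `ValueError` branch reflects a minimum sample size requirement: with too few held-out scores, no order statistic meets the tolerance.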

Although several works aimed to control prioritized errors in multi-class classification [Landgrebe and Duin, 2005, Xiong et al., 2006, Tian and Feng, 2021], they did not provide high probability control. That is, if they are applied to severe disease classification, there is a non-trivial chance that their under-classification errors exceed the desired levels.

1.2 ScRNA-seq data featurization

In multi-patient scRNA-seq data, every patient has a gene-by-cell expression matrix; genes are matched across patients, but cells are not. For learning tasks with patients as instances, featurization is a necessary step to ensure that all patients have features in the same space. A common featurization approach is to assign every patient’s cells into cell types, which are comparable across patients, by clustering [Stanley et al., 2020, Ganio et al., 2020] and/or manual annotation [Han et al., 2019]. Then, each patient’s gene-by-cell expression matrix
