Dimension reduction of high-dimension categorical
data with two or multiple responses considering
interactions between responses
Yuehan Yang
School of Statistics and Mathematics, Central University of Finance and Economics,
Beijing, 102206, China
Abstract
This paper models categorical data with two or multiple responses, focusing
on the interactions between responses. We propose an efficient iterative procedure
based on sufficient dimension reduction. We study the theoretical guarantees of
the proposed method under the two- and multiple-response models, demonstrating
the uniqueness of the proposed estimator and showing that, with high probability, the
proposed method recovers the oracle least squares estimators. In data analysis, we
demonstrate that the proposed method is efficient in the multiple-response model
and performs better than some existing methods built for multiple-response
models. We apply this modeling and the proposed method to an adult dataset and
a right heart catheterization dataset and obtain meaningful results.
Keywords: Categorical data, High-dimensional regression, Nonconvex penalty, Sufficient
dimension reduction
arXiv:2210.11811v1 [stat.ME] 21 Oct 2022
1 Introduction
Categorical data with multiple responses appear in many applications. For example, in
a right heart catheterization dataset, some researchers focused on patients who had been
diagnosed with sepsis in multiple organs (Luo et al., 2022). Meanwhile, patient health
data contain records of diagnoses within controlled vocabularies, as well as prescriptions,
thus generating categorical features with large numbers of levels (Stokell et al., 2021; Jensen
et al., 2012). Models that consider a univariate response are frequently studied. Related
research includes conducting significance tests (Tukey, 1949; Calinski and Corsten, 1985)
and fusing the levels of categories in the linear regression setting (Breiman et al.,
2017; Bondell and Reich, 2009; Post and Bondell, 2013; Pauger and Wagner, 2019).
Regarding a univariate response, a valuable technique is to consider an analysis of
variance (ANOVA) model that relates the response to categorical predictors. Based on
this model, Bondell and Reich (2009) proposed the collapsing and shrinkage in ANOVA
(CAS-ANOVA) penalty, which balances the effects of categories in which certain levels
are more prevalent than others. Similarly, Pauger and Wagner (2019) proposed a
Bayesian approach to encourage the fusion of levels. Further, Stokell et al. (2021) proposed the
SCOPE estimator, which fuses the levels of several categories by equating the corresponding
coefficients. Other studies considered a tree-based model for hierarchical categorical
predictors (Carrizosa et al., 2022), the clustering of the categories of categorical predictors in
generalized linear models (Carrizosa et al., 2021), and a systematic overview of penalty-based
methods for categorical data (Tutz and Gertheiss, 2016).
However, in certain applications, the data comprise two or multiple responses.
For example, Little and Rubin (2019) described missing data in which one response is
a partially missing variable while the other is the missing indicator of the former. An
efficient strategy for studying multiple responses is to introduce sufficient dimension
reduction. With two responses, Ding et al. (2020) studied dimension reduction in
survival analysis, where one response is of interest while the other is a nuisance variable.
Furthermore, Luo et al. (2022) studied dimension reduction regarding the interaction
between two responses. De Luna et al. (2011) proposed an iterative two-step procedure
to derive a minimal balancing score that connects with local dimension reduction
efficiency in causal inference.
Other dimension reduction methods for different goals have also been studied in the
literature, e.g., missing data analysis (Guo et al., 2018) and causal inference (Ma et al.,
2019; Luo and Zhu, 2020). However, none of the foregoing studies considered categorical
data analysis.
Although categorical data have drawn enormous attention, there is still a lack of studies on
the multiple-response model, including efficient algorithms and reliable theoretical
guarantees. Thus, the goal of this paper is to construct a model and develop a method
for estimating high-dimensional linear models with categorical data considering the
interactions between the responses. The first contribution of this paper is to establish the
statistical modeling of categorical data with two or multiple responses. Although
many studies have analyzed categorical data with the univariate-response model, it is
unclear whether these analyses hold for data comprising two or multiple responses;
moreover, it is also unclear how to study the multiple-response model using techniques developed
for the univariate-response model. To fill this gap, we establish the dimension reduction
theory for categorical data by applying the sufficient dimension reduction technique and
the locally efficient dimension reduction subspace. We construct a model that considers
the interactions between response variables, which is a new research problem with
wide applications in categorical data analysis.
The second contribution of this paper is to propose an efficient iterative procedure for
analyzing categorical data via a multiple-response model. The proposed method extends
the algorithm of De Luna et al. (2011) to handle multiple responses with categorical
covariates. We show that the resulting estimator coincides with the least squares solution
in the multiple-response case. Most relevant to our theoretical results, Stokell et al. (2021)
established theoretical results for their estimator, which fuses the category levels
but only for a univariate response. Additionally, we show that the proposed procedure is
efficient in simulations and applications. We apply the procedure to two real datasets, the adult
dataset and the right heart catheterization dataset, and both data analyses demonstrate the
effectiveness of the model and method.
2 Models and methods
2.1 Notation and models
We first introduce the notation, as well as the two-response model. Consider an ANOVA
model relating two responses and categorical predictors. Let $y_1$ and $y_2$ be the two response
variables, and let $X = (X_1, \ldots, X_p)$ be the categorical predictors, where $X_j = (x_{1j}, \ldots, x_{nj})$
and $x_{ij} \in \{1, \ldots, K_j\}$ for $j = 1, \ldots, p$. Further, we set the coefficient parameters,
$(\mu_1, \theta_1)$ and $(\mu_2, \theta_2)$, corresponding to the two responses respectively. Therein, $\mu_1$ and $\mu_2$
are intercepts and $\theta_1, \theta_2 \in \mathbb{R}^{K_1} \times \cdots \times \mathbb{R}^{K_p}$, where $\theta_{1j} := (\theta_{1jk})_{k=1}^{K_j} \in \mathbb{R}^{K_j}$ and
$\theta_{2j} := (\theta_{2jk})_{k=1}^{K_j} \in \mathbb{R}^{K_j}$. Here, $\theta_{1jk}$ and $\theta_{2jk}$ are the coefficients of the $k$th level
of the $j$th predictor for responses $y_1$ and $y_2$, respectively. Consider the following ANOVA models that relate the
two responses and categorical predictors:
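$$y_1 = \mu_1 + \sum_{j=1}^{p} \sum_{k=1}^{K_j} \theta_{1jk} 1_{\{x_j = k\}} + \varepsilon_1,$$
$$y_2 = \mu_2 + \sum_{j=1}^{p} \sum_{k=1}^{K_j} \theta_{2jk} 1_{\{x_j = k\}} + \varepsilon_2,$$
where $\varepsilon_1 = (\varepsilon_{11}, \ldots, \varepsilon_{1n})$ and $\varepsilon_2 = (\varepsilon_{21}, \ldots, \varepsilon_{2n})$ are independent zero-mean random
errors. In this model, the target is to properly estimate $(\mu_1, \theta_1)$ and $(\mu_2, \theta_2)$ regarding the
interaction between the two responses.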
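As a rough illustration of the two-response ANOVA model above, the following Python sketch builds the level-indicator blocks for each categorical predictor and simulates both responses. The function names, the sample size, the numbers of levels, the coefficient values and the Gaussian errors are all assumptions chosen for illustration; none of them come from the paper.

```python
import numpy as np


def indicator_design(X, levels):
    """Stack the indicator blocks (1{x_j = k}), k = 1..K_j, column-wise.

    X      : (n, p) integer array with X[i, j] in {1, ..., levels[j]}
    levels : sequence of K_j, one entry per predictor
    """
    blocks = []
    for j, K in enumerate(levels):
        # Column k-1 of this block is the indicator 1{x_j = k}.
        blocks.append((X[:, [j]] == np.arange(1, K + 1)).astype(float))
    return np.hstack(blocks)


def simulate_two_responses(n=200, levels=(3, 4), seed=0):
    """Draw categorical predictors and two responses from the ANOVA model.

    All numeric settings here are hypothetical and only for illustration.
    """
    rng = np.random.default_rng(seed)
    # Each predictor takes values in {1, ..., K_j}.
    X = np.column_stack([rng.integers(1, K + 1, size=n) for K in levels])
    R = indicator_design(X, levels)
    d = R.shape[1]  # d = K_1 + ... + K_p
    mu1, mu2 = 1.0, -0.5
    theta1 = rng.normal(size=d)
    theta2 = rng.normal(size=d)
    # Independent zero-mean errors for the two responses.
    y1 = mu1 + R @ theta1 + rng.normal(scale=0.5, size=n)
    y2 = mu2 + R @ theta2 + rng.normal(scale=0.5, size=n)
    return X, R, y1, y2


X, R, y1, y2 = simulate_two_responses()
```

Each row of `R` contains exactly one nonzero entry per predictor, so the double sum over predictors and levels in the model reduces to a single matrix-vector product.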
We introduce the dimension reduction technique into the above model. Set
$R = (R_1, \ldots, R_p) \in \mathbb{R}^{K_1} \times \cdots \times \mathbb{R}^{K_p}$ such that $R_j := (1_{\{x_j = k\}})_{k=1}^{K_j} \in \mathbb{R}^{K_j}$. Sufficient
dimension reduction assumes the existence of a low-dimensional linear combination of $R$.
Namely, we obtain $y_1 \perp\!\!\!\perp R \mid \theta_1^{\top} R$, where $\perp\!\!\!\perp$ denotes independence.
Consider the other response, $y_2$, which interacts with $y_1$. Specifically, for
$(y_1, y_2)$, we consider the situation in which they interact only through the covariates,
i.e., $y_1 \perp\!\!\!\perp y_2 \mid X$. Combined with $y_1 \perp\!\!\!\perp X \mid \theta_1^{\top} R$, we deduce a brief assumption under
sufficient dimension reduction,
$$y_1 \perp\!\!\!\perp y_2 \mid \theta^{\top} R. \qquad (1)$$
Similarly, for $y_2$, we have $y_2 \perp\!\!\!\perp X \mid \theta_2^{\top} R$. Thus, $\theta_1^{\top} R$ and $\theta_2^{\top} R$ contain the information of $y_1$
and $y_2$ respectively. In particular, we allow each linear combination to contain redundant
information only from its own response, $y_1$ or $y_2$. The interaction between $y_1$ and $y_2$ is flexible,
and this is convenient for the subsequent modeling. We illustrate this theoretically in
the next section by showing that assumption (1) holds, and a low-dimensional $\theta^{\top} R$ exists
for sufficient dimension reduction, when $\theta_1^{\top} R \perp\!\!\!\perp \theta_2^{\top} R \mid \theta^{\top} R$.
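The idea that two responses interact only through the covariates can be checked numerically in a toy setting. The sketch below is an assumption-laden illustration, not the paper's procedure: it uses a single categorical predictor with four levels, hypothetical coefficient vectors, Gaussian errors, and correlation as a simple proxy for dependence. The two responses are clearly correlated marginally, yet after centering within each level of the predictor (that is, conditioning on it) the correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 5000, 4

# One categorical predictor with levels 1..K and its indicator vector.
x = rng.integers(1, K + 1, size=n)
R = (x[:, None] == np.arange(1, K + 1)).astype(float)

# Hypothetical coefficients; both responses depend on the same predictor.
theta1 = np.array([-2.0, -1.0, 1.0, 2.0])
theta2 = np.array([-1.5, -0.5, 0.5, 1.5])
y1 = R @ theta1 + rng.normal(size=n)  # independent errors for y1
y2 = R @ theta2 + rng.normal(size=n)  # independent errors for y2


def within_level_residual(y, x, K):
    """Subtract the mean of y within each level of x."""
    res = y.copy()
    for k in range(1, K + 1):
        mask = x == k
        res[mask] -= y[mask].mean()
    return res


# Marginal dependence induced by the shared predictor.
marginal_corr = np.corrcoef(y1, y2)[0, 1]

# Dependence left after conditioning on the predictor.
conditional_corr = np.corrcoef(
    within_level_residual(y1, x, K),
    within_level_residual(y2, x, K),
)[0, 1]

print(f"marginal corr: {marginal_corr:.3f}, conditional corr: {conditional_corr:.3f}")
```

Zero correlation is weaker than independence in general; it suffices here only because the simulated errors are Gaussian and independent by construction.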
Extending the aforementioned two-response model, we introduce the multiple-response
model. For the responses $(y_1, \ldots, y_q)$, we set the coefficient parameters $(\mu_l, \theta_l)$, where
$\mu_1, \ldots, \mu_q$ are intercepts and $\theta_1, \ldots, \theta_q \in \mathbb{R}^{K_1} \times \cdots \times \mathbb{R}^{K_p}$, with $\theta_{lj} := (\theta_{ljk})_{k=1}^{K_j} \in \mathbb{R}^{K_j}$.
Here, $\theta_{ljk}$ denotes the coefficient of the $k$th level of the $j$th predictor for the response $y_l$. Then,
we construct the following models for $l = 1, \ldots, q$:
$$y_l = \mu_l + \sum_{j=1}^{p} \sum_{k=1}^{K_j} \theta_{ljk} 1_{\{x_j = k\}} + \varepsilon_l,$$
where $\varepsilon_l = (\varepsilon_{l1}, \ldots, \varepsilon_{ln})$ are independent zero-mean random errors. In this model, the
number of estimated parameters depends on the numbers of predictors and responses,
and can be much higher than that of the univariate- or two-response model. In this case,
sufficient dimension reduction is a useful technique. Similar to the two-response model,
we have a brief assumption for the above modeling. Note that $R \in \mathbb{R}^{K_1} \times \cdots \times \mathbb{R}^{K_p}$,
where $R_j := (1_{\{x_j = k\}})_{k=1}^{K_j} \in \mathbb{R}^{K_j}$. For each response, the linear combination of $R$ preserves
all the information in $R$ for modeling the response, i.e., for $l = 1, \ldots, q$,
$$y_l \perp\!\!\!\perp R \mid \theta_l^{\top} R.$$
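To make the remark about the number of parameters concrete, the short Python sketch below counts the raw coefficients in the multiple-response model: one intercept and one coefficient per level of each predictor, for each response. The function name and the example sizes are hypothetical, and the count ignores any identifiability constraints that an actual fit would impose.

```python
def parameter_count(levels, q):
    """Raw coefficient count: q intercepts plus q * (K_1 + ... + K_p)."""
    return q * (1 + sum(levels))


# Hypothetical example: 50 predictors with 10 levels each.
levels = [10] * 50
for q in (1, 2, 5):
    print(q, parameter_count(levels, q))
```

The count grows linearly in the number of responses and in the total number of levels, which is why the parameter count can quickly exceed the sample size once several responses are modeled jointly.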
We assume that the responses interact only through the covariates. Combining this
assumption with the above model, we deduce the following:
$$y_1 \perp\!\!\!\perp \cdots \perp\!\!\!\perp y_q \mid \theta^{\top} R. \qquad (2)$$
We allow each linear combination to carry redundant information from its response only.
Note that $\theta^{\top} R$ differs from $\theta_l^{\top} R$ for $l = 1, \ldots, q$: the former only preserves the information
of $R$ regarding the interactions between the responses and, unlike the latter, does not need to preserve
the information about the modeling of each response. Under this
assumption, the dimension reduction technique is allowed to lose information about
the modeling of each response. Additionally, we prove in the following that $\theta^{\top} R$ exists
and that assumption (2) holds when $\theta_{l_1}^{\top} R \perp\!\!\!\perp \theta_{2}^{\top} R \mid \theta^{\top} R$.