Dimension reduction of high-dimension categorical data with two or multiple responses considering interactions between responses



Yuehan Yang

School of Statistics and Mathematics, Central University of Finance and Economics, Beijing, 102206, China

Abstract

This paper models categorical data with two or multiple responses, focusing
on the interactions between responses. We propose an efficient iterative procedure
based on sufficient dimension reduction. We study the theoretical guarantees of
the proposed method under the two- and multiple-response models, demonstrating
that the proposed estimator is unique and that, with high probability, the
proposed method recovers the oracle least squares estimators. For data analysis, we
demonstrate that the proposed method is efficient in the multiple-response model
and performs better than some existing methods built for multiple-response
models. We apply this modeling and the proposed method to an adult dataset and a
right heart catheterization dataset and obtain meaningful results.

Keywords: Categorical data, High-dimensional regression, Nonconvex penalty, Sufficient
dimension reduction

arXiv:2210.11811v1 [stat.ME] 21 Oct 2022

1 Introduction

Categorical data with multiple responses appear in many applications. For example, in
a right heart catheterization dataset, some researchers focused on patients who had been
diagnosed with sepsis in multiple organs (Luo et al., 2022). Meanwhile, patient health
data contain records of diagnoses within controlled vocabularies, as well as prescriptions,
thus generating categorical features with many levels (Stokell et al., 2021; Jensen
et al., 2012). Models that consider a univariate response are frequently studied. Related
research includes conducting significance tests (Tukey, 1949; Calinski and Corsten, 1985)
and fusing the levels of categories in the linear regression setting (Breiman et al.,
2017; Bondell and Reich, 2009; Post and Bondell, 2013; Pauger and Wagner, 2019).

Regarding a univariate response, a valuable technique is to consider an analysis of
variance (ANOVA) model that relates the response to categorical predictors. Based on
this model, Bondell and Reich (2009) proposed the collapsing and shrinkage in ANOVA
(CAS-ANOVA) penalty, which balances the effects of certain levels of the categories
being more prevalent than others. Similarly, Pauger and Wagner (2019) proposed a
Bayesian approach to encourage level fusion. Further, Stokell et al. (2021) proposed the
SCOPE estimator, which fuses the levels of several categories by equating the corresponding
coefficients. Other studies considered a tree-based model for hierarchical categorical
predictors (Carrizosa et al., 2022), clustered the categories of categorical predictors in
generalized linear models (Carrizosa et al., 2021), and gave a systematic overview of
penalty-based methods for categorical data (Tutz and Gertheiss, 2016).

However, in certain applications, the data comprise two or multiple responses.
For example, Little and Rubin (2019) described missing data in which one response is
a partially missing variable while the other is the missing indicator of the former. An
efficient strategy for studying multiple responses is to introduce sufficient dimension
reduction. With two responses, Ding et al. (2020) studied dimension reduction in
survival analysis; therein, one response is of interest while the other is a nuisance variable.
Furthermore, Luo et al. (2022) studied dimension reduction regarding the interaction
between two responses. De Luna et al. (2011) proposed an iterative two-step procedure
to derive a minimal balancing score that connects with locally efficient dimension
reduction in causal inference.

Other dimension reduction methods for different goals have also been studied in the
literature, e.g., missing data analysis (Guo et al., 2018) and causal inference (Ma et al.,
2019; Luo and Zhu, 2020). However, none of the foregoing studies considered categorical
data analysis.

Although categorical data have drawn enormous attention, there is still a lack of studies on
the multiple-response model, including efficient algorithms and reliable theoretical
guarantees. Thus, the goal of this paper is to construct a model and develop a method
for estimating high-dimensional linear models with categorical data, considering the
interactions between the responses. The first contribution of this paper is to establish the
statistical modeling of categorical data with two or multiple responses. Although
many studies have analyzed categorical data with the univariate-response model, it is
unclear whether these analyses hold for data comprising two or multiple responses;
moreover, it is also unclear how to study the multiple-response model with the techniques used
in the univariate-response model. To fill this gap, we establish the dimension reduction
theory for categorical data by applying the sufficient dimension reduction technique and
the locally efficient dimension reduction subspace. We construct a model that considers
the interactions between response variables, which is a new research problem with
wide applications in categorical data analysis.

The second contribution of this paper is to propose an efficient iterative procedure for
analyzing categorical data via a multiple-response model. The proposed method extends
the algorithm of De Luna et al. (2011) to handle multiple responses with categorical
covariates. We show that the resulting estimator coincides with the least squares solution
in the multiple-response case. Most relevant to our theoretical results, Stokell et al. (2021)
established theoretical results for their estimator, which fuses the category levels
but only for a univariate response. Additionally, we show that the proposed procedure is
efficient in simulations and applications. We apply the procedure to two real datasets, an adult
dataset and a right heart catheterization dataset, and both data analyses demonstrate the
effectiveness of the model and method.

2 Models and methods

2.1 Notation and models

We first introduce the notation, as well as the two-response model. Consider an ANOVA
model relating two responses and categorical predictors. Set two response variables, $y_1$ and
$y_2$, and the categorical predictors, $X$, where $X = (X_1, \dots, X_p)$ and $X_j = (x_{1j}, \dots, x_{nj})$.
We have $x_{ij} \in \{1, \dots, K_j\}$, where $j = 1, \dots, p$. Further, we set the coefficient parameters,
$(\mu_1, \theta_1)$ and $(\mu_2, \theta_2)$, corresponding to the two responses respectively. Therein, $\mu_1$ and $\mu_2$
are intercepts and $\theta_1, \theta_2 \in \mathbb{R}^{K_1} \times \cdots \times \mathbb{R}^{K_p}$, where
$\theta_{1j} := (\theta_{1jk})_{k=1}^{K_j} \in \mathbb{R}^{K_j}$ and $\theta_{2j} := (\theta_{2jk})_{k=1}^{K_j} \in \mathbb{R}^{K_j}$.
Here $\theta_{1jk}$ and $\theta_{2jk}$ are the coefficients of the $k$th level of the $j$th predictor for
responses $y_1$ and $y_2$, respectively. Consider the following ANOVA models that relate the
two responses and categorical predictors:

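$$y_1 = \mu_1 + \sum_{j=1}^{p} \sum_{k=1}^{K_j} \theta_{1jk} 1_{\{x_j = k\}} + \varepsilon_1,$$

$$y_2 = \mu_2 + \sum_{j=1}^{p} \sum_{k=1}^{K_j} \theta_{2jk} 1_{\{x_j = k\}} + \varepsilon_2,$$

where $\varepsilon_1 = (\varepsilon_{11}, \dots, \varepsilon_{1n})$ and $\varepsilon_2 = (\varepsilon_{21}, \dots, \varepsilon_{2n})$ are independent zero-mean random
errors. In this model, the target is to properly estimate $(\mu_1, \theta_1)$ and $(\mu_2, \theta_2)$ regarding the
interaction between the two responses.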
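As an illustration of the two-response ANOVA model above, the following minimal Python sketch draws data from it. It is not the paper's code: the function name `simulate_two_response_anova`, the Gaussian errors, the uniform sampling of levels, and the coding of levels from 0 instead of 1 are all assumptions made here for the example.

```python
import random

def simulate_two_response_anova(n, mu, theta, noise_sd=1.0, seed=0):
    """Draw n observations from the two-response ANOVA model.

    mu    : (mu_1, mu_2), the two intercepts
    theta : (theta_1, theta_2); theta_l[j][k] is the coefficient of
            level k of predictor j for response l (levels coded from 0)
    """
    rng = random.Random(seed)
    levels = [len(theta_j) for theta_j in theta[0]]  # K_1, ..., K_p
    X, y1, y2 = [], [], []
    for _ in range(n):
        # Assumption: each predictor takes one of its levels uniformly at random.
        x = [rng.randrange(K) for K in levels]
        X.append(x)
        for mu_l, theta_l, y_l in zip(mu, theta, (y1, y2)):
            # The double sum over j and k reduces to looking up the
            # coefficient of the observed level of each predictor.
            signal = sum(theta_l[j][x[j]] for j in range(len(levels)))
            # Assumption: Gaussian noise stands in for the zero-mean errors.
            y_l.append(mu_l + signal + rng.gauss(0.0, noise_sd))
    return X, y1, y2

# Two predictors with K_1 = 3 and K_2 = 2 levels (hypothetical coefficients).
theta_1 = [[0.0, 1.0, 1.0], [0.0, 2.0]]
theta_2 = [[0.5, 0.5, -1.0], [0.0, 0.0]]
X, y1, y2 = simulate_two_response_anova(5, (1.0, -1.0), (theta_1, theta_2))
```

Both responses are generated from the same draw of the predictors, so in this sketch they are related only through the covariates.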

We introduce the dimension reduction technique into the above model. Set
$R = (R_1, \dots, R_p) \in \mathbb{R}^{K_1} \times \cdots \times \mathbb{R}^{K_p}$ such that $R_j := (1_{\{x_j = k\}})_{k=1}^{K_j} \in \mathbb{R}^{K_j}$. Sufficient
dimension reduction assumes the existence of a low-dimensional linear combination of $R$
that preserves the information in $R$ about the response.
Namely, we have $y_1 \perp R \mid \theta_1 \otimes R$, where $\perp$ denotes statistical independence.
Consider the other response, $y_2$, which interacts with $y_1$. Specifically, for
$(y_1, y_2)$, we consider the situation in which they interact only through the covariates,
i.e., $y_1 \perp y_2 \mid X$. Combining this with $y_1 \perp X \mid \theta_1 \otimes R$, we deduce a brief assumption under
sufficient dimension reduction,

$$y_1 \perp y_2 \mid \theta \otimes R. \tag{1}$$

Similarly, for $y_2$ we have $y_2 \perp X \mid \theta_2 \otimes R$. Thus $\theta_1 \otimes R$ and $\theta_2 \otimes R$ contain the information of $y_1$
and $y_2$, respectively. In particular, we allow each linear combination to contain redundant
information only from its own response. The interaction between $y_1$ and $y_2$ is flexible,
which is convenient for the modeling that follows. We illustrate this theoretically in
the next section by showing that assumption (1) holds and that a low-dimensional $\theta \otimes R$ exists
for sufficient dimension reduction when $\theta_1 \otimes R \perp \theta_2 \otimes R \mid \theta \otimes R$.
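The indicator vector $R$ and a linear combination of it can be sketched in a few lines of Python. This is an illustration only. In particular, this section does not spell out the operator $\otimes$, so the code assumes that $\theta_l \otimes R$ can be read as the inner product of the flattened coefficients with the flattened indicators, which is the linear predictor of the ANOVA model. The function names and the coding of levels from 0 are also choices made here.

```python
def indicator_vector(x, levels):
    """Flattened R = (R_1, ..., R_p) with R_j = (1{x_j = k}) for k = 0..K_j-1."""
    R = []
    for x_j, K_j in zip(x, levels):
        R.extend(1.0 if x_j == k else 0.0 for k in range(K_j))
    return R

def linear_combination(theta_l, R):
    """Inner product of the flattened coefficients with R.

    Assumption: this is one reading of theta_l (x) R, not a definition
    taken from the paper.
    """
    flat = [c for theta_lj in theta_l for c in theta_lj]
    return sum(c * r for c, r in zip(flat, R))

levels = [3, 2]                          # K_1 = 3, K_2 = 2
theta_1 = [[0.0, 1.0, 1.0], [0.0, 2.0]]  # hypothetical coefficients
x = [2, 1]                               # observed levels, coded from 0
R = indicator_vector(x, levels)          # [0.0, 0.0, 1.0, 0.0, 1.0]
index_1 = linear_combination(theta_1, R) # 1.0 + 2.0 = 3.0
```

Exactly one entry of each block $R_j$ is nonzero, so the inner product picks out one coefficient per predictor, matching the double sum in the ANOVA model.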

Extending the aforementioned two-response model, we introduce the multiple-response
model. For the responses $(y_1, \dots, y_q)$, we set the coefficient parameters $(\mu_l, \theta_l)$, where
$\mu_1, \dots, \mu_q$ are intercepts and $\theta_1, \dots, \theta_q \in \mathbb{R}^{K_1} \times \cdots \times \mathbb{R}^{K_p}$, with $\theta_{lj} := (\theta_{ljk})_{k=1}^{K_j} \in \mathbb{R}^{K_j}$.
Here $\theta_{ljk}$ denotes the coefficient of the $k$th level of the $j$th predictor for the response $y_l$. Then,
we construct the following models for $l = 1, \dots, q$:

$$y_l = \mu_l + \sum_{j=1}^{p} \sum_{k=1}^{K_j} \theta_{ljk} 1_{\{x_j = k\}} + \varepsilon_l,$$

where $\varepsilon_l = (\varepsilon_{l1}, \dots, \varepsilon_{ln})$ are independent zero-mean random errors. In this model, the
number of estimated parameters depends on the numbers of predictors and responses,
and can be much higher than that of the univariate- or two-response model. In this case,
sufficient dimension reduction is a useful technique. Similar to the two-response model,
we have a brief assumption for the above modeling. Recall that $R \in \mathbb{R}^{K_1} \times \cdots \times \mathbb{R}^{K_p}$,
where $R_j := (1_{\{x_j = k\}})_{k=1}^{K_j} \in \mathbb{R}^{K_j}$. For each response, the linear combination of $R$ preserves
all the information in $R$ for modeling the response, i.e., for $l = 1, \dots, q$,

$$y_l \perp R \mid \theta_l \otimes R.$$
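To see how quickly the number of parameters in the multiple-response model grows, the short Python sketch below counts them. It is a rough illustration: the helper `count_parameters` is hypothetical, and it counts every intercept and level coefficient without any identifiability constraint, which is an assumption made here.

```python
def count_parameters(levels, q):
    """q intercepts plus q * (K_1 + ... + K_p) level coefficients."""
    return q * (1 + sum(levels))

levels = [3, 2]                          # K_1 = 3, K_2 = 2
univariate = count_parameters(levels, 1) # 6
two_resp = count_parameters(levels, 2)   # 12
five_resp = count_parameters(levels, 5)  # 30
```

The count is linear in $q$ and in the total number of levels, so many responses combined with many-level predictors can easily exceed the sample size.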

We assume that the responses interact only through the covariates. Combining this
with the above model, we deduce the following assumption:

$$y_1 \perp \cdots \perp y_q \mid \theta \otimes R. \tag{2}$$

We allow each linear combination to carry redundant information from its response only.
Note that $\theta \otimes R$ differs from $\theta_l \otimes R$ for $l = 1, \dots, q$: the former only preserves the information
of $R$ regarding the interactions between the responses and, unlike the latter, does not need to preserve
the information about the modeling of each response. Under this
assumption, the dimension reduction is allowed to lose information about
the modeling of each response. Additionally, we prove in the following that $\theta \otimes R$ exists
and that assumption (2) holds when $\theta_{l_1} \otimes R \perp \theta_{l_2} \otimes R \mid \theta \otimes R$ for any $l_1 \neq l_2$.
