Improving Group Lasso for high-dimensional
categorical data
Szymon Nowakowski*†, Piotr Pokarowski‡, Wojciech Rejchel§, Agnieszka Sołtys¶
Abstract
Sparse modelling or model selection with categorical data is challenging
even for a moderate number of variables, because roughly one parameter is
needed to encode one category or level. The Group Lasso is a well-known,
efficient algorithm for the selection of continuous or categorical variables,
but all estimates related to a selected factor usually differ. Therefore, a
fitted model may not be sparse, which makes model interpretation difficult.
To obtain a sparse solution of the Group Lasso we propose the following
two-step procedure: first, we reduce data dimensionality using the Group
Lasso; then, to choose the final model, we use an information criterion on a
small family of models prepared by clustering levels of individual factors.
We investigate the selection correctness of the algorithm in a sparse
high-dimensional scenario. We also test our method on synthetic as well as
real datasets and show that it performs better than state-of-the-art
algorithms with respect to prediction accuracy or model dimension.
1 Introduction
Data sets containing categorical variables (factors) are common in statistics and
machine learning. The analysis of such data sets is much more challenging than
for those having only numerical variables. There are two main reasons for this:
*Faculty of Physics, University of Warsaw, Pasteura 5, 02-093 Warsaw, Poland
†Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, sd.nowakowski2@uw.edu.pl
‡Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, pokar@mimuw.edu.pl
§Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, 87-100 Toruń, Poland, wrejchel@gmail.com
¶Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, agnieszkaprochenka@gmail.com
the first one relates to the fact that a factor with $k$ levels is usually encoded as
$k-1$ dummy variables, so $k-1$ parameters need to be learned (there is only one
such parameter for a numerical predictor). Therefore, dimensionality reduction
is crucial, which leads to the second reason: dimensionality reduction
is much more involved for factors than for numerical predictors. Indeed, for the
latter we have only two possibilities (leave or delete a predictor). However, for a
categorical predictor we can either delete it or merge some of its levels. In the latter
case the number of possibilities grows very quickly. For instance, consider the factor
corresponding to the continent in which a person (client, patient etc.) lives or a company
(university etc.) is located. This factor has 6 levels (Antarctica is not considered),
which gives 203 possibilities (they are usually called partitions).
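The count 203 is the number of set partitions of a 6-element set, i.e. the 6th Bell number. The following minimal Python sketch (ours, for illustration only) computes it with the Bell triangle recurrence.

```python
def bell_number(k: int) -> int:
    """Number of set partitions of a k-element set (the k-th Bell number)."""
    row = [1]                      # first row of the Bell triangle
    for _ in range(k - 1):
        new_row = [row[-1]]        # a row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

print(bell_number(6))  # 203 possible partitions of the six continent levels
```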
Due to that, it is really difficult to develop efficient algorithms for categorical
data and to investigate their statistical properties. This is especially apparent when
the number of factors and/or the numbers of their levels are large. Thus, this
topic had not been studied intensively and the corresponding literature had been
relatively modest. However, categorical data are so common that this situation was
bound to change. Indeed, many papers investigating categorical data have appeared in the
last few years, among others García-Donato and Paulo [11], Pauger and Wagner
[23], Rabinowicz and Rosset [28, 27], Simchoni and Rosset [31], Stokell et al.
[33].
In the current paper we consider high-dimensional selection and sparse prediction,
where the number of active variables is significantly smaller than the learning
sample size $n$ and the number of all variables $p$ greatly exceeds $n$. Popular methods
fitting sparse predictive models to high-dimensional data do not merge levels of
factors: the Lasso [34] treats dummy variables as separate, binary predictors, the
Group Lasso [40] can only keep or delete a whole factor and the Sparse Group
Lasso [32] additionally removes levels of selected factors. The Fused Lasso [35]
can merge levels, but only in the simplified case where the variables are ordered. These
methods significantly reduce the number of parameters and select variables, which
may be an input for further, interpretable dimension reduction techniques. However,
they do not perform partition selection, that is, they cannot choose models
consisting of subsets of numerical variables and partitions of levels of factors.
In the mainstream research on Lasso-type algorithms, the CAS-ANOVA
method [5] fits sparse linear models with fusion of factor levels using the $\ell_1$ penalty
imposed on differences between parameters corresponding to levels of a factor.
Implementations of CAS-ANOVA are provided in Gertheiss and Tutz [12] and Oelker
et al. [22]. An alternative to penalization is the greedy search algorithm from
Maj-Kańska et al. [20]. As we have already mentioned, a growing interest in partition
selection has been noticed recently. Pauger and Wagner [23] introduced a
Bayesian method for linear models based on a prior inducing fusion of levels. Another
approach to the problem from the Bayesian perspective is considered
in García-Donato and Paulo [11]. A frequentist method using linear mixed
models was presented in Simchoni and Rosset [31], where factors were treated as
random effects. A partition selection algorithm called SCOPE, which is based on
a regularized likelihood, can be found in Stokell et al. [33]. This procedure uses
a minimax concave penalty on differences between consecutive, sorted estimators
of coefficients for levels of a factor. Finally, tree-based algorithms are applied to
categorical data in Rabinowicz and Rosset [27].
Let us note that all the methods from the above paragraph which perform partition
selection are restricted to the classical scenario $p < n$, except SCOPE and DMR, which
in its new implementation is based on variables screened by the Group Lasso [26].
In this paper we present an improved as well as simplified version of the DMR
algorithm, called PDMR (Plain DMR), which is much more amenable to theoretical
analysis and also works better in numerical experiments. In particular:
1. We propose the following two-step procedure PDMR: first, we reduce data
dimensionality using the Group Lasso; then, to choose the final model, we use an
information criterion on a small family of models prepared by clustering levels of
individual factors based on the Group Lasso estimator. We prove in Theorem 1
that under weak conditions PDMR returns a sparse linear model containing
the true model, even if $p \gg n$. The proof is based on a new bound on the number
of partitions by generalized Poisson moments. It is worth noting that so far
there have been no theoretical results regarding the correctness of DMR selection for
high-dimensional data, while for SCOPE a property weaker than selection consistency
was proved in Stokell et al. [33]. Our result is also weaker than selection
consistency, but it relates directly to the output of our algorithm, whereas the result of
Stokell et al. [33, Theorem 6] concerns one of the blockwise optima of their objective
function, so the output of their algorithm need not possess this property.
2. To investigate the theoretical properties of PDMR we require probabilistic
inequalities for the estimation error of the Group Lasso. Such results exist in the
literature, but when applied to partition selection they give sub-optimal results.
Therefore, in Subsection 3.1 we prove a novel bound for the $\ell_\infty$ estimation error of the
weighted Group Lasso.
3. As a by-product of the results from Subsection 3.1 we obtain optimal weights
for the Group Lasso, which differ from those recommended by the authors of
this method [40]. Possibly, the new weights can improve the asymptotics of the Group
Lasso in the general scenario (not necessarily the orthogonal one considered in
Subsection 3.1) as well as its practical performance.
4. In theoretical considerations Lasso-type algorithms are defined for a single
penalty and return a single coefficient estimator. However, practical implementations
usually use nets of data-driven penalties and return lists of estimators. Our next
contribution is an analogous implementation of the PDMR algorithm. Next, in
numerical experiments on high-dimensional data, we compare PDMR, DMR and
SCOPE. We show on a set of linear models (in wider scenarios than in Stokell et al.
[33]) that PDMR gives results comparable to or better than SCOPE
and DMR in terms of RMSE. On several real data sets we observe that the results of
these three algorithms are similar. Moreover, PDMR and DMR are several dozen
times faster than SCOPE.
In the rest of this paper we describe the considered models and the PDMR algorithm.
We also present mathematical propositions describing the properties of our method;
all proofs are relegated to the supplementary materials. Finally, we compare PDMR
to other methods in numerical experiments.
2 Linear models and the algorithm
We consider independent data $(y_1, x_{1\cdot}), (y_2, x_{2\cdot}), \ldots, (y_n, x_{n\cdot})$, where $y_i \in \mathbb{R}$ is
a response variable and $x_{i\cdot} \in \mathbb{R}^p$ is a vector of predictors. Every vector of
predictors $x_{i\cdot}$ can consist of continuous predictors as well as categorical predictors.
We arrange them in the following way: $x_{i\cdot} = (x_{i1}^T, x_{i2}^T, \ldots, x_{ir}^T)^T$. Suppose that $x_{ik}$
corresponds to a categorical predictor (factor) for some $k \in \{1, \ldots, r\}$. The set of
levels of this factor is given by $\{0, 1, 2, \ldots, p_k\}$ and $x_{ik} \in \{0,1\}^{p_k}$ is a dummy
vector corresponding to the $k$-th predictor of the $i$-th object in a data set. So, a
reference level (say, the zero level) is not included in $x_{ik}$. The only exception relates
to the first factor, whose reference level is contained in $x_{i1}$. This special level
plays the role of an intercept. If necessary, we can rearrange the vectors of predictors
so that the first factor has $k = 1$. If $x_{ik}$ corresponds to a continuous
predictor, then simply $x_{ik} \in \mathbb{R}^{p_k}$ with $p_k = 1$. Therefore, the dimension of $x_{i\cdot}$ is
$p = 1 + \sum_{k=1}^{r} p_k$. Finally, let $X = [x_{1\cdot}, \ldots, x_{n\cdot}]^T$ be the $n \times p$ design matrix.
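To make the encoding concrete, here is a minimal sketch (ours, not from the paper) of one plausible reading of the coding described above: every factor with levels $\{0, \ldots, p_k\}$ contributes $p_k$ indicator columns (the reference level 0 is dropped), and for the first factor the level-0 column is additionally kept so that it can play the role of an intercept. The function name and the toy data are hypothetical.

```python
import numpy as np

def encode_factor(levels, n_levels, keep_reference=False):
    """Dummy-encode one factor with levels in {0, 1, ..., n_levels}.

    With keep_reference=False (factors k >= 2) the matrix has n_levels
    columns, one per non-zero level.  With keep_reference=True (the first
    factor) the level-0 column is kept as well.
    """
    levels = np.asarray(levels)
    start = 0 if keep_reference else 1
    cols = [(levels == j).astype(float) for j in range(start, n_levels + 1)]
    return np.column_stack(cols)

# Hypothetical toy data: factor 1 with levels {0,1,2}, factor 2 with levels {0,1}.
f1 = [0, 2, 1, 2]
f2 = [1, 0, 1, 1]
X = np.hstack([encode_factor(f1, 2, keep_reference=True),  # p_1 + 1 = 3 columns
               encode_factor(f2, 1)])                       # p_2 = 1 column
print(X.shape)  # (4, 4), i.e. p = 1 + p_1 + p_2
```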
We consider the linear model
$$
y_i = x_{i\cdot}^T \mathring{\beta} + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, n. \qquad (1)
$$
Coordinates of $\mathring{\beta}$ correspond to coordinates of the vector of predictors, that is,
$\mathring{\beta} = (\mathring{\beta}_1^T, \mathring{\beta}_2^T, \ldots, \mathring{\beta}_r^T)^T$, where
$\mathring{\beta}_1 = (\mathring{\beta}_{0,1}, \mathring{\beta}_{1,1}, \ldots, \mathring{\beta}_{p_1,1})^T \in \mathbb{R}^{p_1+1}$ and
$\mathring{\beta}_k = (\mathring{\beta}_{1,k}, \mathring{\beta}_{2,k}, \ldots, \mathring{\beta}_{p_k,k})^T \in \mathbb{R}^{p_k}$ for $k = 2, \ldots, r$.
Moreover, we suppose that the noise variables $\varepsilon_i$ are subgaussian with the same
parameter $\sigma > 0$, that is, for $i = 1, 2, \ldots, n$ and $u \in \mathbb{R}$ we have
$$
\mathbb{E}\exp(u\varepsilon_i) \leq \exp(\sigma^2 u^2/2). \qquad (2)
$$
The main examples of subgaussian noise variables are normal variables or those
having bounded supports.
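Indeed, for $\varepsilon_i \sim N(0, \sigma^2)$ a direct computation of the moment generating function gives equality in (2), while for mean-zero noise bounded by a constant $c$ Hoeffding's lemma yields (2) with $\sigma = c$:
$$
\mathbb{E}\exp(u\varepsilon_i) = \int_{-\infty}^{\infty} e^{ut}\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-t^2/(2\sigma^2)}\, dt = \exp(\sigma^2 u^2/2),
\qquad
|\varepsilon_i| \leq c \ \Rightarrow\ \mathbb{E}\exp(u\varepsilon_i) \leq \exp(c^2 u^2/2).
$$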
2.1 Notations
Let $W_1 = \mathrm{diag}(w_{0,1}, w_{1,1}, \ldots, w_{p_1,1})$ and $W_k = \mathrm{diag}(w_{1,k}, \ldots, w_{p_k,k})$, $k = 2, \ldots, r$,
be diagonal nonrandom matrices with positive entries. Besides, $W = \mathrm{diag}(W_1, \ldots, W_r)$
is a $p \times p$ diagonal matrix with the matrices $W_k$ on the diagonal.
Next, for $\beta \in \mathbb{R}^p$ and $q \geq 1$ let $|\beta|_q = (\sum_{j=1}^{p} |\beta_j|^q)^{1/q}$ be the $\ell_q$ norm of $\beta$. The
only exception is the $\ell_2$ norm, for which we use the special notation $||\beta||$.

A feasible model is defined as a sequence $M = (P_1, P_2, \ldots, P_r)$. If the $k$-th
predictor is a factor, then $P_k$ is a particular partition of its levels. If the $k$-th
predictor is continuous, then $P_k \in \{\emptyset, \{k\}\}$. To make the notation coherent and
concise, we artificially augment each $\beta \in \mathbb{R}^p$ by $\beta_{0,k} = 0$, $k = 2, \ldots, r$. Notice
that every $\beta$ determines a model $M_\beta$ as follows: if the $k$-th predictor is a factor,
then the partition $P_k$ depends on which of the equalities $\beta_{j_1,k} = \beta_{j_2,k}$, $j_1 \neq j_2$, are
satisfied, each of which corresponds to merging levels $j_1$ and $j_2$. If the $k$-th predictor is
continuous, then $P_k = \{k\}$ when $\beta_k \neq 0$ and $P_k = \emptyset$ otherwise.

In the following we consider $k \in \{1, \ldots, r\}$, and for $k = 2, \ldots, r$ we have
$j \in \{1, \ldots, p_k\}$, while for $k = 1$ we have $j \in \{0, \ldots, p_1\}$. Let $x_{j,k}$ be the column
of $X$ corresponding to the $j$-th level of the $k$-th factor. The additional notations are
$x_M = \max_{j,k} ||x_{j,k}||$, $x_m = \min_{j,k} ||x_{j,k}||$, $x_W = \max_{j,k} ||x_{j,k}||/w_{j,k}$. Finally,
we define a crucial quantity
$$
\Delta = \min_{1 \leq k \leq r} \ \min_{0 \leq j_1, j_2 \leq p_k:\ \mathring{\beta}_{j_1,k} \neq \mathring{\beta}_{j_2,k}} |\mathring{\beta}_{j_1,k} - \mathring{\beta}_{j_2,k}|.
$$
It is involved in an analog of the beta-min condition, which plays an important role
when variable selection with continuous predictors is considered [7, 39].
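As an illustration (ours; the numbers are hypothetical), the following sketch shows how a coefficient vector determines the partition of one factor's levels in $M_\beta$ and how $\Delta$ is computed.

```python
from collections import defaultdict
from itertools import combinations

def partition_from_beta(beta_k):
    """Partition of one factor's levels induced by its coefficients.

    `beta_k` maps a level index j to its coefficient; for factors k >= 2
    the reference level 0 is included with coefficient 0, as in the
    augmented notation above.  Levels with equal coefficients are merged.
    """
    groups = defaultdict(set)
    for level, coef in beta_k.items():
        groups[coef].add(level)
    return list(groups.values())

def delta(beta_by_factor):
    """Smallest non-zero gap |beta_{j1,k} - beta_{j2,k}| within a factor."""
    gaps = [abs(b1 - b2)
            for beta_k in beta_by_factor
            for (j1, b1), (j2, b2) in combinations(beta_k.items(), 2)
            if b1 != b2]
    return min(gaps) if gaps else float("inf")

# Hypothetical factor with levels 0..3 and the augmented coefficient beta_{0,k} = 0.
beta_k = {0: 0.0, 1: 0.0, 2: 1.5, 3: 1.5}
print(partition_from_beta(beta_k))  # [{0, 1}, {2, 3}]: levels 0,1 merged, 2,3 merged
print(delta([beta_k]))              # 1.5
```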
2.2 The algorithm
To simplify the notation (and without loss of generality), we suppose that all considered
predictors are categorical. For the estimation of $\mathring{\beta}$ we consider the quadratic loss
function
$$
\ell(\beta) = \sum_{i=1}^{n} \left[ (x_{i\cdot}^T \beta)^2/2 - y_i\, x_{i\cdot}^T \beta \right] \qquad (3)
$$
as in maximum likelihood estimation. It is easy to see that
$\dot{\ell}(\beta) = \sum_{i=1}^{n} (x_{i\cdot}^T \beta - y_i)\, x_{i\cdot}$, where $\dot{\ell}$ denotes the derivative of $\ell$.
Besides, $\dot{\ell}(\mathring{\beta}) = -X^T \varepsilon$ for $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$.
Next, for $k = 1, \ldots, r$ the partial derivatives of $\ell(\beta)$ corresponding to the coordinates of
$\beta_k$ are denoted by $\dot{\ell}_k(\beta)$.
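A direct transcription of (3) and its derivative into Python (an illustrative sketch; the variable names and the synthetic data are ours):

```python
import numpy as np

def quadratic_loss(beta, X, y):
    """Loss (3): sum_i [ (x_i^T beta)^2 / 2 - y_i x_i^T beta ]."""
    xb = X @ beta
    return np.sum(xb ** 2 / 2.0 - y * xb)

def loss_gradient(beta, X, y):
    """Derivative of (3): sum_i (x_i^T beta - y_i) x_i = X^T (X beta - y)."""
    return X.T @ (X @ beta - y)

# Sanity check on synthetic data: at the true beta the gradient equals -X^T eps.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
beta0 = np.array([1.0, 0.0, -2.0, 0.5])
eps = rng.standard_normal(50)
y = X @ beta0 + eps
print(np.allclose(loss_gradient(beta0, X, y), -X.T @ eps))  # True
```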
We present PDMR, which consists of two steps:

(1) Screening: we compute the weighted Group Lasso
$$
\hat{\beta} = \arg\min_{\beta}\ \ell(\beta) + \lambda \sum_{k=1}^{r} ||W_k \beta_k||,
$$
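To make the screening step concrete, below is a minimal proximal-gradient (ISTA) sketch of the weighted Group Lasso, assuming $\lambda > 0$ is the tuning parameter of the penalty. It is an illustration under these assumptions, not the authors' implementation; the weighted penalty is handled by the reparametrization $\gamma = W\beta$, i.e. by rescaling the columns of $X$ by $1/w_{j,k}$ and solving a plain Group Lasso in $\gamma$.

```python
import numpy as np

def weighted_group_lasso(X, y, groups, weights, lam, n_iter=500):
    """Screening step (1): proximal-gradient sketch of the weighted Group Lasso.

    groups  : list of column-index lists, one per factor (one block per group).
    weights : positive weights w_{j,k}, one per column, i.e. the diagonal of W.
    """
    w = np.asarray(weights, dtype=float)
    Xs = X / w                                   # column rescaling: X W^{-1}
    step = 1.0 / np.linalg.norm(Xs, 2) ** 2      # 1 / Lipschitz constant of the gradient
    gamma = np.zeros(X.shape[1])                 # gamma = W beta
    for _ in range(n_iter):
        grad = Xs.T @ (Xs @ gamma - y)           # gradient of loss (3) in rescaled coordinates
        z = gamma - step * grad
        for g in groups:                         # block soft-thresholding, group by group
            norm = np.linalg.norm(z[g])
            z[g] = 0.0 if norm <= lam * step else (1 - lam * step / norm) * z[g]
        gamma = z
    return gamma / w                             # back to beta = W^{-1} gamma
```

In practice, as noted in contribution 4, such a solver would be run over a decreasing net of $\lambda$ values (typically with warm starts), returning a list of estimators rather than a single one.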