Improving Group Lasso for high-dimensional
categorical data
Szymon Nowakowski*†, Piotr Pokarowski‡, Wojciech Rejchel§, Agnieszka Sołtys¶
Abstract
Sparse modelling or model selection with categorical data is challenging
even for a moderate number of variables, because roughly one parameter is
needed to encode one category or level. The Group Lasso is a well-known,
efficient algorithm for the selection of continuous or categorical variables,
but all estimates related to a selected factor usually differ. Therefore, a
fitted model may not be sparse, which makes model interpretation difficult.
To obtain a sparse solution of the Group Lasso we propose the following
two-step procedure: first, we reduce data dimensionality using the Group
Lasso; then, to choose the final model, we use an information criterion on a
small family of models prepared by clustering levels of individual factors.
We investigate the selection correctness of the algorithm in a sparse
high-dimensional scenario. We also test our method on synthetic as well as
real datasets and show that it performs better than state-of-the-art
algorithms with respect to prediction accuracy or model dimension.
1 Introduction
Data sets containing categorical variables (factors) are common in statistics and
machine learning. The analysis of such data sets is much more challenging than
for those having only numerical variables. There are two main reasons for this:
*Faculty of Physics, University of Warsaw, Pasteura 5, 02-093 Warsaw, Poland
†Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, sd.nowakowski2@uw.edu.pl
‡Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, pokar@mimuw.edu.pl
§Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, 87-100 Toruń, Poland, wrejchel@gmail.com
¶Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, agnieszkaprochenka@gmail.com
the first one relates to the fact that a factor with $k$ levels is usually encoded as
$k-1$ dummy variables, so $k-1$ parameters need to be learned (there is only one
such parameter for a numerical predictor). Therefore, dimensionality reduction
is crucial, which leads to the second reason: dimensionality reduction
is much more involved for factors than for numerical predictors. Indeed, for the
latter we have only two possibilities (leave or delete a predictor). However, for a
categorical predictor we can either delete it or merge some of its levels. In the latter
case the number of possibilities grows very quickly. For instance, consider the factor
corresponding to the continent in which a person (client, patient etc.) lives or a company
(university etc.) is located. This factor has 6 levels (Antarctica is not considered),
which gives 203 possibilities (they are usually called partitions).
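The count 203 is the number of set partitions of a 6-element set, i.e. the 6th Bell number. The following minimal Python sketch (ours, for illustration only) computes it with the Bell triangle recurrence.

```python
def bell_number(k: int) -> int:
    """Number of set partitions of a k-element set (the k-th Bell number)."""
    row = [1]                      # first row of the Bell triangle
    for _ in range(k - 1):
        new_row = [row[-1]]        # a row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

print(bell_number(6))  # 203 possible partitions of the six continent levels
```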
Due to that, it is really difficult to develop efficient algorithms for categorical
data and to investigate their statistical properties. This is especially apparent when
the number of factors and/or the numbers of their levels are large. Thus, this
topic had not been studied intensively and the corresponding literature had been
relatively modest. However, categorical data are so common that this situation was
bound to change. Indeed, many papers investigating categorical data have appeared in the
last few years, among others García-Donato and Paulo [11], Pauger and Wagner
[23], Rabinowicz and Rosset [28, 27], Simchoni and Rosset [31], Stokell et al.
[33].
In the current paper we consider high-dimensional selection and sparse prediction,
where the number of active variables is significantly smaller than the learning
sample size $n$ and the number of all variables $p$ greatly exceeds $n$. Popular methods
fitting sparse predictive models to high-dimensional data do not merge levels of
factors: the Lasso [34] treats dummy variables as separate, binary predictors, the
Group Lasso [40] can only keep or delete a whole factor and the Sparse Group
Lasso [32] additionally removes levels of selected factors. The Fused Lasso [35]
can merge levels, but only in the simplified case where the variables are ordered. These
methods significantly reduce the number of parameters and select variables, which
may be an input for further, interpretable dimension reduction techniques. However,
they do not perform partition selection, that is, they cannot choose models
consisting of subsets of numerical variables and partitions of levels of factors.
In the mainstream research on Lasso-type algorithms, the CAS-ANOVA
method [5] fits sparse linear models with fusion of factor levels using the $\ell_1$ penalty
imposed on differences between parameters corresponding to levels of a factor.
Implementations of CAS-ANOVA are provided in Gertheiss and Tutz [12] and Oelker
et al. [22]. An alternative to penalization is the greedy search algorithm from
Maj-Kańska et al. [20]. As we have already mentioned, a growing interest in partition
selection has been noticed recently. Pauger and Wagner [23] introduced a
Bayesian method for linear models based on a prior inducing fusion of levels. Another
approach to the problem from the Bayesian perspective is considered
in García-Donato and Paulo [11]. A frequentist method using linear mixed
models was presented in Simchoni and Rosset [31], where factors were treated as
random effects. A partition selection algorithm called SCOPE, which is based on
a regularized likelihood, can be found in Stokell et al. [33]. This procedure uses
a minimax concave penalty on differences between consecutive, sorted estimators
of coefficients for levels of a factor. Finally, tree-based algorithms are applied to
categorical data in Rabinowicz and Rosset [27].
Let us note that all the methods from the above paragraph which perform partition
selection are restricted to the classical scenario $p < n$, except SCOPE and DMR, which
in its new implementation is based on variables screened by the Group Lasso [26].
In this paper we present an improved as well as simplified version of the DMR
algorithm, called PDMR (Plain DMR), which is much more amenable to theoretical
analysis and also works better in numerical experiments. In particular:
1. We propose the following two-step procedure PDMR: first, we reduce data
dimensionality using the Group Lasso; then, to choose the final model, we use an
information criterion on a small family of models prepared by clustering levels of
individual factors based on the Group Lasso estimator. We prove in Theorem 1
that under weak conditions PDMR returns a sparse linear model containing
the true model, even if $p \gg n$. The proof is based on a new bound on the number
of partitions by generalized Poisson moments. It is worth noting that so far
there have been no theoretical results regarding the correctness of DMR selection for
high-dimensional data, while for SCOPE a property weaker than selection consistency
was proved in Stokell et al. [33]. Our result is also weaker than selection
consistency, but it relates directly to the output of our algorithm, whereas the result of
Stokell et al. [33, Theorem 6] concerns one of the blockwise optima of their objective
function, so the output of their algorithm need not possess this property.
2. To investigate the theoretical properties of PDMR we require probabilistic
inequalities for the estimation error of the Group Lasso. Such results exist in the
literature, but when applied to partition selection they give sub-optimal results.
Therefore, in Subsection 3.1 we prove a novel bound for the $\ell_\infty$ estimation error of the
weighted Group Lasso.
3. As a by-product of the results from Subsection 3.1 we obtain optimal weights
for the Group Lasso, which differ from those recommended by the authors of
this method [40]. Possibly, the new weights can improve the asymptotics of the Group
Lasso in the general scenario (not necessarily the orthogonal one considered in
Subsection 3.1) as well as its practical performance.
4. In theoretical considerations Lasso-type algorithms are defined for a single
penalty and return a single coefficient estimator. However, practical implementations
usually use nets of data-driven penalties and return lists of estimators. Our next
contribution is an analogous implementation of the PDMR algorithm. Next, in
numerical experiments on high-dimensional data, we compare PDMR, DMR and
SCOPE. We show on a set of linear models (in wider scenarios than in Stokell et al.
[33]) that PDMR gives results comparable to or better than SCOPE
and DMR in terms of RMSE. On several real data sets we observe that the results of
these three algorithms are similar. Moreover, PDMR and DMR are several dozen
times faster than SCOPE.
In the rest of this paper we describe the considered models and the PDMR algorithm.
We also present mathematical propositions describing the properties of our method;
all proofs are relegated to the supplementary materials. Finally, we compare PDMR
to other methods in numerical experiments.
2 Linear models and the algorithm
We consider independent data $(y_1, x_{1\cdot}), (y_2, x_{2\cdot}), \ldots, (y_n, x_{n\cdot})$, where $y_i \in \mathbb{R}$ is
a response variable and $x_{i\cdot} \in \mathbb{R}^p$ is a vector of predictors. Every vector of
predictors $x_{i\cdot}$ can consist of continuous predictors as well as categorical predictors.
We arrange them in the following way: $x_{i\cdot} = (x_{i1}^T, x_{i2}^T, \ldots, x_{ir}^T)^T$. Suppose that $x_{ik}$
corresponds to a categorical predictor (factor) for some $k \in \{1, \ldots, r\}$. The set of
levels of this factor is given by $\{0, 1, 2, \ldots, p_k\}$ and $x_{ik} \in \{0,1\}^{p_k}$ is a dummy
vector corresponding to the $k$-th predictor of the $i$-th object in a data set. So, a
reference level (say, the zero level) is not included in $x_{ik}$. The only exception relates
to the first factor, whose reference level is contained in $x_{i1}$. This special level
plays the role of an intercept. If necessary, we can rearrange the vectors of predictors
so that the first factor has $k = 1$. If $x_{ik}$ corresponds to a continuous
predictor, then simply $x_{ik} \in \mathbb{R}^{p_k}$ with $p_k = 1$. Therefore, the dimension of $x_{i\cdot}$ is
$p = 1 + \sum_{k=1}^{r} p_k$. Finally, let $X = [x_{1\cdot}, \ldots, x_{n\cdot}]^T$ be the $n \times p$ design matrix.
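To make the encoding concrete, here is a minimal sketch (ours, not from the paper) of one plausible reading of the coding described above: every factor with levels $\{0, \ldots, p_k\}$ contributes $p_k$ indicator columns (the reference level 0 is dropped), and for the first factor the level-0 column is additionally kept so that it can play the role of an intercept. The function name and the toy data are hypothetical.

```python
import numpy as np

def encode_factor(levels, n_levels, keep_reference=False):
    """Dummy-encode one factor with levels in {0, 1, ..., n_levels}.

    With keep_reference=False (factors k >= 2) the matrix has n_levels
    columns, one per non-zero level.  With keep_reference=True (the first
    factor) the level-0 column is kept as well.
    """
    levels = np.asarray(levels)
    start = 0 if keep_reference else 1
    cols = [(levels == j).astype(float) for j in range(start, n_levels + 1)]
    return np.column_stack(cols)

# Hypothetical toy data: factor 1 with levels {0,1,2}, factor 2 with levels {0,1}.
f1 = [0, 2, 1, 2]
f2 = [1, 0, 1, 1]
X = np.hstack([encode_factor(f1, 2, keep_reference=True),  # p_1 + 1 = 3 columns
               encode_factor(f2, 1)])                       # p_2 = 1 column
print(X.shape)  # (4, 4), i.e. p = 1 + p_1 + p_2
```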
We consider the linear model
$$
y_i = x_{i\cdot}^T \mathring{\beta} + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, n. \qquad (1)
$$
Coordinates of $\mathring{\beta}$ correspond to coordinates of the vector of predictors, that is,
$\mathring{\beta} = (\mathring{\beta}_1^T, \mathring{\beta}_2^T, \ldots, \mathring{\beta}_r^T)^T$, where
$\mathring{\beta}_1 = (\mathring{\beta}_{0,1}, \mathring{\beta}_{1,1}, \ldots, \mathring{\beta}_{p_1,1})^T \in \mathbb{R}^{p_1+1}$ and
$\mathring{\beta}_k = (\mathring{\beta}_{1,k}, \mathring{\beta}_{2,k}, \ldots, \mathring{\beta}_{p_k,k})^T \in \mathbb{R}^{p_k}$ for $k = 2, \ldots, r$.
Moreover, we suppose that the noise variables $\varepsilon_i$ are subgaussian with the same
parameter $\sigma > 0$, that is, for $i = 1, 2, \ldots, n$ and $u \in \mathbb{R}$ we have
$$
\mathbb{E}\exp(u\varepsilon_i) \leq \exp(\sigma^2 u^2/2). \qquad (2)
$$
The main examples of subgaussian noise variables are normal variables or those
having bounded supports.
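Indeed, for $\varepsilon_i \sim N(0, \sigma^2)$ a direct computation of the moment generating function gives equality in (2), while for mean-zero noise bounded by a constant $c$ Hoeffding's lemma yields (2) with $\sigma = c$:
$$
\mathbb{E}\exp(u\varepsilon_i) = \int_{-\infty}^{\infty} e^{ut}\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-t^2/(2\sigma^2)}\, dt = \exp(\sigma^2 u^2/2),
\qquad
|\varepsilon_i| \leq c \ \Rightarrow\ \mathbb{E}\exp(u\varepsilon_i) \leq \exp(c^2 u^2/2).
$$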
2.1 Notations
Let $W_1 = \mathrm{diag}(w_{0,1}, w_{1,1}, \ldots, w_{p_1,1})$ and $W_k = \mathrm{diag}(w_{1,k}, \ldots, w_{p_k,k})$, $k = 2, \ldots, r$,
be diagonal nonrandom matrices with positive entries. Besides, $W = \mathrm{diag}(W_1, \ldots, W_r)$
is a $p \times p$ diagonal matrix with the matrices $W_k$ on the diagonal.
Next, for $\beta \in \mathbb{R}^p$ and $q \geq 1$ let $|\beta|_q = (\sum_{j=1}^{p} |\beta_j|^q)^{1/q}$ be the $\ell_q$ norm of $\beta$. The
only exception is the $\ell_2$ norm, for which we use the special notation $||\beta||$.

A feasible model is defined as a sequence $M = (P_1, P_2, \ldots, P_r)$. If the $k$-th
predictor is a factor, then $P_k$ is a particular partition of its levels. If the $k$-th
predictor is continuous, then $P_k \in \{\emptyset, \{k\}\}$. To make the notation coherent and
concise, we artificially augment each $\beta \in \mathbb{R}^p$ by $\beta_{0,k} = 0$, $k = 2, \ldots, r$. Notice
that every $\beta$ determines a model $M_\beta$ as follows: if the $k$-th predictor is a factor,
then the partition $P_k$ depends on which of the equalities $\beta_{j_1,k} = \beta_{j_2,k}$, $j_1 \neq j_2$, are
satisfied, each of which corresponds to merging levels $j_1$ and $j_2$. If the $k$-th predictor is
continuous, then $P_k = \{k\}$ when $\beta_k \neq 0$ and $P_k = \emptyset$ otherwise.

In the following we consider $k \in \{1, \ldots, r\}$, and for $k = 2, \ldots, r$ we have
$j \in \{1, \ldots, p_k\}$, while for $k = 1$ we have $j \in \{0, \ldots, p_1\}$. Let $x_{j,k}$ be the column
of $X$ corresponding to the $j$-th level of the $k$-th factor. The additional notations are
$x_M = \max_{j,k} ||x_{j,k}||$, $x_m = \min_{j,k} ||x_{j,k}||$, $x_W = \max_{j,k} ||x_{j,k}||/w_{j,k}$. Finally,
we define a crucial quantity
$$
\Delta = \min_{1 \leq k \leq r} \ \min_{0 \leq j_1, j_2 \leq p_k:\ \mathring{\beta}_{j_1,k} \neq \mathring{\beta}_{j_2,k}} |\mathring{\beta}_{j_1,k} - \mathring{\beta}_{j_2,k}|.
$$
It is involved in an analog of the beta-min condition, which plays an important role
when variable selection with continuous predictors is considered [7, 39].
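As an illustration (ours; the numbers are hypothetical), the following sketch shows how a coefficient vector determines the partition of one factor's levels in $M_\beta$ and how $\Delta$ is computed.

```python
from collections import defaultdict
from itertools import combinations

def partition_from_beta(beta_k):
    """Partition of one factor's levels induced by its coefficients.

    `beta_k` maps a level index j to its coefficient; for factors k >= 2
    the reference level 0 is included with coefficient 0, as in the
    augmented notation above.  Levels with equal coefficients are merged.
    """
    groups = defaultdict(set)
    for level, coef in beta_k.items():
        groups[coef].add(level)
    return list(groups.values())

def delta(beta_by_factor):
    """Smallest non-zero gap |beta_{j1,k} - beta_{j2,k}| within a factor."""
    gaps = [abs(b1 - b2)
            for beta_k in beta_by_factor
            for (j1, b1), (j2, b2) in combinations(beta_k.items(), 2)
            if b1 != b2]
    return min(gaps) if gaps else float("inf")

# Hypothetical factor with levels 0..3 and the augmented coefficient beta_{0,k} = 0.
beta_k = {0: 0.0, 1: 0.0, 2: 1.5, 3: 1.5}
print(partition_from_beta(beta_k))  # [{0, 1}, {2, 3}]: levels 0,1 merged, 2,3 merged
print(delta([beta_k]))              # 1.5
```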
2.2 The algorithm
To simplify the notation (and without loss of generality), we suppose that all considered
predictors are categorical. For the estimation of $\mathring{\beta}$ we consider the quadratic loss
function
$$
\ell(\beta) = \sum_{i=1}^{n} \left[ (x_{i\cdot}^T \beta)^2/2 - y_i\, x_{i\cdot}^T \beta \right] \qquad (3)
$$
as in maximum likelihood estimation. It is easy to see that
$\dot{\ell}(\beta) = \sum_{i=1}^{n} (x_{i\cdot}^T \beta - y_i)\, x_{i\cdot}$, where $\dot{\ell}$ denotes the derivative of $\ell$.
Besides, $\dot{\ell}(\mathring{\beta}) = -X^T \varepsilon$ for $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$.
Next, for $k = 1, \ldots, r$ the partial derivatives of $\ell(\beta)$ corresponding to the coordinates of
$\beta_k$ are denoted by $\dot{\ell}_k(\beta)$.
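A direct transcription of (3) and its derivative into Python (an illustrative sketch; the variable names and the synthetic data are ours):

```python
import numpy as np

def quadratic_loss(beta, X, y):
    """Loss (3): sum_i [ (x_i^T beta)^2 / 2 - y_i x_i^T beta ]."""
    xb = X @ beta
    return np.sum(xb ** 2 / 2.0 - y * xb)

def loss_gradient(beta, X, y):
    """Derivative of (3): sum_i (x_i^T beta - y_i) x_i = X^T (X beta - y)."""
    return X.T @ (X @ beta - y)

# Sanity check on synthetic data: at the true beta the gradient equals -X^T eps.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
beta0 = np.array([1.0, 0.0, -2.0, 0.5])
eps = rng.standard_normal(50)
y = X @ beta0 + eps
print(np.allclose(loss_gradient(beta0, X, y), -X.T @ eps))  # True
```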
We present PDMR, which consists of two steps:

(1) Screening: we compute the weighted Group Lasso
$$
\hat{\beta} = \arg\min_{\beta}\ \ell(\beta) + \lambda \sum_{k=1}^{r} ||W_k \beta_k||,
$$
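To make the screening step concrete, below is a minimal proximal-gradient (ISTA) sketch of the weighted Group Lasso, assuming $\lambda > 0$ is the tuning parameter of the penalty. It is an illustration under these assumptions, not the authors' implementation; the weighted penalty is handled by the reparametrization $\gamma = W\beta$, i.e. by rescaling the columns of $X$ by $1/w_{j,k}$ and solving a plain Group Lasso in $\gamma$.

```python
import numpy as np

def weighted_group_lasso(X, y, groups, weights, lam, n_iter=500):
    """Screening step (1): proximal-gradient sketch of the weighted Group Lasso.

    groups  : list of column-index lists, one per factor (one block per group).
    weights : positive weights w_{j,k}, one per column, i.e. the diagonal of W.
    """
    w = np.asarray(weights, dtype=float)
    Xs = X / w                                   # column rescaling: X W^{-1}
    step = 1.0 / np.linalg.norm(Xs, 2) ** 2      # 1 / Lipschitz constant of the gradient
    gamma = np.zeros(X.shape[1])                 # gamma = W beta
    for _ in range(n_iter):
        grad = Xs.T @ (Xs @ gamma - y)           # gradient of loss (3) in rescaled coordinates
        z = gamma - step * grad
        for g in groups:                         # block soft-thresholding, group by group
            norm = np.linalg.norm(z[g])
            z[g] = 0.0 if norm <= lam * step else (1 - lam * step / norm) * z[g]
        gamma = z
    return gamma / w                             # back to beta = W^{-1} gamma
```

In practice, as noted in contribution 4, such a solver would be run over a decreasing net of $\lambda$ values (typically with warm starts), returning a list of estimators rather than a single one.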