Block-wise Primal-dual Algorithms for Large-scale Doubly
Penalized ANOVA Modeling
Penghui Fu¹ and Zhiqiang Tan¹
October 21, 2022
Abstract. For multivariate nonparametric regression, doubly penalized ANOVA modeling (DPAM) has recently been proposed, using hierarchical total variations (HTVs) and empirical norms as penalties on the component functions such as main effects and multi-way interactions in a functional ANOVA decomposition of the underlying regression function. The two penalties play complementary roles: the HTV penalty promotes sparsity in the selection of basis functions within each component function, whereas the empirical-norm penalty promotes sparsity in the selection of component functions. We adopt backfitting or block minimization for training DPAM, and develop two suitable primal-dual algorithms, including both batch and stochastic versions, for updating each component function in single-block optimization. Existing applications of primal-dual algorithms are intractable in our setting with both HTV and empirical-norm penalties. Through extensive numerical experiments, we demonstrate the validity and advantage of our stochastic primal-dual algorithms, compared with their batch versions and a previous active-set algorithm, in large-scale scenarios.
Key words and phrases. ANOVA modeling; Nonparametric regression; Penalized estimation; Primal-dual algorithms; Stochastic algorithms; Stochastic gradient methods; Total variation.
¹Department of Statistics, Rutgers University. Address: 110 Frelinghuysen Road, Piscataway, NJ 08854. E-mails: penghui.fu@rutgers.edu, ztan@stat.rutgers.edu.
arXiv:2210.10991v1 [stat.CO] 20 Oct 2022
1 Introduction
Consider functional analysis-of-variance (ANOVA) modeling for multivariate nonparametric
regression (e.g., Gu, 2013). Let $Y_i$ and $X_i = (X_{i1}, \ldots, X_{ip})^{\mathrm T}$, $i = 1, \ldots, n$, be a collection of independent observations of a response variable and a covariate vector. For continuous responses, nonparametric regression can be stated as
$$ Y_i = f(X_{i1}, \ldots, X_{ip}) + \varepsilon_i, \tag{1} $$
where $f(x) = f(x_1, \ldots, x_p)$ is an unknown function, and $\varepsilon_i$ is a noise with mean zero and a finite variance given $X_i$. In the framework of functional ANOVA modeling, the multivariate function $f$ is decomposed as
$$ f(x_1, \ldots, x_p) = f_0 + \sum_{1 \le j_1 \le p} f_{j_1}(x_{j_1}) + \sum_{1 \le j_1 < j_2 \le p} f_{j_1, j_2}(x_{j_1}, x_{j_2}) + \cdots + \sum_{1 \le j_1 < \cdots < j_K \le p} f_{j_1, \ldots, j_K}(x_{j_1}, \ldots, x_{j_K}), \tag{2} $$
where $f_0$ is a constant, the $f_{j_1}$'s are univariate functions representing main effects, the $f_{j_1, j_2}$'s are bivariate functions representing two-way interactions, etc., and $K$ is the maximum order of interactions allowed. The special case of (1)–(2) with $K = 1$ is known as additive modeling (Stone, 1986; Hastie & Tibshirani, 1990). For general $K \ge 2$, a notable example is smoothing spline ANOVA modeling (Wahba et al., 1995; Lin & Zhang, 2006; Gu, 2013), where the component functions in (2) are assumed to lie in tensor-product reproducing kernel Hilbert spaces (RKHSs), defined from univariate Sobolev-$L_2$ spaces as in smoothing splines. Alternatively, in Yang & Tan (2021), a class of hierarchical total variations (HTVs) is introduced to measure the roughness of main effects and multi-way interactions by properly extending the total variation associated with the univariate Sobolev-$L_1$ space. Compared with smoothing spline ANOVA modeling, this approach extends univariate regression splines using total variation penalties (Mammen & Van De Geer, 1997).
Recently, theory and methods have been expanded for additive and ANOVA modeling to high-dimensional settings, with $p$ close to or greater than $n$. Examples include Ravikumar et al. (2009), Meier et al. (2009), Koltchinskii & Yuan (2010), Radchenko & James (2010), Raskutti et al. (2012), Petersen et al. (2016), Tan & Zhang (2019), and Yang & Tan (2018, 2021), among others. An important idea from the high-dimensional methods is to employ
empirical $L_2$ norms of component functions as penalties, in addition to functional semi-norms which measure the roughness of the component functions, such as the Sobolev-$L_2$ semi-norm or total variation for univariate functions in the case of additive modeling ($K = 1$). For $K \ge 2$, the incorporation of empirical-norm penalties is relevant even when $p$ is relatively small against $n$, because the total number of component functions in (2) scales as $p^K$.
In this work, we are interested in doubly penalized ANOVA modeling (DPAM) in Yang & Tan (2021), using both the HTVs and empirical $L_2$ norms as penalties on the component functions in (2). To describe the method, the ANOVA decomposition (2) can be expressed in a more compact notation as
$$ f(x_1, \ldots, x_p) = f_0 + \sum_{S_1 : |S_1| = 1} f_{S_1} + \sum_{S_2 : |S_2| = 2} f_{S_2} + \cdots + \sum_{S_K : |S_K| = K} f_{S_K}, \tag{3} $$
where $S_k$ is a subset of size $k$ from $\{1, \ldots, p\}$ for $k = 1, \ldots, K$, and $f_{S_k} = f_{j_1, \ldots, j_k}(x_{j_1}, \ldots, x_{j_k})$ if $S_k = \{j_1, \ldots, j_k\}$. For identifiability, the component functions are assumed to be uniquely defined as $f_{S_k} = (\prod_{j \in S_k} (I - H_j) \prod_{j \notin S_k} H_j) f$, where $H_j$ is a marginalization operator over $x_j$, for example, defined by averaging over the $X_{ij}$'s in the training set. For $m \ge 1$, the HTV of differentiation order $m$, denoted as $\mathrm{HTV}_m$, is defined inductively in $m$. For example, for a univariate function $f = f(x_1)$, $\mathrm{HTV}_1(f) = \mathrm{TV}(f)$ and $\mathrm{HTV}_2(f) = \mathrm{HTV}_1(D_1 f)$, where $D_1$ is the differentiation operator in $x_1$ and TV is the standard total variation as in Mammen & Van De Geer (1997). For a bivariate function $f = f(x_1, x_2)$, the HTV with $m = 1$ or $2$ is considerably more complicated (even after ignoring scaling constants):
$$ \mathrm{HTV}_1(f) = \mathrm{TV}(f_{12}) + \mathrm{TV}(f_1) + \mathrm{TV}(f_2), $$
$$ \mathrm{HTV}_2(f) = \mathrm{HTV}_1(D_1 D_2 f_{12}) + \mathrm{HTV}_1(D_1 f_1) + \mathrm{HTV}_1(D_2 f_2), $$
where $(f_1, f_2, f_{12})$ are the component functions in the ANOVA decomposition (3). Nevertheless, suitable basis functions are derived in Yang & Tan (2021) to achieve the following properties in a set of multivariate splines, defined as the union of all tensor products of up to $K$ sets of univariate splines, depending on the differentiation order $m$ and pre-specified marginal knots over each coordinate. First, the ANOVA decomposition (3) for such a multivariate spline $f$ can be represented as
$$ f = \beta_0 + \sum_{k=1}^{K} \sum_{S_k : |S_k| = k} \Psi_{S_k}^{\mathrm T} \beta_{S_k}, \tag{4} $$
with $f_0 = \beta_0$ and $f_{S_k} = \Psi_{S_k}^{\mathrm T} \beta_{S_k}$, where $\Psi_{S_k}$ is the basis vector in $(x_{j_1}, \ldots, x_{j_k})$ and $\beta_{S_k}$ is the associated coefficient vector for $S_k = \{j_1, \ldots, j_k\}$ and $k = 1, \ldots, K$. Moreover, the HTV can be transformed into a Lasso representation, with $\mathrm{HTV}_m(f_{S_k}) = \|\Gamma_{S_k} \beta_{S_k}\|_1$:
$$ \mathrm{HTV}_m(f) = \sum_{k=1}^{K} \sum_{S_k : |S_k| = k} \|\Gamma_{S_k} \beta_{S_k}\|_1, \tag{5} $$
where $\|\cdot\|_1$ denotes the $L_1$ norm of a vector, $\rho_k$ is a scaling constant for the HTV of a $k$-variate component $f_{S_k}$, and $\Gamma_{S_k}$ is a diagonal matrix with each diagonal element either 0 or $\rho_k$ (with 0 indicating that the corresponding element of $\beta_{S_k}$ is not penalized, for example, the coefficient of a fully linear basis in the case of $m = 2$). For $m = 1$ or $2$, the univariate spline bases are piecewise constant or linear, and the resulting multivariate spline bases are, respectively, piecewise constant or cross-linear (i.e., piecewise linear in each coordinate with all other coordinates fixed). Readers are referred to Yang & Tan (2021) for further details, although such details are not required in our subsequent discussion.
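To make the bookkeeping in (4)–(5) concrete, here is a minimal numpy sketch (ours, not from the paper) of evaluating the Lasso form of the HTV penalty; the dictionary keys, variable names, and storage layout are illustrative assumptions.

```python
import numpy as np

def htv_penalty(beta, gamma):
    """Evaluate (5): sum over blocks S_k of ||Gamma_{S_k} beta_{S_k}||_1.

    beta:  dict mapping each block label S_k (e.g., a tuple of covariate
           indices) to its coefficient vector beta_{S_k}.
    gamma: dict mapping the same labels to the diagonal of Gamma_{S_k},
           with entries 0 (unpenalized coefficient) or rho_k.
    """
    return sum(np.sum(np.abs(gamma[S] * beta[S])) for S in beta)
```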
With the preceding background, linear DPAM is defined by solving
$$ \min_{\beta} \; \frac{1}{2} \| Y - f \|_n^2 + \sum_{k=1}^{K} \sum_{S_k : |S_k| = k} \big\{ \mathrm{HTV}_m(f_{S_k}) + \lambda_k \| f_{S_k} \|_n \big\}, $$
or, with the representations (4) and (5), by solving
$$ \min_{\beta} \; \frac{1}{2} \Big\| Y - \beta_0 - \sum_{k=1}^{K} \sum_{S_k : |S_k| = k} \Psi_{S_k}^{\mathrm T} \beta_{S_k} \Big\|_n^2 + \sum_{k=1}^{K} \sum_{S_k : |S_k| = k} \big\{ \| \Gamma_{S_k} \beta_{S_k} \|_1 + \lambda_k \| \Psi_{S_k}^{\mathrm T} \beta_{S_k} \|_n \big\}, \tag{6} $$
where $\beta$ consists of $\beta_0$ and $(\beta_{S_k})_{k, S_k}$, $\|\cdot\|_n$ denotes the empirical $L_2$ norm (or, in short, empirical norm), e.g., $\|f\|_n = \{ n^{-1} \sum_{i=1}^{n} f^2(X_i) \}^{1/2}$, and $\lambda_k$ is a tuning parameter for $k = 1, \ldots, K$. As reflected in the name DPAM, there are two penalty terms involved, the HTV $\|\Gamma_{S_k} \beta_{S_k}\|_1$ and the empirical norm $\lambda_k \|\Psi_{S_k}^{\mathrm T} \beta_{S_k}\|_n$, which play different roles in promoting sparsity; see Proposition 2 and related discussion. The HTV penalty promotes sparsity among the elements of $\beta_{S_k}$ (or the selection of basis functions) for each block $S_k$, whereas the empirical-norm penalty promotes sparsity among the coefficient vectors $\beta_{S_k}$ (or the component functions $f_{S_k}$) across $S_k$. The use of these two penalties is also instrumental in the theory for doubly penalized estimation in additive or ANOVA modeling (Tan & Zhang, 2019).
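As a reference point, the following numpy sketch (ours; names and data layout are assumptions) evaluates the linear DPAM objective (6) given precomputed basis matrices whose rows are $\Psi_{S_k}(X_i)^{\mathrm T}$.

```python
import numpy as np

def dpam_objective(Y, beta0, beta, Psi, gamma, lam):
    """Evaluate the linear DPAM objective (6).

    Y:        length-n response vector.
    beta0:    intercept; beta[S] is the coefficient vector for block S.
    Psi[S]:   n x d_S matrix whose i-th row is Psi_S evaluated at X_i.
    gamma[S]: diagonal of Gamma_S; lam[S] is the tuning parameter lambda_k
              for blocks of size k = |S|.
    """
    fit = np.full(len(Y), beta0, dtype=float)
    penalty = 0.0
    for S in beta:
        fS = Psi[S] @ beta[S]                          # f_S(X_i), i = 1..n
        fit += fS
        penalty += np.sum(np.abs(gamma[S] * beta[S]))  # HTV/Lasso term
        penalty += lam[S] * np.sqrt(np.mean(fS**2))    # empirical norm ||f_S||_n
    return 0.5 * np.mean((Y - fit)**2) + penalty
```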
For binary outcomes, replacing (1) by (nonparametric) logistic regression, $P(Y_i = 1 | X_i) = \{1 + \exp(-f(X_i))\}^{-1}$, and the square loss in (6) by the likelihood loss (or cross-entropy) leads to logistic DPAM, which solves
$$ \min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} l \Big( Y_i, \; \beta_0 + \sum_{k, S_k} \Psi_{i, S_k}^{\mathrm T} \beta_{S_k} \Big) + \sum_{k, S_k} \big\{ \| \Gamma_{S_k} \beta_{S_k} \|_1 + \lambda_k \| \Psi_{S_k}^{\mathrm T} \beta_{S_k} \|_n \big\}, \tag{7} $$
where $Y_i \in \{0, 1\}$, $\Psi_{i, S_k} = \Psi_{S_k}(X_{i, j_1}, \ldots, X_{i, j_k})$ for $S_k = \{j_1, \ldots, j_k\}$, and $l(y, f) = \log(1 + \exp(f)) - yf$ is the likelihood loss for logistic regression.
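For completeness, this likelihood loss can be evaluated in a numerically stable way; the one-liner below is our sketch, not code from the paper.

```python
import numpy as np

def logistic_loss(y, f):
    """l(y, f) = log(1 + exp(f)) - y*f for y in {0, 1}; np.logaddexp
    computes log(exp(0) + exp(f)) without overflow for large |f|."""
    return np.logaddexp(0.0, f) - y * f
```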
We develop new optimization algorithms for solving (6) and (7), including stochastic algorithms adaptive to large-scale scenarios (with large sample size $n$). Similarly as in the earlier literature (Hastie & Tibshirani, 1990), the top-level idea is backfitting or block minimization: the objective is optimized with respect to one block $\beta_{S_k}$ at a time while fixing the remaining blocks. This approach is valid for solving (6) and (7) due to the block-wise separability of the non-differentiable terms (Tseng, 1988). In Yang & Tan (2018, 2021), a backfitting algorithm, called AS-BDT (active-set block descent and thresholding), is proposed by exploiting two supportive properties. First, in the subproblem with respect to only $\beta_{S_k}$, the solution can be obtained by jointly soft-thresholding an (exact) solution to the Lasso subproblem with only the penalty $\| \Gamma_{S_k} \beta_{S_k} \|_1$; see Proposition 2. Second, the desired Lasso solution can be computed using an active-set descent algorithm (Osborne et al., 2000), which tends to be efficient when the training dataset is relatively small or the Lasso solution is sufficiently sparse within each block. However, AS-BDT is not designed to efficiently handle large datasets, especially in applications with non-sparse blocks.
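Schematically, the backfitting strategy can be sketched as follows (our illustrative Python sketch; `solve_block` is a placeholder for any single-block solver, such as AS-BDT's Lasso-plus-soft-thresholding step or the primal-dual algorithms developed here).

```python
import numpy as np

def backfit(Y, Psi, beta0, beta, solve_block, n_cycles=10):
    """Block minimization for (6): cycle through blocks, updating one
    beta_S at a time against the partial residual.

    solve_block(residual, Psi_S, S) should return an (approximate)
    minimizer of the single-block subproblem for block S.
    """
    fits = {S: Psi[S] @ beta[S] for S in beta}  # current component fits
    for _ in range(n_cycles):
        # update the intercept given the current component fits
        beta0 = np.mean(Y - sum(fits.values()))
        for S in beta:
            # partial residual: remove everything except block S
            residual = Y - beta0 - (sum(fits.values()) - fits[S])
            beta[S] = solve_block(residual, Psi[S], S)
            fits[S] = Psi[S] @ beta[S]
    return beta0, beta
```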
As a large-scale alternative to AS-BDT, we investigate primal-dual algorithms, including both batch and stochastic versions, for solving the single-block subproblems in (6) and (7). (We also study a new majorization-minimization algorithm, which is presented in the Supplement.) Primal-dual methods have been extensively studied for large-scale convex optimization, as shown in the book by Ryu & Yin (2022). However, existing applications of primal-dual algorithms are intractable in our setting, with both HTV/Lasso and empirical-norm penalties. In particular, SPDC (the stochastic primal-dual coordinate method) of Zhang & Xiao (2017) can be seen as a stochastic version of the Chambolle–Pock (CP) algorithm (Esser et al., 2010; Chambolle & Pock, 2011). But SPDC requires the objective to be split into an empirical risk term ($f$-term), for example, the square loss in (6), and a penalty term ($g$-term) such that its proximal mapping can be easily evaluated. Such a choice would combine the HTV/Lasso and empirical-norm penalties into a single penalty term, for which the proximal mapping cannot be easily evaluated.
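For orientation, below is a minimal sketch (ours; it is not the paper's algorithm) of batch Chambolle–Pock iterations for the prototypical split $\min_x f(Ax) + g(x)$, with $f$ the square loss and $g$ a weighted Lasso penalty only; the setting of this paper additionally carries the empirical-norm penalty, which such a plain split cannot absorb into an easy proximal mapping.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding: prox of the penalty t * |.|_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def chambolle_pock_lasso(Y, A, gamma, n_iters=500):
    """Batch CP iterations for min_x (1/(2n))||A x - Y||^2 + ||diag(gamma) x||_1.

    A:     n x d design matrix (e.g., a block basis matrix Psi_S).
    gamma: length-d vector of nonnegative penalty weights.
    Step sizes tau, sigma must satisfy tau * sigma * ||A||^2 <= 1.
    """
    n, d = A.shape
    L = np.linalg.norm(A, 2)            # spectral norm of A
    tau = sigma = 1.0 / L
    x = np.zeros(d); x_bar = x.copy(); y = np.zeros(n)
    for _ in range(n_iters):
        # dual step: prox of sigma * f* for f(z) = ||z - Y||^2 / (2n)
        y = (y + sigma * (A @ x_bar) - sigma * Y) / (1.0 + sigma * n)
        # primal step: prox of tau * g for g(x) = ||diag(gamma) x||_1
        x_new = soft_threshold(x - tau * (A.T @ y), tau * gamma)
        x_bar = 2.0 * x_new - x         # extrapolation with theta = 1
        x = x_new
    return x
```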