Block-wise Primal-dual Algorithms for Large-scale Doubly
Penalized ANOVA Modeling
Penghui Fu¹ and Zhiqiang Tan¹
October 21, 2022
Abstract. For multivariate nonparametric regression, doubly penalized ANOVA modeling (DPAM) has recently been proposed, using hierarchical total variations (HTVs) and empirical norms as penalties on the component functions such as main effects and multi-way interactions in a functional ANOVA decomposition of the underlying regression function. The two penalties play complementary roles: the HTV penalty promotes sparsity in the selection of basis functions within each component function, whereas the empirical-norm penalty promotes sparsity in the selection of component functions. We adopt backfitting or block minimization for training DPAM, and develop two suitable primal-dual algorithms, including both batch and stochastic versions, for updating each component function in single-block optimization. Existing applications of primal-dual algorithms are intractable in our setting with both HTV and empirical-norm penalties. Through extensive numerical experiments, we demonstrate the validity and advantage of our stochastic primal-dual algorithms, compared with their batch versions and a previous active-set algorithm, in large-scale scenarios.
Key words and phrases. ANOVA modeling; Nonparametric regression; Penalized estimation; Primal-dual algorithms; Stochastic algorithms; Stochastic gradient methods; Total variation.
¹Department of Statistics, Rutgers University. Address: 110 Frelinghuysen Road, Piscataway, NJ 08854. E-mails: penghui.fu@rutgers.edu, ztan@stat.rutgers.edu.
arXiv:2210.10991v1 [stat.CO] 20 Oct 2022
1 Introduction
Consider functional analysis-of-variance (ANOVA) modeling for multivariate nonparametric
regression (e.g., Gu, 2013). Let $Y_i$ and $X_i = (X_{i1}, \ldots, X_{ip})^{\mathrm T}$, $i = 1, \ldots, n$, be a collection of independent observations of a response variable and a covariate vector. For continuous responses, nonparametric regression can be stated as
$$ Y_i = f(X_{i1}, \ldots, X_{ip}) + \varepsilon_i, \tag{1} $$
where $f(x) = f(x_1, \ldots, x_p)$ is an unknown function, and $\varepsilon_i$ is a noise with mean zero and a finite variance given $X_i$. In the framework of functional ANOVA modeling, the multivariate function $f$ is decomposed as
$$ f(x_1, \ldots, x_p) = f_0 + \sum_{1 \le j_1 \le p} f_{j_1}(x_{j_1}) + \sum_{1 \le j_1 < j_2 \le p} f_{j_1, j_2}(x_{j_1}, x_{j_2}) + \cdots + \sum_{1 \le j_1 < \cdots < j_K \le p} f_{j_1, \ldots, j_K}(x_{j_1}, \ldots, x_{j_K}), \tag{2} $$
where $f_0$ is a constant, the $f_{j_1}$'s are univariate functions representing main effects, the $f_{j_1, j_2}$'s are bivariate functions representing two-way interactions, etc., and $K$ is the maximum order of interactions allowed. The special case of (1)–(2) with $K = 1$ is known as additive modeling (Stone, 1986; Hastie & Tibshirani, 1990). For general $K \ge 2$, a notable example is smoothing spline ANOVA modeling (Wahba et al., 1995; Lin & Zhang, 2006; Gu, 2013), where the component functions in (2) are assumed to lie in tensor-product reproducing kernel Hilbert spaces (RKHSs), defined from univariate Sobolev-$L_2$ spaces as in smoothing splines. Alternatively, in Yang & Tan (2021), a class of hierarchical total variations (HTVs) is introduced to measure the roughness of main effects and multi-way interactions by properly extending the total variation associated with the univariate Sobolev-$L_1$ space. Compared with smoothing spline ANOVA modeling, this approach extends univariate regression splines using total variation penalties (Mammen & Van De Geer, 1997).
Recently, theory and methods have been expanded for additive and ANOVA modeling to high-dimensional settings, with $p$ close to or greater than $n$. Examples include Ravikumar et al. (2009), Meier et al. (2009), Koltchinskii & Yuan (2010), Radchenko & James (2010), Raskutti et al. (2012), Petersen et al. (2016), Tan & Zhang (2019), and Yang & Tan (2018, 2021), among others. An important idea from the high-dimensional methods is to employ
empirical $L_2$ norms of component functions as penalties, in addition to functional semi-norms which measure the roughness of the component functions, such as the Sobolev-$L_2$ semi-norm or total variation for univariate functions in the case of additive modeling ($K = 1$). For $K \ge 2$, the incorporation of empirical-norm penalties is relevant even when $p$ is relatively small against $n$, because the total number of component functions in (2) scales as $p^K$.
In this work, we are interested in doubly penalized ANOVA modeling (DPAM) in Yang & Tan (2021), using both the HTVs and empirical $L_2$ norms as penalties on the component functions in (2). To describe the method, the ANOVA decomposition (2) can be expressed in a more compact notation as
$$ f(x_1, \ldots, x_p) = f_0 + \sum_{S_1 : |S_1| = 1} f_{S_1} + \sum_{S_2 : |S_2| = 2} f_{S_2} + \cdots + \sum_{S_K : |S_K| = K} f_{S_K}, \tag{3} $$
where $S_k$ is a subset of size $k$ from $\{1, \ldots, p\}$ for $k = 1, \ldots, K$, and $f_{S_k} = f_{j_1, \ldots, j_k}(x_{j_1}, \ldots, x_{j_k})$ if $S_k = \{j_1, \ldots, j_k\}$. For identifiability, the component functions are assumed to be uniquely defined as $f_{S_k} = (\prod_{j \in S_k} (I - H_j) \prod_{j \notin S_k} H_j) f$, where $H_j$ is a marginalization operator over $x_j$, for example, defined by averaging over the $X_{ij}$'s in the training set. For $m \ge 1$, the HTV of differentiation order $m$, denoted as $\mathrm{HTV}_m$, is defined inductively in $m$. For example, for a univariate function $f = f(x_1)$, $\mathrm{HTV}_1(f) = \mathrm{TV}(f)$ and $\mathrm{HTV}_2(f) = \mathrm{HTV}_1(D_1 f)$, where $D_1$ is the differentiation operator in $x_1$ and TV is the standard total variation as in Mammen & Van De Geer (1997). For a bivariate function $f = f(x_1, x_2)$, the HTV with $m = 1$ or $2$ is considerably more complicated (even after ignoring scaling constants):
$$ \mathrm{HTV}_1(f) = \mathrm{TV}(f_{12}) + \mathrm{TV}(f_1) + \mathrm{TV}(f_2), $$
$$ \mathrm{HTV}_2(f) = \mathrm{HTV}_1(D_1 D_2 f_{12}) + \mathrm{HTV}_1(D_1 f_1) + \mathrm{HTV}_1(D_2 f_2), $$
where $(f_1, f_2, f_{12})$ are the component functions in the ANOVA decomposition (3). Nevertheless, suitable basis functions are derived in Yang & Tan (2021) to achieve the following properties in a set of multivariate splines, defined as the union of all tensor products of up to $K$ sets of univariate splines, depending on the differentiation order $m$ and pre-specified marginal knots over each coordinate. First, the ANOVA decomposition (3) for such a multivariate spline $f$ can be represented as
$$ f = \beta_0 + \sum_{k=1}^{K} \sum_{S_k : |S_k| = k} \Psi_{S_k}^{\mathrm T} \beta_{S_k}, \tag{4} $$
with $f_0 = \beta_0$ and $f_{S_k} = \Psi_{S_k}^{\mathrm T} \beta_{S_k}$, where $\Psi_{S_k}$ is the basis vector in $(x_{j_1}, \ldots, x_{j_k})$ and $\beta_{S_k}$ is the associated coefficient vector for $S_k = \{j_1, \ldots, j_k\}$ and $k = 1, \ldots, K$. Moreover, the HTV can be transformed into a Lasso representation, with $\mathrm{HTV}_m(f_{S_k}) = \|\Gamma_{S_k} \beta_{S_k}\|_1$:
$$ \mathrm{HTV}_m(f) = \sum_{k=1}^{K} \sum_{S_k : |S_k| = k} \|\Gamma_{S_k} \beta_{S_k}\|_1, \tag{5} $$
where $\|\cdot\|_1$ denotes the $L_1$ norm of a vector, $\rho_k$ is a scaling constant for the HTV of a $k$-variate component $f_{S_k}$, and $\Gamma_{S_k}$ is a diagonal matrix with each diagonal element either 0 or $\rho_k$ (with 0 indicating that the corresponding element of $\beta_{S_k}$ is not penalized, for example, the coefficient of a fully linear basis in the case of $m = 2$). For $m = 1$ or $2$, the univariate spline bases are piecewise constant or linear, and the resulting multivariate spline bases are, respectively, piecewise constant or cross-linear (i.e., piecewise linear in each coordinate with all other coordinates fixed). Readers are referred to Yang & Tan (2021) for further details, although such details are not required in our subsequent discussion.
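To make the bookkeeping in (4)–(5) concrete, here is a minimal numpy sketch (ours, not from the paper) of evaluating the Lasso form of the HTV penalty; the dictionary keys, variable names, and storage layout are illustrative assumptions.

```python
import numpy as np

def htv_penalty(beta, gamma):
    """Evaluate (5): sum over blocks S_k of ||Gamma_{S_k} beta_{S_k}||_1.

    beta:  dict mapping each block label S_k (e.g., a tuple of covariate
           indices) to its coefficient vector beta_{S_k}.
    gamma: dict mapping the same labels to the diagonal of Gamma_{S_k},
           with entries 0 (unpenalized coefficient) or rho_k.
    """
    return sum(np.sum(np.abs(gamma[S] * beta[S])) for S in beta)
```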
With the preceding background, linear DPAM is defined by solving
$$ \min_{\beta} \; \frac{1}{2} \| Y - f \|_n^2 + \sum_{k=1}^{K} \sum_{S_k : |S_k| = k} \big\{ \mathrm{HTV}_m(f_{S_k}) + \lambda_k \| f_{S_k} \|_n \big\}, $$
or, with the representations (4) and (5), by solving
$$ \min_{\beta} \; \frac{1}{2} \Big\| Y - \beta_0 - \sum_{k=1}^{K} \sum_{S_k : |S_k| = k} \Psi_{S_k}^{\mathrm T} \beta_{S_k} \Big\|_n^2 + \sum_{k=1}^{K} \sum_{S_k : |S_k| = k} \big\{ \| \Gamma_{S_k} \beta_{S_k} \|_1 + \lambda_k \| \Psi_{S_k}^{\mathrm T} \beta_{S_k} \|_n \big\}, \tag{6} $$
where $\beta$ consists of $\beta_0$ and $(\beta_{S_k})_{k, S_k}$, $\|\cdot\|_n$ denotes the empirical $L_2$ norm (or, in short, empirical norm), e.g., $\|f\|_n = \{ n^{-1} \sum_{i=1}^{n} f^2(X_i) \}^{1/2}$, and $\lambda_k$ is a tuning parameter for $k = 1, \ldots, K$. As reflected in the name DPAM, there are two penalty terms involved, the HTV $\|\Gamma_{S_k} \beta_{S_k}\|_1$ and the empirical norm $\lambda_k \|\Psi_{S_k}^{\mathrm T} \beta_{S_k}\|_n$, which play different roles in promoting sparsity; see Proposition 2 and related discussion. The HTV penalty promotes sparsity among the elements of $\beta_{S_k}$ (or the selection of basis functions) for each block $S_k$, whereas the empirical-norm penalty promotes sparsity among the coefficient vectors $\beta_{S_k}$ (or the component functions $f_{S_k}$) across $S_k$. The use of these two penalties is also instrumental in the theory for doubly penalized estimation in additive or ANOVA modeling (Tan & Zhang, 2019).
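As a reference point, the following numpy sketch (ours; names and data layout are assumptions) evaluates the linear DPAM objective (6) given precomputed basis matrices whose rows are $\Psi_{S_k}(X_i)^{\mathrm T}$.

```python
import numpy as np

def dpam_objective(Y, beta0, beta, Psi, gamma, lam):
    """Evaluate the linear DPAM objective (6).

    Y:        length-n response vector.
    beta0:    intercept; beta[S] is the coefficient vector for block S.
    Psi[S]:   n x d_S matrix whose i-th row is Psi_S evaluated at X_i.
    gamma[S]: diagonal of Gamma_S; lam[S] is the tuning parameter lambda_k
              for blocks of size k = |S|.
    """
    fit = np.full(len(Y), beta0, dtype=float)
    penalty = 0.0
    for S in beta:
        fS = Psi[S] @ beta[S]                          # f_S(X_i), i = 1..n
        fit += fS
        penalty += np.sum(np.abs(gamma[S] * beta[S]))  # HTV/Lasso term
        penalty += lam[S] * np.sqrt(np.mean(fS**2))    # empirical norm ||f_S||_n
    return 0.5 * np.mean((Y - fit)**2) + penalty
```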
For binary outcomes, replacing (1) by (nonparametric) logistic regression, $P(Y_i = 1 | X_i) = \{1 + \exp(-f(X_i))\}^{-1}$, and the square loss in (6) by the likelihood loss (or cross-entropy) leads to logistic DPAM, which solves
$$ \min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} l \Big( Y_i, \; \beta_0 + \sum_{k, S_k} \Psi_{i, S_k}^{\mathrm T} \beta_{S_k} \Big) + \sum_{k, S_k} \big\{ \| \Gamma_{S_k} \beta_{S_k} \|_1 + \lambda_k \| \Psi_{S_k}^{\mathrm T} \beta_{S_k} \|_n \big\}, \tag{7} $$
where $Y_i \in \{0, 1\}$, $\Psi_{i, S_k} = \Psi_{S_k}(X_{i, j_1}, \ldots, X_{i, j_k})$ for $S_k = \{j_1, \ldots, j_k\}$, and $l(y, f) = \log(1 + \exp(f)) - yf$ is the likelihood loss for logistic regression.
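For completeness, this likelihood loss can be evaluated in a numerically stable way; the one-liner below is our sketch, not code from the paper.

```python
import numpy as np

def logistic_loss(y, f):
    """l(y, f) = log(1 + exp(f)) - y*f for y in {0, 1}; np.logaddexp
    computes log(exp(0) + exp(f)) without overflow for large |f|."""
    return np.logaddexp(0.0, f) - y * f
```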
We develop new optimization algorithms for solving (6) and (7), including stochastic algorithms adaptive to large-scale scenarios (with large sample size $n$). Similarly as in the earlier literature (Hastie & Tibshirani, 1990), the top-level idea is backfitting or block minimization: the objective is optimized with respect to one block $\beta_{S_k}$ at a time while fixing the remaining blocks. This approach is valid for solving (6) and (7) due to the block-wise separability of the non-differentiable terms (Tseng, 1988). In Yang & Tan (2018, 2021), a backfitting algorithm, called AS-BDT (active-set block descent and thresholding), is proposed by exploiting two supportive properties. First, in the subproblem with respect to only $\beta_{S_k}$, the solution can be obtained by jointly soft-thresholding an (exact) solution to the Lasso subproblem with only the penalty $\| \Gamma_{S_k} \beta_{S_k} \|_1$; see Proposition 2. Second, the desired Lasso solution can be computed using an active-set descent algorithm (Osborne et al., 2000), which tends to be efficient when the training dataset is relatively small or the Lasso solution is sufficiently sparse within each block. However, AS-BDT is not designed to efficiently handle large datasets, especially in applications with non-sparse blocks.
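Schematically, the backfitting strategy can be sketched as follows (our illustrative Python sketch; `solve_block` is a placeholder for any single-block solver, such as AS-BDT's Lasso-plus-soft-thresholding step or the primal-dual algorithms developed here).

```python
import numpy as np

def backfit(Y, Psi, beta0, beta, solve_block, n_cycles=10):
    """Block minimization for (6): cycle through blocks, updating one
    beta_S at a time against the partial residual.

    solve_block(residual, Psi_S, S) should return an (approximate)
    minimizer of the single-block subproblem for block S.
    """
    fits = {S: Psi[S] @ beta[S] for S in beta}  # current component fits
    for _ in range(n_cycles):
        # update the intercept given the current component fits
        beta0 = np.mean(Y - sum(fits.values()))
        for S in beta:
            # partial residual: remove everything except block S
            residual = Y - beta0 - (sum(fits.values()) - fits[S])
            beta[S] = solve_block(residual, Psi[S], S)
            fits[S] = Psi[S] @ beta[S]
    return beta0, beta
```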
As a large-scale alternative to AS-BDT, we investigate primal-dual algorithms, including both batch and stochastic versions, for solving the single-block subproblems in (6) and (7). (We also study a new majorization-minimization algorithm, which is presented in the Supplement.) Primal-dual methods have been extensively studied for large-scale convex optimization, as shown in the book by Ryu & Yin (2022). However, existing applications of primal-dual algorithms are intractable in our setting, with both HTV/Lasso and empirical-norm penalties. In particular, SPDC (the stochastic primal-dual coordinate method) of Zhang & Xiao (2017) can be seen as a stochastic version of the Chambolle–Pock (CP) algorithm (Esser et al., 2010; Chambolle & Pock, 2011). But SPDC requires the objective to be split into an empirical risk term ($f$-term), for example, the square loss in (6), and a penalty term ($g$-term) such that its proximal mapping can be easily evaluated. Such a choice would combine the HTV/Lasso and empirical-norm penalties into a single penalty term, for which the proximal mapping cannot be easily evaluated.
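For orientation, below is a minimal sketch (ours; it is not the paper's algorithm) of batch Chambolle–Pock iterations for the prototypical split $\min_x f(Ax) + g(x)$, with $f$ the square loss and $g$ a weighted Lasso penalty only; the setting of this paper additionally carries the empirical-norm penalty, which such a plain split cannot absorb into an easy proximal mapping.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding: prox of the penalty t * |.|_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def chambolle_pock_lasso(Y, A, gamma, n_iters=500):
    """Batch CP iterations for min_x (1/(2n))||A x - Y||^2 + ||diag(gamma) x||_1.

    A:     n x d design matrix (e.g., a block basis matrix Psi_S).
    gamma: length-d vector of nonnegative penalty weights.
    Step sizes tau, sigma must satisfy tau * sigma * ||A||^2 <= 1.
    """
    n, d = A.shape
    L = np.linalg.norm(A, 2)            # spectral norm of A
    tau = sigma = 1.0 / L
    x = np.zeros(d); x_bar = x.copy(); y = np.zeros(n)
    for _ in range(n_iters):
        # dual step: prox of sigma * f* for f(z) = ||z - Y||^2 / (2n)
        y = (y + sigma * (A @ x_bar) - sigma * Y) / (1.0 + sigma * n)
        # primal step: prox of tau * g for g(x) = ||diag(gamma) x||_1
        x_new = soft_threshold(x - tau * (A.T @ y), tau * gamma)
        x_bar = 2.0 * x_new - x         # extrapolation with theta = 1
        x = x_new
    return x
```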