Ensemble Projection Pursuit for General Nonparametric Regression

arXiv:2210.14467v2 [math.ST] 18 May 2023
Submitted to the Annals of Statistics
ENSEMBLE PROJECTION PURSUIT FOR GENERAL NONPARAMETRIC
REGRESSION
BY HAORAN ZHAN$^a$, MINGKE ZHANG$^b$ AND YINGCUN XIA$^c$
Department of Statistics and Data Science, National University of Singapore
$^a$haoran.zhan@u.nus.edu, $^b$mingke.zhang@u.nus.edu, $^c$yingcun.xia@nus.edu.sg
The projection pursuit regression (PPR) has played an important role in
the development of statistics and machine learning. However, when compared
to other established methods like random forests (RF) and support vector
machines (SVM), PPR has yet to showcase a similar level of accuracy as a
statistical learning technique. In this paper, we revisit the estimation of PPR
and propose an optimal greedy algorithm and an ensemble approach via "fea-
ture bagging", hereafter referred to as ePPR, aiming to improve the efficacy.
Compared to RF, ePPR has two main advantages. Firstly, its theoretical con-
sistency can be proved for more general regression functions as long as they
are $L_2$-integrable, and higher consistency rates can be achieved. Secondly,
ePPR does not split the samples, and thus each term of PPR is estimated us-
ing the whole data, making the minimization more efficient and guaranteeing
the smoothness of the estimator. Extensive comparisons based on real data
sets show that ePPR is more efficient in regression and classification than
RF and other competitors. The efficacy of ePPR, a variant of Artificial Neu-
ral Networks (ANN), demonstrates that with suitable statistical tuning, ANN
can equal or even exceed RF in dealing with small to medium-sized datasets.
This revelation challenges the widespread belief that ANN’s superiority over
RF is limited to processing extensive sample sizes.

MSC2020 subject classifications: Primary 62G08; secondary 62G20.
Keywords and phrases: convergence of algorithm, consistency of estimation, greedy algorithm, nonparametric smoothing, projection pursuit regression, random forest.
1. Introduction. The technique of projection pursuit, which was first introduced by [30]
and [21], aims to reveal the most significant structures within high-dimensional data by se-
quentially utilizing lower-dimensional projections. Once a projection has been identified, the
structure that lies along it is removed, and the procedure is then reiterated to detect the next
projection. This idea was later implemented for regression by [20], who referred to this tech-
nique as the projection pursuit regression (PPR) and popularized it in the 1980s. PPR is well-
suited for estimating general nonparametric regressions due to its capability to approximate
any general function [27,34,35].
More specifically, consider the general regression problem with response $Y$ and predictor $X \in \mathbb{R}^p$. Our main interest lies in the conditional mean, $m(X) = E(Y \mid X)$. Following the categorization of two cultures of [6], the projection pursuit regression can be used in two ways. The first way to use PPR is by assuming the model $m(X) = g_1(\theta_1^\top X) + \dots + g_K(\theta_K^\top X)$, where $K$ is assumed to be a fixed and finite, yet unknown, number. See, for example, [12], [23] and [44]. It is evident that this statistical model may not be appropriate for real-world data and is not recommended by [6]. The second way to use PPR is by approximating or estimating a general nonparametric regression based on the following fact: for any smooth function $m$, there exists a sequence of projection directions $\{\theta_k \in \mathbb{R}^p : k = 1, 2, \dots\}$ and a sequence of ridge functions $\{g_k(\cdot), k = 1, 2, \dots\}$ satisfying

(1.1)   PPR expansion:  $m(X) = \sum_{k=1}^{\infty} g_k(\theta_k^\top X)$  a.s.

Under mild conditions, [34] and [35] established that the PPR expansion holds for a general function $m(x)$ in the $L_2$ sense, and the PPR algorithm proposed in [20] is an implementation of this expansion.
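To fix ideas, here is a minimal numerical sketch (ours, not from the paper) of a two-term truncation of the expansion (1.1); the directions and ridge functions below are arbitrary illustrative choices, not quantities estimated by any algorithm in the paper.

```python
import numpy as np

# Illustrative truncated PPR expansion m(x) ~ g_1(theta_1'x) + g_2(theta_2'x):
# the directions and ridge functions below are arbitrary examples.
theta = np.array([[1.0, 1.0, 0.0],
                  [1.0, -1.0, 2.0]])
theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # unit-norm directions

ridge_funcs = [np.tanh, np.sin]                          # g_1, g_2

def ppr_expansion(X):
    """Evaluate the two-term ridge expansion at the rows of X (n x 3)."""
    projections = X @ theta.T                            # n x 2 matrix of theta_k'x
    return sum(g(projections[:, k]) for k, g in enumerate(ridge_funcs))

X = np.random.default_rng(0).normal(size=(5, 3))
print(ppr_expansion(X))
```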
The PPR expansion can be estimated with two different approaches: global minimization and the greedy algorithm. Global minimization minimizes the loss function over all the ridge functions and projection directions simultaneously, i.e.,

$$\min_{g_k, \theta_k,\ k=1,2,\dots} E\Big( Y - \sum_{k=1}^{\infty} g_k(\theta_k^\top X) \Big)^2.$$
This is the classical estimation method in statistics and machine learning, including the estimation of artificial neural networks (ANN). In practice, the number of terms in the expansion must be finite and is allowed to grow to infinity with the sample size. With i.i.d. data $\{(X_i, Y_i)\}_{i=1}^{n}$, the estimator based on global minimization is $\hat m_n(X) = \sum_{k=1}^{K_n} \hat g_k(\hat\theta_k^\top X)$, where

$$(\hat g_k, \hat\theta_k) = \arg\min_{g_k \in V_n,\ \theta_k \in \Theta_p} \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - \sum_{k=1}^{K_n} g_k(\theta_k^\top X_i) \Big)^2$$

and $V_n$ is an approximation function space which will be discussed later. Under some regularity conditions, we can show that $\hat m_n$ achieves a nearly optimal consistency rate, i.e.,

$$E\big(\hat m_n(X) - m(X)\big)^2 \le c\,(\ln^4 n / n)^{2r/(2r+p)},$$

where $c > 0$ and $r$ is the degree of smoothness of $m$; see the SUPPLEMENTARY MATERIALS (SM) for the proof.
(SM) for the proof. This method has good statistical asymptotic properties, but in practice, its
advantages are hardly observed, perhaps because of the large number of parameters required
for estimation.
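For concreteness, the sketch below illustrates what a (toy) global minimization looks like: all directions and ridge-function coefficients are optimized jointly by least squares, with each $g_k$ represented by a low-degree polynomial. This is our own simplified setup; it is not the estimator, the approximation space $V_n$, or the tuning used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, K = 200, 4, 2          # sample size, dimension, number of ridge terms (K_n)
deg = 3                      # each ridge function g_k is a cubic polynomial here

X = rng.normal(size=(n, p))
y = np.sin(X[:, 0] + X[:, 1]) + 0.5 * (X[:, 2] - X[:, 3]) ** 2 + 0.1 * rng.normal(size=n)

def unpack(w):
    """Split the parameter vector into unit directions and polynomial coefficients."""
    thetas = w[:K * p].reshape(K, p)
    thetas = thetas / (np.linalg.norm(thetas, axis=1, keepdims=True) + 1e-12)
    coefs = w[K * p:].reshape(K, deg + 1)
    return thetas, coefs

def model(w, X):
    thetas, coefs = unpack(w)
    proj = X @ thetas.T                                              # n x K projections
    basis = np.stack([proj ** d for d in range(deg + 1)], axis=-1)   # n x K x (deg+1)
    return np.einsum("nkd,kd->n", basis, coefs)

def loss(w):
    return np.mean((y - model(w, X)) ** 2)

w0 = rng.normal(scale=0.1, size=K * p + K * (deg + 1))
fit = minimize(loss, w0, method="L-BFGS-B")   # joint minimization over all parameters
print("training MSE:", loss(fit.x))
```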
Instead of estimating $g_k(\theta_k^\top x)$, $k = 1, 2, \dots$, together, the greedy algorithm estimates the PPR expansion term by term in a sequential manner; this approach was originally introduced by [20] in their PPR estimation procedure, as described below. Let

(1.2)   $\theta_1 = \arg\min_{\theta : \|\theta\|_2 = 1} E\big(Y - E(Y \mid X^\top\theta)\big)^2$

and $g_1(v) = E(Y \mid X^\top\theta_1 = v)$. For $k \ge 2$, let the residual $\delta_k = Y - \sum_{\iota=1}^{k-1} g_\iota(X^\top\theta_\iota)$, and calculate

(1.3)   $\theta_k = \arg\min_{\theta : \|\theta\|_2 = 1} E\big(\delta_k - E(\delta_k \mid X^\top\theta)\big)^2$

and $g_k(v) = E(\delta_k \mid X^\top\theta_k = v)$. [26] proved that the PPR algorithm converges in the sense that

(1.4)   $E\Big[\Big(m(X) - \sum_{k=1}^{K} g_k(X^\top\theta_k)\Big)^2\Big] \to 0, \quad \text{as } K \to \infty.$
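A rough sample analogue of the term-by-term scheme (1.2)-(1.3) is sketched below, with a crude polynomial regression standing in for the conditional-expectation smoother and a random search standing in for the direction optimization; it is meant only to convey the structure of the greedy fit, not the algorithm of [20] or the AGA proposed later.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 300, 5, 3
X = rng.normal(size=(n, p))
y = np.exp(X[:, 0] - X[:, 1]) / 3 + np.sin(X[:, 2] + X[:, 3]) + 0.1 * rng.normal(size=n)

def smooth_1d(t, r, deg=4):
    """Crude ridge-function estimate of E(r | t): polynomial regression in t."""
    B = np.vander(t, deg + 1)
    coef, *_ = np.linalg.lstsq(B, r, rcond=None)
    return lambda s: np.vander(s, deg + 1) @ coef

residual = y.copy()
terms = []                       # list of (theta, fitted ridge function)
for k in range(K):
    best = None
    for _ in range(500):         # random search over unit directions, cf. (1.3)
        theta = rng.normal(size=p)
        theta /= np.linalg.norm(theta)
        g = smooth_1d(X @ theta, residual)
        sse = np.sum((residual - g(X @ theta)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, theta, g)
    _, theta, g = best
    terms.append((theta, g))
    residual = residual - g(X @ theta)   # update the residual, cf. delta_{k+1}

print("residual variance after", K, "terms:", residual.var())
```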
In the literature, various improvements to the above greedy algorithm have been proposed. For instance, [27] proposed the relaxed greedy algorithm (RGA) as a means of accelerating the convergence rate of the PPR algorithm. In RGA, the approximation obtained in step $k$ takes the form $\alpha \cdot \sum_{\iota=1}^{k-1} g_\iota(X^\top\theta_\iota) + (1-\alpha) \cdot g_k(X^\top\theta_k)$ for some $\alpha \in (0, 1)$. The orthogonal greedy algorithm (OGA), which is well known in statistical learning and signal processing (see [9] and [3]), is another improvement. In OGA, once $g_k(X^\top\theta_k)$ is obtained in the $k$-th iteration, the approximation is updated by projecting $m(X)$ onto the linear space spanned by $\{g_\iota(X^\top\theta_\iota)\}_{\iota=1}^{k}$.
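The OGA update reduces, in its sample form, to a least-squares refit of the scalar coefficients on the ridge terms obtained so far. A minimal sketch of that single projection step is given below; the arrays G and y are placeholders for the evaluated ridge terms and the responses, not objects defined in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 4
# Placeholder data: the columns of G stand for the fitted ridge terms g_i(X'theta_i)
# evaluated at the n sample points; y stands for the responses.
G = rng.normal(size=(n, k))
y = G @ np.array([1.0, -0.5, 0.3, 0.0]) + 0.1 * rng.normal(size=n)

# OGA projection step: least-squares projection of y onto span{g_1, ..., g_k}.
beta, *_ = np.linalg.lstsq(G, y, rcond=None)
approximation = G @ beta
print("coefficients after projection:", np.round(beta, 3))
```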
Although PPR has good algorithmic and mathematical guarantees, it has not gained popularity among researchers, nor has it been extensively studied. Existing computer packages for PPR do not provide good prediction accuracy in comparison with other popular methods such as RF and the support vector machine (SVM) [15]. [24] suggested that PPR's limited popularity was due to its computational complexity during the 1980s, when computational power was significantly lower than it is today. As a result, the artificial neural network (ANN) became the more popular choice, although the two methods share some similarities.
tremendous developments in computational power and algorithms, there should be a growing
interest in studying PPR more comprehensively. This is the primary motivation behind this
paper.
The purpose of this paper is to investigate the greedy algorithm of PPR approximation
and its convergence theory. To further improve the prediction accuracy of PPR, we propose a
boosting method, hereafter called ensemble PPR (ePPR). The fundamental idea is to search for the projections in (1.2) and (1.3) within a randomly selected subset of the components of X, similarly to how RF selects split variables [7]; this is also known as "feature bagging" [25]. Thus, each
run of the algorithm produces a random PPR approximation of the unknown regression func-
tion. The ensemble technique is then employed to generate a final estimate of the regression
function.
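To make the ensemble idea concrete before its formal definition in Section 4, the following schematic sketch (our own simplification, not the ePPR algorithm of this paper) fits several greedy projection-pursuit learners, each restricted to a random subset of the coordinates of X, and averages their predictions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 6
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0] + X[:, 1]) + (X[:, 2] ** 2) / 4 + 0.1 * rng.normal(size=n)

def fit_greedy_ppr(X, y, n_terms=3, n_dirs=200, deg=4, rng=rng):
    """Toy greedy PPR fit with polynomial ridge smoothers; returns a predictor."""
    n, p = X.shape
    residual, terms = y.copy(), []
    for _ in range(n_terms):
        best = None
        for _ in range(n_dirs):                     # random search over directions
            theta = rng.normal(size=p)
            theta /= np.linalg.norm(theta)
            B = np.vander(X @ theta, deg + 1)
            coef, *_ = np.linalg.lstsq(B, residual, rcond=None)
            sse = np.sum((residual - B @ coef) ** 2)
            if best is None or sse < best[0]:
                best = (sse, theta, coef)
        _, theta, coef = best
        terms.append((theta, coef))
        residual = residual - np.vander(X @ theta, deg + 1) @ coef
    def predict(Xnew):
        return sum(np.vander(Xnew @ th, deg + 1) @ cf for th, cf in terms)
    return predict

# "Feature bagging": each base PPR only sees a random subset of the coordinates.
n_learners, subset_size = 10, 3
predictors = []
for b in range(n_learners):
    idx = rng.choice(p, size=subset_size, replace=False)
    predictors.append((idx, fit_greedy_ppr(X[:, idx], y)))

def ensemble_predict(Xnew):
    # Average the base learners to obtain the ensemble prediction.
    return np.mean([pred(Xnew[:, idx]) for idx, pred in predictors], axis=0)

print("in-sample MSE:", np.mean((y - ensemble_predict(X)) ** 2))
```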
1.1. Related work and our contribution. Our work consists of two parts: a greedy algorithm for PPR and a boosting method for a statistical implementation of PPR. Our contributions are summarized as follows.

• An optimal greedy algorithm, called the Additive Greedy Algorithm (AGA). The algorithm is motivated by the orthogonal greedy algorithm (OGA) [33] and by the additive structure of the expansion. We show that AGA achieves the optimal convergence rate of greedy algorithms and is faster than the existing algorithms for the PPR expansion (see [3]).

• An ensemble estimator, called ensemble PPR (ePPR). This is partly motivated by the RF of [6]; see also [47] and [1] for recent developments of RF. ePPR parallels one of the core elements of RF: the former uses recursive projection where the latter uses recursive partitioning. ePPR also borrows two other core elements of RF: random selection of predictors, also known as feature bagging, and averaging of the individual estimates to form the ensemble.

• Advantages over RF, the benchmark for nonparametric regression based on greedy algorithms. These advantages include: (i) As noted by [18], RF has limited ability to fit smooth signals because of the step functions inherited from the classification and regression tree (CART). In contrast, PPR does not split the samples and uses the whole data set in each step, making the minimization more efficient and guaranteeing a smooth estimator. (ii) Moreover, while RF theoretically works only for some special classes of regression functions (see, for example, [37] and [14]), PPR and ePPR can handle more general regression functions as long as the underlying function is $L_2$-integrable. (iii) Additionally, ePPR attains a higher consistency rate than RF, whose rates are established in [28] for additive models and in [14] for slightly more general regression functions.

• Our work also sheds light on the long-standing debate about when ANN or RF has the advantage. As a special variant of ANN, ePPR suggests that ANN can perform as well as or even better than RF on small to medium-sized data sets when statistical tuning is incorporated. This contradicts the common understanding that ANN has an advantage only for data with large sample sizes.
1.2. Organization of this paper. The rest of this paper is organized as follows. Section 2 introduces notation, proposes the greedy algorithm AGA, and studies its convergence rate. Section 3 considers the consistency of statistical estimators based on AGA. Details of ePPR are presented in Section 4, where its statistical consistency is also studied. Extensive numerical studies on real data sets, including comparisons with popular machine learning methods such as RF and SVM, are presented in Section 5. The technical proofs are provided in Section 6, while some auxiliary results and lemmas used in the proofs are given in the SUPPLEMENTARY MATERIALS (SM) in a separate file. A brief discussion of the problems with the proposed ePPR is given in Section 7.
2. PPR with greedy algorithm. A greedy algorithm is always associated with a dictionary $\mathrm{Dic} \subset H$, where $H$ is a Banach space equipped with norm $\|\cdot\|$. For any dictionary $\mathrm{Dic}$, the convex hull

$$B_1(\mathrm{Dic}) := \Big\{ \sum_{j=1}^{m} a_j d_j : d_j \in \mathrm{Dic},\ \sum_{j=1}^{m} |a_j| \le 1,\ m \in \mathbb{Z}_+ \Big\}$$

is a function class generated by $\mathrm{Dic}$ and is commonly used to analyze the convergence rate of greedy algorithms (see [3] and [40]). Then, by scaling $B_1(\mathrm{Dic})$, we obtain a linear sub-space of $H$,

(2.1)   $[H]_1(\mathrm{Dic}) := \bigcup_{c>0} c \cdot B_1(\mathrm{Dic}),$

where $c \cdot B_1(\mathrm{Dic}) := \{c \cdot \phi : \phi \in B_1(\mathrm{Dic})\}$. Therefore, for any $f \in [H]_1(\mathrm{Dic})$, the scale needed to cover $f$ by $B_1(\mathrm{Dic})$ is

$$\|f\|_{[H]_1(\mathrm{Dic})} = \inf\{c > 0 : f \in c \cdot B_1(\mathrm{Dic})\}.$$

[39] proved that $[H]_1(\mathrm{Dic})$ is also a Banach space w.r.t. the norm $\|f\|_{[H]_1(\mathrm{Dic})}$ if $\sup_{d \in \mathrm{Dic}} \|d\| < \infty$. Note that $\|f\|_{[H]_1(\mathrm{Dic})}$ depends on $\mathrm{Dic}$ and is called the variation norm or the atomic norm in the literature.
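As a simple illustration of the variation norm (our own example, not from the paper): take $\mathrm{Dic} = \{d_1, d_2\}$ with $d_1, d_2$ linearly independent and $f = 3d_1 - 2d_2$. Then $f/5 = \tfrac{3}{5}d_1 - \tfrac{2}{5}d_2 \in B_1(\mathrm{Dic})$, so

$$\|f\|_{[H]_1(\mathrm{Dic})} = \inf\{c > 0 : f \in c \cdot B_1(\mathrm{Dic})\} = |3| + |-2| = 5,$$

with equality because any representation over $\{d_1, d_2\}$ must assign total weight $3$ to $d_1$ and $-2$ to $d_2$.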
Note that if we use activation functions to approximate the ridge functions in PPR, some of the activation functions share the same direction; thus the dictionary associated with AGA is

(2.2)   $\mathrm{Dic}^s_n := \Big\{ \sum_{j=1}^{J_n} a_j \cdot \sigma_s(\theta^\top x + b_j) : \theta \in \Theta_p,\ b_j \in [-2, 2],\ \sum_{j=1}^{J_n} |a_j| \le 1 \Big\},$

where $\sigma_s(\cdot)$ is an activation function to be discussed later, and the subscript $n$ indicates that the choice of $J_n$ depends on the sample size $n$. Since $\mathrm{Dic}^s_n$ is designed to approximate $m(x)$, $x \in B^p_1$, with bounded support, we will show that a bounded range for the $b_j$ in (2.2) is sufficient, including the choice $[-2, 2]$ in (2.2). Meanwhile, the restriction $\sum_{j=1}^{J_n} |a_j| \le 1$ normalizes the possible search directions in $[H]_1(\mathrm{Dic}^s_n)$. In this case, let $H = L_2(B^p_1, X) := \{f : E_X(f^2(X)) < \infty\}$. Then, it is known that $L_2(B^p_1, X)$ is a Hilbert space equipped with norm $\|f\| := E_X(f^2(X))^{1/2}$ and inner product $\langle f, g \rangle := (\|f+g\|^2 - \|f-g\|^2)/4$ for any $f, g \in L_2(B^p_1, X)$. With the above notation, our AGA in population version can be stated as follows.
PPR.aga (population version). Starting with iteration $k = 0$, let $\delta_0 = m$ and $m^{\mathrm{aga}}_0 = 0$. For each $k \ge 1$, calculate

(2.3)   $\sum_{j=1}^{J_n} a_{k,j} \cdot \sigma_s(\theta_k^\top X + b_{k,j}) = \arg\max_{d \in \mathrm{Dic}^s_n} |\langle \delta_{k-1}, d \rangle|,$

where $|\langle \delta_{k-1}, d \rangle|$ is defined above and measures the similarity between $\delta_{k-1}$ and $d \in \mathrm{Dic}^s_n$. Another method to find the above parameters is by residual approximation:

(2.4)   $\sum_{j=1}^{J_n} a_{k,j} \cdot \sigma_s(\theta_k^\top X + b_{k,j}) = \arg\min_{h \in [L_2(B^p_1, X)]_1(\mathrm{Dic}^s_n)} \|\delta_{k-1} - h\|.$

Then, update the approximation by

$$m^{\mathrm{aga}}_k = \sum_{j=1}^{J_n} a^*_{1,j} \cdot \sigma_s(\theta_1^\top X + b_{1,j}) + \dots + \sum_{j=1}^{J_n} a^*_{k,j} \cdot \sigma_s(\theta_k^\top X + b_{k,j})$$

with the coefficients $a^*_{\iota,j}$ calculated by

(2.5)   $\arg\min_{a^*_{\iota,j},\ \iota=1,\dots,k,\ j=1,\dots,J_n} \Big\| m(X) - \sum_{j=1}^{J_n} a^*_{1,j} \cdot \sigma_s(\theta_1^\top X + b_{1,j}) - \dots - \sum_{j=1}^{J_n} a^*_{k,j} \cdot \sigma_s(\theta_k^\top X + b_{k,j}) \Big\|,$

where the directions $(\theta_1, b_1), \dots, (\theta_k, b_k)$ are fixed in the minimization. Meanwhile, the residual is updated by $\delta_k = m - m^{\mathrm{aga}}_k$.
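For intuition, a rough sample-analogue sketch of the AGA iteration follows. It is our own simplification: a random search replaces the exact maximization in (2.3), a ReLU-type activation stands in for $\sigma_s$, and an unconstrained least-squares refit stands in for the norm minimization in (2.5).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, Jn, K = 300, 4, 8, 3
X = rng.uniform(-1, 1, size=(n, p))
y = np.cos(2 * (X[:, 0] - X[:, 1])) + X[:, 2] * X[:, 3] + 0.1 * rng.normal(size=n)

def sigma(u):
    return np.maximum(u, 0.0)          # ReLU-type activation standing in for sigma_s

def dictionary_features(X, theta, b):
    """n x Jn matrix with columns sigma(theta'x + b_j): one group sharing direction theta."""
    return sigma(np.outer(X @ theta, np.ones_like(b)) + b)

directions, offsets = [], []
residual = y.copy()
features = np.empty((n, 0))            # columns of all selected neurons so far
for k in range(K):
    best = None
    for _ in range(300):               # crude random search standing in for (2.3)
        theta = rng.normal(size=p)
        theta /= np.linalg.norm(theta)
        b = rng.uniform(-2, 2, size=Jn)
        Phi = dictionary_features(X, theta, b)
        a = Phi.T @ residual
        a /= np.sum(np.abs(a)) + 1e-12          # heuristic weights with sum |a_j| <= 1
        score = abs((Phi @ a) @ residual) / n   # empirical analogue of |<delta_{k-1}, d>|
        if best is None or score > best[0]:
            best = (score, theta, b)
    _, theta, b = best
    directions.append(theta); offsets.append(b)
    features = np.hstack([features, dictionary_features(X, theta, b)])
    # Refit all coefficients a*_{i,j} jointly with the directions fixed, cf. (2.5).
    coef, *_ = np.linalg.lstsq(features, y, rcond=None)
    residual = y - features @ coef     # sample version of delta_k = m - m^aga_k

print("in-sample MSE after", K, "AGA steps:", np.mean(residual ** 2))
```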
The AGA approximation, denoted by $m^{\mathrm{aga}}_k$, bears resemblance to a single-layer artificial neural network (ANN). However, in AGA each group of $J_n$ neurons shares the same $\theta$, whereas in ANN all the $\theta$'s are allowed to differ. Based on approximation results for ridge functions and ANN in [34] and [45], this difference allows for a reduction in the number of $\theta$ values required for the PPR approximation. Meanwhile, our AGA offers greater flexibility than OGA because it optimizes all coefficients $a^*_{\iota,j}$ in the projection step, as shown in (2.5).
THEOREM 1 (upper bound of AGA). Let $m^{\mathrm{aga}}_k$ be the output of AGA after the $k$-th iteration, where the input function is $m \in L_2(B^p_1, X)$. Suppose the density of $X$ satisfies $c^{-1} < f_X(x) < c$ for some constant $c > 0$. Then, we have

(2.6)   $E_X\big(m^{\mathrm{aga}}_k(X) - m(X)\big)^2 \le \inf_{h \in [L_2(B^p_1, X)]_1(\mathrm{Dic}^s_n)} \Big\{ E\big(m(X) - h(X)\big)^2 + c(s,p)\, \|h\|^2_{[L_2(B^p_1, X)]_1(\mathrm{Dic}^s_n)} \cdot k^{-1-\frac{2s+1}{2p}} \Big\}$

for each iteration $k \in \mathbb{Z}_+$ and some $c(s,p) > 0$, where $[L_2(B^p_1, X)]_1(\mathrm{Dic}^s_n)$ is defined in the same way as $[H]_1(\mathrm{Dic})$.
REMARK 1. The first term of the upper bound in (2.6) is the distance between $m(x)$ and $B_1(\mathrm{Dic}^s_n)$, the function space generated by $\mathrm{Dic}^s_n$, while the second term in (2.6) indicates that the rate of AGA is $k^{-1-\frac{2s+1}{2p}}$ in the approximation space $B_1(\mathrm{Dic}^s_n)$. We will show later that this rate $k^{-1-\frac{2s+1}{2p}}$ is optimal for global approximation using any linear combination of $k$ terms in the dictionary $\mathrm{Dic}^s_n$. The constant $c(s,p)$ in (2.6) is universal; namely, it does not depend on $n \in \mathbb{Z}_+$. This fact is essential for the theoretical proof. If we use (2.4) to obtain $(\theta_k, b_{k,1}, \dots, b_{k,J_n}, a_{k,1}, \dots, a_{k,J_n})$, we can only show an approximation rate of $k^{-1}$, which is slightly slower than the optimal convergence rate above.
We end this section by giving the lower bound of AGA below, from which we can see that the rate of AGA in Theorem 1 is optimal.