Ensemble Projection Pursuit for General Nonparametric Regression

arXiv:2210.14467v2 [math.ST] 18 May 2023
Submitted to the Annals of Statistics
ENSEMBLE PROJECTION PURSUIT FOR GENERAL NONPARAMETRIC
REGRESSION
BY HAORAN ZHAN$^a$, MINGKE ZHANG$^b$ AND YINGCUN XIA$^c$
Department of Statistics and Data Science, National University of Singapore
$^a$haoran.zhan@u.nus.edu, $^b$mingke.zhang@u.nus.edu, $^c$yingcun.xia@nus.edu.sg
The projection pursuit regression (PPR) has played an important role in
the development of statistics and machine learning. However, when compared
to other established methods like random forests (RF) and support vector
machines (SVM), PPR has yet to showcase a similar level of accuracy as a
statistical learning technique. In this paper, we revisit the estimation of PPR
and propose an optimal greedy algorithm and an ensemble approach via "fea-
ture bagging", hereafter referred to as ePPR, aiming to improve the efficacy.
Compared to RF, ePPR has two main advantages. Firstly, its theoretical con-
sistency can be proved for more general regression functions as long as they
are $L_2$-integrable, and higher consistency rates can be achieved. Secondly,
ePPR does not split the samples, and thus each term of PPR is estimated us-
ing the whole data, making the minimization more efficient and guaranteeing
the smoothness of the estimator. Extensive comparisons based on real data
sets show that ePPR is more efficient in regression and classification than
RF and other competitors. The efficacy of ePPR, a variant of Artificial Neu-
ral Networks (ANN), demonstrates that with suitable statistical tuning, ANN
can equal or even exceed RF in dealing with small to medium-sized datasets.
This revelation challenges the widespread belief that ANN’s superiority over
RF is limited to processing extensive sample sizes.

MSC2020 subject classifications: Primary 62G08; secondary 62G20.
Keywords and phrases: convergence of algorithm, consistency of estimation, greedy algorithm, nonparametric smoothing, projection pursuit regression, random forest.
1. Introduction. The technique of projection pursuit, which was first introduced by [30]
and [21], aims to reveal the most significant structures within high-dimensional data by se-
quentially utilizing lower-dimensional projections. Once a projection has been identified, the
structure that lies along it is removed, and the procedure is then reiterated to detect the next
projection. This idea was later implemented for regression by [20], who referred to this tech-
nique as the projection pursuit regression (PPR) and popularized it in the 1980s. PPR is well-
suited for estimating general nonparametric regressions due to its capability to approximate
any general function [27,34,35].
More specifically, consider the general regression problem with response $Y$ and predictor $X \in \mathbb{R}^p$. Our main interest lies in the conditional mean, $m(X) = E(Y \mid X)$. Following the categorization of two cultures of [6], the projection pursuit regression can be used in two ways. The first way to use PPR is by assuming the model $m(X) = g_1(\theta_1^\top X) + \dots + g_K(\theta_K^\top X)$, where $K$ is assumed to be a fixed and finite, yet unknown, number. See, for example, [12], [23] and [44]. It is evident that this statistical model may not be appropriate for real-world data and is not recommended by [6]. The second way to use PPR is by approximating or estimating a general nonparametric regression based on the following fact: for any smooth function $m$, there exists a sequence of projection directions $\{\theta_k \in \mathbb{R}^p : k = 1, 2, \dots\}$ and a sequence of ridge functions $\{g_k(\cdot), k = 1, 2, \dots\}$ satisfying

(1.1)   PPR expansion:  $m(X) = \sum_{k=1}^{\infty} g_k(\theta_k^\top X)$  a.s.

Under mild conditions, [34] and [35] established that the PPR expansion holds for a general function $m(x)$ in the $L_2$ sense, and the PPR algorithm proposed in [20] is an implementation of this expansion.
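To fix ideas, here is a minimal numerical sketch (ours, not from the paper) of a two-term truncation of the expansion (1.1); the directions and ridge functions below are arbitrary illustrative choices, not quantities estimated by any algorithm in the paper.

```python
import numpy as np

# Illustrative truncated PPR expansion m(x) ~ g_1(theta_1'x) + g_2(theta_2'x):
# the directions and ridge functions below are arbitrary examples.
theta = np.array([[1.0, 1.0, 0.0],
                  [1.0, -1.0, 2.0]])
theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # unit-norm directions

ridge_funcs = [np.tanh, np.sin]                          # g_1, g_2

def ppr_expansion(X):
    """Evaluate the two-term ridge expansion at the rows of X (n x 3)."""
    projections = X @ theta.T                            # n x 2 matrix of theta_k'x
    return sum(g(projections[:, k]) for k, g in enumerate(ridge_funcs))

X = np.random.default_rng(0).normal(size=(5, 3))
print(ppr_expansion(X))
```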
The PPR expansion can be estimated with two different approaches: global minimization and the greedy algorithm. Global minimization minimizes the loss function over all the ridge functions and projection directions simultaneously, i.e.,

$$\min_{g_k, \theta_k,\ k=1,2,\dots} E\Big( Y - \sum_{k=1}^{\infty} g_k(\theta_k^\top X) \Big)^2.$$
This is the classical estimation method in statistics and machine learning, including the estimation of artificial neural networks (ANN). In practice, the number of terms in the expansion must be finite and is allowed to grow to infinity with the sample size. With i.i.d. data $\{(X_i, Y_i)\}_{i=1}^{n}$, the estimator based on global minimization is $\hat m_n(X) = \sum_{k=1}^{K_n} \hat g_k(\hat\theta_k^\top X)$, where

$$(\hat g_k, \hat\theta_k) = \arg\min_{g_k \in V_n,\ \theta_k \in \Theta_p} \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - \sum_{k=1}^{K_n} g_k(\theta_k^\top X_i) \Big)^2$$

and $V_n$ is an approximation function space which will be discussed later. Under some regularity conditions, we can show that $\hat m_n$ achieves a nearly optimal consistency rate, i.e.,

$$E\big(\hat m_n(X) - m(X)\big)^2 \le c\,(\ln^4 n / n)^{2r/(2r+p)},$$

where $c > 0$ and $r$ is the degree of smoothness of $m$; see the SUPPLEMENTARY MATERIALS (SM) for the proof.
(SM) for the proof. This method has good statistical asymptotic properties, but in practice, its
advantages are hardly observed, perhaps because of the large number of parameters required
for estimation.
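For concreteness, the sketch below illustrates what a (toy) global minimization looks like: all directions and ridge-function coefficients are optimized jointly by least squares, with each $g_k$ represented by a low-degree polynomial. This is our own simplified setup; it is not the estimator, the approximation space $V_n$, or the tuning used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, K = 200, 4, 2          # sample size, dimension, number of ridge terms (K_n)
deg = 3                      # each ridge function g_k is a cubic polynomial here

X = rng.normal(size=(n, p))
y = np.sin(X[:, 0] + X[:, 1]) + 0.5 * (X[:, 2] - X[:, 3]) ** 2 + 0.1 * rng.normal(size=n)

def unpack(w):
    """Split the parameter vector into unit directions and polynomial coefficients."""
    thetas = w[:K * p].reshape(K, p)
    thetas = thetas / (np.linalg.norm(thetas, axis=1, keepdims=True) + 1e-12)
    coefs = w[K * p:].reshape(K, deg + 1)
    return thetas, coefs

def model(w, X):
    thetas, coefs = unpack(w)
    proj = X @ thetas.T                                              # n x K projections
    basis = np.stack([proj ** d for d in range(deg + 1)], axis=-1)   # n x K x (deg+1)
    return np.einsum("nkd,kd->n", basis, coefs)

def loss(w):
    return np.mean((y - model(w, X)) ** 2)

w0 = rng.normal(scale=0.1, size=K * p + K * (deg + 1))
fit = minimize(loss, w0, method="L-BFGS-B")   # joint minimization over all parameters
print("training MSE:", loss(fit.x))
```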
Instead of estimating $g_k(\theta_k^\top x)$, $k = 1, 2, \dots$, together, the greedy algorithm estimates the PPR expansion term by term in a sequential manner; this approach was originally introduced by [20] in their PPR estimation procedure, as described below. Let

(1.2)   $\theta_1 = \arg\min_{\theta : \|\theta\|_2 = 1} E\big(Y - E(Y \mid X^\top\theta)\big)^2$

and $g_1(v) = E(Y \mid X^\top\theta_1 = v)$. For $k \ge 2$, let the residual $\delta_k = Y - \sum_{\iota=1}^{k-1} g_\iota(X^\top\theta_\iota)$, and calculate

(1.3)   $\theta_k = \arg\min_{\theta : \|\theta\|_2 = 1} E\big(\delta_k - E(\delta_k \mid X^\top\theta)\big)^2$

and $g_k(v) = E(\delta_k \mid X^\top\theta_k = v)$. [26] proved that the PPR algorithm converges in the sense that

(1.4)   $E\Big[\Big(m(X) - \sum_{k=1}^{K} g_k(X^\top\theta_k)\Big)^2\Big] \to 0, \quad \text{as } K \to \infty.$
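A rough sample analogue of the term-by-term scheme (1.2)-(1.3) is sketched below, with a crude polynomial regression standing in for the conditional-expectation smoother and a random search standing in for the direction optimization; it is meant only to convey the structure of the greedy fit, not the algorithm of [20] or the AGA proposed later.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 300, 5, 3
X = rng.normal(size=(n, p))
y = np.exp(X[:, 0] - X[:, 1]) / 3 + np.sin(X[:, 2] + X[:, 3]) + 0.1 * rng.normal(size=n)

def smooth_1d(t, r, deg=4):
    """Crude ridge-function estimate of E(r | t): polynomial regression in t."""
    B = np.vander(t, deg + 1)
    coef, *_ = np.linalg.lstsq(B, r, rcond=None)
    return lambda s: np.vander(s, deg + 1) @ coef

residual = y.copy()
terms = []                       # list of (theta, fitted ridge function)
for k in range(K):
    best = None
    for _ in range(500):         # random search over unit directions, cf. (1.3)
        theta = rng.normal(size=p)
        theta /= np.linalg.norm(theta)
        g = smooth_1d(X @ theta, residual)
        sse = np.sum((residual - g(X @ theta)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, theta, g)
    _, theta, g = best
    terms.append((theta, g))
    residual = residual - g(X @ theta)   # update the residual, cf. delta_{k+1}

print("residual variance after", K, "terms:", residual.var())
```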
In the literature, various improvements to the above greedy algorithm have been proposed. For instance, [27] proposed the relaxed greedy algorithm (RGA) as a means of accelerating the convergence rate of the PPR algorithm. In RGA, the approximation obtained in step $k$ takes the form $\alpha \cdot \sum_{\iota=1}^{k-1} g_\iota(X^\top\theta_\iota) + (1-\alpha) \cdot g_k(X^\top\theta_k)$ for some $\alpha \in (0, 1)$. The orthogonal greedy algorithm (OGA), which is well known in statistical learning and signal processing (see [9] and [3]), is another improvement. In OGA, once $g_k(X^\top\theta_k)$ is obtained in the $k$-th iteration, the approximation is updated by projecting $m(X)$ onto the linear space spanned by $\{g_\iota(X^\top\theta_\iota)\}_{\iota=1}^{k}$.
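The OGA update reduces, in its sample form, to a least-squares refit of the scalar coefficients on the ridge terms obtained so far. A minimal sketch of that single projection step is given below; the arrays G and y are placeholders for the evaluated ridge terms and the responses, not objects defined in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 4
# Placeholder data: the columns of G stand for the fitted ridge terms g_i(X'theta_i)
# evaluated at the n sample points; y stands for the responses.
G = rng.normal(size=(n, k))
y = G @ np.array([1.0, -0.5, 0.3, 0.0]) + 0.1 * rng.normal(size=n)

# OGA projection step: least-squares projection of y onto span{g_1, ..., g_k}.
beta, *_ = np.linalg.lstsq(G, y, rcond=None)
approximation = G @ beta
print("coefficients after projection:", np.round(beta, 3))
```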
Although PPR has good algorithmic and mathematical guarantees, it has not gained popularity among researchers, nor has it been extensively studied. Existing computer packages for PPR do not provide good prediction accuracy in comparison with other popular methods such as RF and the support vector machine (SVM) [15]. [24] suggested that PPR's limited popularity was due to its computational complexity during the 1980s, when computational power was significantly lower than it is today. As a result, the artificial neural network (ANN) became the more popular choice, although the two methods share some similarities.
tremendous developments in computational power and algorithms, there should be a growing
interest in studying PPR more comprehensively. This is the primary motivation behind this
paper.
The purpose of this paper is to investigate the greedy algorithm of PPR approximation
and its convergence theory. To further improve the prediction accuracy of PPR, we propose a
boosting method, hereafter called ensemble PPR (ePPR). The fundamental idea is to search for the projections in (1.2) and (1.3) within a randomly selected subset of the components of X, similarly to how RF selects split variables [7]; this is also known as "feature bagging" [25]. Thus, each
run of the algorithm produces a random PPR approximation of the unknown regression func-
tion. The ensemble technique is then employed to generate a final estimate of the regression
function.
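To make the ensemble idea concrete before its formal definition in Section 4, the following schematic sketch (our own simplification, not the ePPR algorithm of this paper) fits several greedy projection-pursuit learners, each restricted to a random subset of the coordinates of X, and averages their predictions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 6
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0] + X[:, 1]) + (X[:, 2] ** 2) / 4 + 0.1 * rng.normal(size=n)

def fit_greedy_ppr(X, y, n_terms=3, n_dirs=200, deg=4, rng=rng):
    """Toy greedy PPR fit with polynomial ridge smoothers; returns a predictor."""
    n, p = X.shape
    residual, terms = y.copy(), []
    for _ in range(n_terms):
        best = None
        for _ in range(n_dirs):                     # random search over directions
            theta = rng.normal(size=p)
            theta /= np.linalg.norm(theta)
            B = np.vander(X @ theta, deg + 1)
            coef, *_ = np.linalg.lstsq(B, residual, rcond=None)
            sse = np.sum((residual - B @ coef) ** 2)
            if best is None or sse < best[0]:
                best = (sse, theta, coef)
        _, theta, coef = best
        terms.append((theta, coef))
        residual = residual - np.vander(X @ theta, deg + 1) @ coef
    def predict(Xnew):
        return sum(np.vander(Xnew @ th, deg + 1) @ cf for th, cf in terms)
    return predict

# "Feature bagging": each base PPR only sees a random subset of the coordinates.
n_learners, subset_size = 10, 3
predictors = []
for b in range(n_learners):
    idx = rng.choice(p, size=subset_size, replace=False)
    predictors.append((idx, fit_greedy_ppr(X[:, idx], y)))

def ensemble_predict(Xnew):
    # Average the base learners to obtain the ensemble prediction.
    return np.mean([pred(Xnew[:, idx]) for idx, pred in predictors], axis=0)

print("in-sample MSE:", np.mean((y - ensemble_predict(X)) ** 2))
```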
1.1. Related work and our contribution. Our work consists of two parts: a greedy algorithm for PPR and a boosting method for a statistical implementation of PPR. Our contributions are summarized as follows.

• An optimal greedy algorithm, called the Additive Greedy Algorithm (AGA). The algorithm is motivated by the orthogonal greedy algorithm (OGA) [33] and by the additive structure of the expansion. We show that AGA achieves the optimal convergence rate of greedy algorithms and is faster than the existing algorithms for the PPR expansion (see [3]).

• An ensemble estimator, called ensemble PPR (ePPR). This is partly motivated by the RF of [6]; see also [47] and [1] for recent developments of RF. ePPR parallels one of the core elements of RF: the former uses recursive projection where the latter uses recursive partitioning. ePPR also borrows two other core elements of RF: random selection of predictors, also known as feature bagging, and averaging of the individual estimates to form the ensemble.

• Advantages over RF, the benchmark for nonparametric regression based on greedy algorithms. These advantages include: (i) As noted by [18], RF has limited ability to fit smooth signals because of the step functions inherited from the classification and regression tree (CART). In contrast, PPR does not split the samples and uses the whole data set in each step, making the minimization more efficient and guaranteeing a smooth estimator. (ii) Moreover, while RF theoretically works only for some special classes of regression functions (see, for example, [37] and [14]), PPR and ePPR can handle more general regression functions as long as the underlying function is $L_2$-integrable. (iii) Additionally, ePPR attains a higher consistency rate than RF, whose rates are established in [28] for additive models and in [14] for slightly more general regression functions.

• Our work also sheds light on the long-standing debate about when ANN or RF has the advantage. As a special variant of ANN, ePPR suggests that ANN can perform as well as or even better than RF on small to medium-sized data sets when statistical tuning is incorporated. This contradicts the common understanding that ANN has an advantage only for data with large sample sizes.
1.2. Organization of this paper. The rest of this paper is organized as follows. Section 2 introduces notation, proposes the greedy algorithm AGA, and studies its convergence rate. Section 3 considers the consistency of statistical estimators based on AGA. Details of ePPR are presented in Section 4, where its statistical consistency is also studied. Extensive numerical studies on real data sets, including comparisons with popular machine learning methods such as RF and SVM, are presented in Section 5. The technical proofs are provided in Section 6, while some auxiliary results and lemmas used in the proofs are given in the SUPPLEMENTARY MATERIALS (SM) in a separate file. A brief discussion of the problems with the proposed ePPR is given in Section 7.
2. PPR with greedy algorithm. A greedy algorithm is always associated with a dictionary $\mathrm{Dic} \subset H$, where $H$ is a Banach space equipped with norm $\|\cdot\|$. For any dictionary $\mathrm{Dic}$, the convex hull

$$B_1(\mathrm{Dic}) := \Big\{ \sum_{j=1}^{m} a_j d_j : d_j \in \mathrm{Dic},\ \sum_{j=1}^{m} |a_j| \le 1,\ m \in \mathbb{Z}_+ \Big\}$$

is a function class generated by $\mathrm{Dic}$ and is commonly used to analyze the convergence rate of greedy algorithms (see [3] and [40]). Then, by scaling $B_1(\mathrm{Dic})$, we obtain a linear sub-space of $H$,

(2.1)   $[H]_1(\mathrm{Dic}) := \bigcup_{c>0} c \cdot B_1(\mathrm{Dic}),$

where $c \cdot B_1(\mathrm{Dic}) := \{c \cdot \phi : \phi \in B_1(\mathrm{Dic})\}$. Therefore, for any $f \in [H]_1(\mathrm{Dic})$, the scale needed to cover $f$ by $B_1(\mathrm{Dic})$ is

$$\|f\|_{[H]_1(\mathrm{Dic})} = \inf\{c > 0 : f \in c \cdot B_1(\mathrm{Dic})\}.$$

[39] proved that $[H]_1(\mathrm{Dic})$ is also a Banach space w.r.t. the norm $\|f\|_{[H]_1(\mathrm{Dic})}$ if $\sup_{d \in \mathrm{Dic}} \|d\| < \infty$. Note that $\|f\|_{[H]_1(\mathrm{Dic})}$ depends on $\mathrm{Dic}$ and is called the variation norm or the atomic norm in the literature.
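As a simple illustration of the variation norm (our own example, not from the paper): take $\mathrm{Dic} = \{d_1, d_2\}$ with $d_1, d_2$ linearly independent and $f = 3d_1 - 2d_2$. Then $f/5 = \tfrac{3}{5}d_1 - \tfrac{2}{5}d_2 \in B_1(\mathrm{Dic})$, so

$$\|f\|_{[H]_1(\mathrm{Dic})} = \inf\{c > 0 : f \in c \cdot B_1(\mathrm{Dic})\} = |3| + |-2| = 5,$$

with equality because any representation over $\{d_1, d_2\}$ must assign total weight $3$ to $d_1$ and $-2$ to $d_2$.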
Note that if we use activation functions to approximate the ridge functions in PPR, some of the activation functions share the same direction; thus the dictionary associated with AGA is

(2.2)   $\mathrm{Dic}^s_n := \Big\{ \sum_{j=1}^{J_n} a_j \cdot \sigma_s(\theta^\top x + b_j) : \theta \in \Theta_p,\ b_j \in [-2, 2],\ \sum_{j=1}^{J_n} |a_j| \le 1 \Big\},$

where $\sigma_s(\cdot)$ is an activation function to be discussed later, and the subscript $n$ indicates that the choice of $J_n$ depends on the sample size $n$. Since $\mathrm{Dic}^s_n$ is designed to approximate $m(x)$, $x \in B^p_1$, with bounded support, we will show that a bounded range for the $b_j$ in (2.2) is sufficient, including the choice $[-2, 2]$ in (2.2). Meanwhile, the restriction $\sum_{j=1}^{J_n} |a_j| \le 1$ normalizes the possible search directions in $[H]_1(\mathrm{Dic}^s_n)$. In this case, let $H = L_2(B^p_1, X) := \{f : E_X(f^2(X)) < \infty\}$. Then, it is known that $L_2(B^p_1, X)$ is a Hilbert space equipped with norm $\|f\| := E_X(f^2(X))^{1/2}$ and inner product $\langle f, g \rangle := (\|f+g\|^2 - \|f-g\|^2)/4$ for any $f, g \in L_2(B^p_1, X)$. With the above notation, our AGA in population version can be stated as follows.
PPR.aga (population version). Starting with iteration $k = 0$, let $\delta_0 = m$ and $m^{\mathrm{aga}}_0 = 0$. For each $k \ge 1$, calculate

(2.3)   $\sum_{j=1}^{J_n} a_{k,j} \cdot \sigma_s(\theta_k^\top X + b_{k,j}) = \arg\max_{d \in \mathrm{Dic}^s_n} |\langle \delta_{k-1}, d \rangle|,$

where $|\langle \delta_{k-1}, d \rangle|$ is defined above and measures the similarity between $\delta_{k-1}$ and $d \in \mathrm{Dic}^s_n$. Another method to find the above parameters is by residual approximation:

(2.4)   $\sum_{j=1}^{J_n} a_{k,j} \cdot \sigma_s(\theta_k^\top X + b_{k,j}) = \arg\min_{h \in [L_2(B^p_1, X)]_1(\mathrm{Dic}^s_n)} \|\delta_{k-1} - h\|.$

Then, update the approximation by

$$m^{\mathrm{aga}}_k = \sum_{j=1}^{J_n} a^*_{1,j} \cdot \sigma_s(\theta_1^\top X + b_{1,j}) + \dots + \sum_{j=1}^{J_n} a^*_{k,j} \cdot \sigma_s(\theta_k^\top X + b_{k,j})$$

with the coefficients $a^*_{\iota,j}$ calculated by

(2.5)   $\arg\min_{a^*_{\iota,j},\ \iota=1,\dots,k,\ j=1,\dots,J_n} \Big\| m(X) - \sum_{j=1}^{J_n} a^*_{1,j} \cdot \sigma_s(\theta_1^\top X + b_{1,j}) - \dots - \sum_{j=1}^{J_n} a^*_{k,j} \cdot \sigma_s(\theta_k^\top X + b_{k,j}) \Big\|,$

where the directions $(\theta_1, b_1), \dots, (\theta_k, b_k)$ are fixed in the minimization. Meanwhile, the residual is updated by $\delta_k = m - m^{\mathrm{aga}}_k$.
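For intuition, a rough sample-analogue sketch of the AGA iteration follows. It is our own simplification: a random search replaces the exact maximization in (2.3), a ReLU-type activation stands in for $\sigma_s$, and an unconstrained least-squares refit stands in for the norm minimization in (2.5).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, Jn, K = 300, 4, 8, 3
X = rng.uniform(-1, 1, size=(n, p))
y = np.cos(2 * (X[:, 0] - X[:, 1])) + X[:, 2] * X[:, 3] + 0.1 * rng.normal(size=n)

def sigma(u):
    return np.maximum(u, 0.0)          # ReLU-type activation standing in for sigma_s

def dictionary_features(X, theta, b):
    """n x Jn matrix with columns sigma(theta'x + b_j): one group sharing direction theta."""
    return sigma(np.outer(X @ theta, np.ones_like(b)) + b)

directions, offsets = [], []
residual = y.copy()
features = np.empty((n, 0))            # columns of all selected neurons so far
for k in range(K):
    best = None
    for _ in range(300):               # crude random search standing in for (2.3)
        theta = rng.normal(size=p)
        theta /= np.linalg.norm(theta)
        b = rng.uniform(-2, 2, size=Jn)
        Phi = dictionary_features(X, theta, b)
        a = Phi.T @ residual
        a /= np.sum(np.abs(a)) + 1e-12          # heuristic weights with sum |a_j| <= 1
        score = abs((Phi @ a) @ residual) / n   # empirical analogue of |<delta_{k-1}, d>|
        if best is None or score > best[0]:
            best = (score, theta, b)
    _, theta, b = best
    directions.append(theta); offsets.append(b)
    features = np.hstack([features, dictionary_features(X, theta, b)])
    # Refit all coefficients a*_{i,j} jointly with the directions fixed, cf. (2.5).
    coef, *_ = np.linalg.lstsq(features, y, rcond=None)
    residual = y - features @ coef     # sample version of delta_k = m - m^aga_k

print("in-sample MSE after", K, "AGA steps:", np.mean(residual ** 2))
```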
The AGA approximation, denoted by $m^{\mathrm{aga}}_k$, bears resemblance to a single-layer artificial neural network (ANN). However, in AGA each group of $J_n$ neurons shares the same $\theta$, whereas in ANN all the $\theta$'s are allowed to differ. Based on approximation results for ridge functions and ANN in [34] and [45], this difference allows for a reduction in the number of $\theta$ values required for the PPR approximation. Meanwhile, our AGA offers greater flexibility than OGA because it optimizes all coefficients $a^*_{\iota,j}$ in the projection step, as shown in (2.5).
THEOREM 1 (upper bound of AGA). Let $m^{\mathrm{aga}}_k$ be the output of AGA after the $k$-th iteration, where the input function is $m \in L_2(B^p_1, X)$. Suppose the density of $X$ satisfies $c^{-1} < f_X(x) < c$ for some constant $c > 0$. Then, we have

(2.6)   $E_X\big(m^{\mathrm{aga}}_k(X) - m(X)\big)^2 \le \inf_{h \in [L_2(B^p_1, X)]_1(\mathrm{Dic}^s_n)} \Big\{ E\big(m(X) - h(X)\big)^2 + c(s,p)\, \|h\|^2_{[L_2(B^p_1, X)]_1(\mathrm{Dic}^s_n)} \cdot k^{-1-\frac{2s+1}{2p}} \Big\}$

for each iteration $k \in \mathbb{Z}_+$ and some $c(s,p) > 0$, where $[L_2(B^p_1, X)]_1(\mathrm{Dic}^s_n)$ is defined in the same way as $[H]_1(\mathrm{Dic})$.
REMARK 1. The first term of the upper bound in (2.6) is the distance between $m(x)$ and $B_1(\mathrm{Dic}^s_n)$, the function space generated by $\mathrm{Dic}^s_n$, while the second term in (2.6) indicates that the rate of AGA is $k^{-1-\frac{2s+1}{2p}}$ in the approximation space $B_1(\mathrm{Dic}^s_n)$. We will show later that this rate $k^{-1-\frac{2s+1}{2p}}$ is optimal for global approximation using any linear combination of $k$ terms in the dictionary $\mathrm{Dic}^s_n$. The constant $c(s,p)$ in (2.6) is universal; namely, it does not depend on $n \in \mathbb{Z}_+$. This fact is essential for the theoretical proof. If we use (2.4) to obtain $(\theta_k, b_{k,1}, \dots, b_{k,J_n}, a_{k,1}, \dots, a_{k,J_n})$, we can only show an approximation rate of $k^{-1}$, which is slightly slower than the optimal convergence rate above.
We end this section by giving the lower bound of AGA below, from which we can see that the rate of AGA in Theorem 1 is optimal.