Convergence Rates of Oblique Regression Trees for Flexible
Function Libraries*
Matias D. Cattaneo†, Rajita Chandak†, Jason M. Klusowski†
September 1, 2023
Abstract
We develop a theoretical framework for the analysis of oblique decision trees, where the
splits at each decision node occur at linear combinations of the covariates (as opposed to
conventional tree constructions that force axis-aligned splits involving only a single covariate).
While this methodology has garnered significant attention from the computer science and
optimization communities since the mid-80s, the advantages they offer over their axis-aligned
counterparts remain only empirically justified, and explanations for their success are largely
based on heuristics. Filling this long-standing gap between theory and practice, we show
that oblique regression trees (constructed by recursively minimizing squared error) satisfy a
type of oracle inequality and can adapt to a rich library of regression models consisting of
linear combinations of ridge functions and their limit points. This provides a quantitative
baseline to compare and contrast decision trees with other less interpretable methods, such as
projection pursuit regression and neural networks, which target similar model forms. Contrary
to popular belief, one need not always trade-off interpretability with accuracy. Specifically, we
show that, under suitable conditions, oblique decision trees achieve similar predictive accuracy
as neural networks for the same library of regression models. To address the combinatorial
complexity of finding the optimal splitting hyperplane at each decision node, our proposed
theoretical framework can accommodate many existing computational tools in the literature.
Our results rely on (arguably surprising) connections between recursive adaptive partitioning
and sequential greedy approximation algorithms for convex optimization problems (e.g.,
orthogonal greedy algorithms), which may be of independent theoretical interest. Using our
theory and methods, we also study oblique random forests.
1 Introduction
Decision trees and neural networks are conventionally seen as two contrasting approaches to
learning. The popular belief is that decision trees compromise accuracy for being easy to use and
understand, whereas neural networks are more accurate, but at the cost of being less transparent.
*The authors would like to thank Florentina Bunea, Sameer Deshpande, Jianqing Fan, Yingying Fan, Jonathan Siegel, Bartolomeo Stellato, and William Underwood for insightful discussions. The authors are particularly grateful to two anonymous reviewers whose comments improved the quality of the paper. MDC was supported in part by the National Science Foundation through SES-2019432 and SES-2241575. JMK was supported in part by the National Science Foundation through CAREER DMS-2239448, DMS-2054808, and HDR TRIPODS CCF-1934924.
†Department of Operations Research and Financial Engineering, Princeton University.
We challenge the status quo by showing that, under suitable conditions, oblique decision trees (also
known as multivariate decision trees) achieve similar predictive accuracy as neural networks on
the same library of regression models. Of course, while it is somewhat subjective as to what one
regards as being transparent, it is generally agreed upon that neural networks are less interpretable
than decision trees (Murdoch et al., 2019; Rudin, 2019). Indeed, trees are arguably more intuitive
in their construction, which makes it easier to understand how an output is assigned to a given
input, including which predictor variables were relevant in its determination. For example, in
clinical, legal, or business contexts, it may be desirable to build a predictive model that mimics
the way a human user thinks and reasons, especially if the results (of scientific or evidential value)
are to be communicated to a statistical lay audience. Even though it may be sensible to deploy
estimators that more directly target the functional form of the model, predictive accuracy is not the
only factor the modern researcher must consider when designing and building an automated system.
Facilitating human-machine interaction and engagement is also an essential part of this process.
To this end, the technique of knowledge distillation (Buciluǎ et al., 2006) is a quick and
easy way to enhance the fidelity of an interpretable model, without degrading the out-of-sample
performance too severely. In the context of decision trees and neural networks, one distills the
knowledge acquired by a neural network—which relies on nontransparent, distributed hierarchical
representations of the data—and expresses similar knowledge in a decision tree that consists of,
in contrast, easier-to-understand hierarchical decision rules (Frosst and Hinton, 2017). This is
accomplished by first training a neural network on the observed data, and then, in turn, training a
decision tree on data generated from the fitted neural network model.
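A minimal sketch of this two-step distillation pipeline is given below, assuming scikit-learn-style estimators. Note that scikit-learn's DecisionTreeRegressor uses axis-aligned splits, so the snippet only illustrates the workflow (not oblique CART itself), and the data-generating model and hyperparameters are purely illustrative.

```python
# A minimal sketch of the distillation workflow described above (assumes
# scikit-learn; the data-generating process and hyperparameters are illustrative).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 5))                      # observed covariates
y = np.sin(X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0])) + 0.1 * rng.normal(size=2000)

# Step 1: train the "teacher" neural network on the observed data.
teacher = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000).fit(X, y)

# Step 2: train the "student" decision tree on data generated from the fitted network.
X_synth = rng.uniform(-1, 1, size=(20000, 5))               # fresh inputs from the covariate domain
y_synth = teacher.predict(X_synth)                          # teacher predictions serve as labels
student = DecisionTreeRegressor(max_depth=4).fit(X_synth, y_synth)
```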
In this paper, we show that oblique regression trees (constructed by recursively minimizing squared
error) satisfy a type of oracle inequality and can adapt to a rich library of regression models
consisting of linear combinations of ridge functions. This provides a quantitative baseline to
compare and contrast decision trees with other less interpretable methods, such as projection pursuit
regression, neural networks, and boosting machines, which directly target similar model forms.
When neural network and decision tree models are used in tandem to enhance generalization and
interpretability, our theory allows one to measure the knowledge distilled from a neural network to
a decision tree. Using our theory and methods, we also study oblique random forests.
1.1 Background and Prior Work
Let $(y_1, \mathbf{x}_1^T), \ldots, (y_n, \mathbf{x}_n^T)$ be a random sample from a joint distribution $P_{(y,\mathbf{x})} = P_{y\mid\mathbf{x}} P_{\mathbf{x}}$ supported on $\mathcal{Y} \times \mathcal{X}$. Here $\mathbf{x} = (x_1, \ldots, x_p)^T$ is a vector of $p$ predictor variables supported on $\mathcal{X} \subseteq \mathbb{R}^p$ and $y$ is a real-valued outcome variable with range $\mathcal{Y} \subseteq \mathbb{R}$. Our objective is to compute an estimate of the conditional expectation, $\mu(\mathbf{x}) = \mathbb{E}[y \mid \mathbf{x}]$, a target which is optimal for predicting $y$ from some function of $\mathbf{x}$ in mean squared error. One estimation scheme can be constructed by dividing the input space $\mathcal{X}$ into subgroups based on shared characteristics of $y$—something decision trees can do well.
A decision tree is a hierarchically organized data structure constructed in a top down, greedy manner through recursive binary splitting. According to CART methodology (Breiman et al., 1984), a parent node $\mathrm{t}$ (i.e., a region in $\mathcal{X}$) in the tree is divided into two child nodes, $\mathrm{t}_L$ and $\mathrm{t}_R$, by maximizing the decrease in sum-of-squares error (SSE)
$$\widehat{\Delta}(b, \mathbf{a}, \mathrm{t}) \;=\; \frac{1}{n}\sum_{\mathbf{x}_i \in \mathrm{t}}\big(y_i - \bar{y}_{\mathrm{t}}\big)^2 \;-\; \frac{1}{n}\sum_{\mathbf{x}_i \in \mathrm{t}}\big(y_i - \bar{y}_{\mathrm{t}_L}\mathbf{1}(\mathbf{a}^T\mathbf{x}_i \le b) - \bar{y}_{\mathrm{t}_R}\mathbf{1}(\mathbf{a}^T\mathbf{x}_i > b)\big)^2, \tag{1}$$
with respect to $(b, \mathbf{a})$, with $\mathbf{1}(\cdot)$ denoting the indicator function and $\bar{y}_{\mathrm{t}}$ denoting the sample average of the $y_i$ data whose corresponding $\mathbf{x}_i$ data lies in the node $\mathrm{t}$. In the conventional axis-aligned (or, univariate) CART algorithm (Breiman et al., 1984, Section 2.2), splits occur along values of a single covariate, and so the search space for $\mathbf{a}$ is restricted to the set of standard basis vectors in $\mathbb{R}^p$. In this case, the induced partition of the input space $\mathcal{X}$ is a set of hyper-rectangles. On the other hand, the oblique CART algorithm (Breiman et al., 1984, Section 5.2) allows for linear combinations of covariates, extending the search space for $\mathbf{a}$ to be all of $\mathbb{R}^p$. Such a procedure generates regions in $\mathbb{R}^p$ that are convex polytopes.
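To make the criterion in (1) concrete, the following sketch evaluates the decrease in SSE for a single candidate pair $(b, \mathbf{a})$ at a node; the function and variable names are ours, chosen for illustration, and do not come from any reference implementation.

```python
# Illustrative evaluation of the impurity decrease in (1) for one candidate
# oblique split (b, a) of a node.
import numpy as np

def sse_decrease(X_node, y_node, a, b, n_total):
    """Decrease in SSE from splitting the node along the hyperplane a^T x = b."""
    parent_sse = np.sum((y_node - y_node.mean()) ** 2)
    left = X_node @ a <= b                                  # indicator 1(a^T x_i <= b)
    child_sse = 0.0
    for mask in (left, ~left):
        if mask.any():
            child_sse += np.sum((y_node[mask] - y_node[mask].mean()) ** 2)
    return (parent_sse - child_sse) / n_total               # normalized by the full sample size n
```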
The solution of (1) yields estimates $(\hat{b}, \hat{\mathbf{a}})$, and the refinement of $\mathrm{t}$ produces child nodes $\mathrm{t}_L = \{\mathbf{x} \in \mathrm{t} : \hat{\mathbf{a}}^T\mathbf{x} \le \hat{b}\}$ and $\mathrm{t}_R = \{\mathbf{x} \in \mathrm{t} : \hat{\mathbf{a}}^T\mathbf{x} > \hat{b}\}$. These child nodes become new parent nodes at the next level of the tree and can be further refined in the same manner until a desired depth is reached. To obtain a maximal decision tree $T_K$ of depth $K$, the procedure is iterated $K$ times or until either (i) the node contains a single data point $(y_i, \mathbf{x}_i^T)$ or (ii) all input values $\mathbf{x}_i$ and/or all response values $y_i$ within the node are the same. The maximal decision tree with maximum depth is denoted by $T_{\max}$. An illustration of a maximal oblique decision tree with depth $K = 2$ is shown in Figure 1. For contrast, in Figure 2, we show a maximal axis-aligned decision tree with depth $K = 2$.
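The recursion just described can be summarized schematically as follows; the stopping rules are simplified and `best_split` stands in for whichever strategy is used to (approximately) maximize (1), so this is a sketch rather than the paper's implementation.

```python
# Schematic recursive construction of a maximal tree of depth K (simplified
# stopping rules; `best_split` abstracts the search over (b, a) in (1)).
import numpy as np

def grow(X_node, y_node, depth, K, best_split):
    # Stop at the target depth, at a single data point, or when responses are constant.
    if depth == K or len(y_node) <= 1 or np.ptp(y_node) == 0:
        value = y_node.mean() if len(y_node) else 0.0       # node output as in (2)
        return {"leaf": True, "value": value}
    a, b = best_split(X_node, y_node)                       # (approximately) maximize (1)
    left = X_node @ a <= b
    return {"leaf": False, "a": a, "b": b,
            "left": grow(X_node[left], y_node[left], depth + 1, K, best_split),
            "right": grow(X_node[~left], y_node[~left], depth + 1, K, best_split)}
```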
In a conventional regression problem, where the goal is to estimate the conditional mean response $\mu(\mathbf{x})$, the canonical tree output for $\mathbf{x} \in \mathrm{t}$ is $\bar{y}_{\mathrm{t}}$, i.e., if $T$ is a decision tree, then
$$\hat{\mu}(T)(\mathbf{x}) = \bar{y}_{\mathrm{t}} = \frac{1}{n(\mathrm{t})}\sum_{\mathbf{x}_i \in \mathrm{t}} y_i, \tag{2}$$
where $n(\mathrm{t})$ denotes the number of observations in the node $\mathrm{t}$. However, one can aggregate the data in each node in a number of ways, depending on the form of the target estimand. In the most general setting, under weak assumptions, all of our forthcoming theory holds when the node output is the result of a least squares projection onto the linear span of a finite dictionary $\mathcal{H}$ that includes the constant function (e.g., polynomials, splines), that is, $\hat{y}_{\mathrm{t}} \in \operatorname{argmin}_{h \in \operatorname{span}(\mathcal{H})} \sum_{\mathbf{x}_i \in \mathrm{t}} (y_i - h(\mathbf{x}_i))^2$.
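As an illustration of this more general node output, the sketch below projects a node's responses onto the span of a constant-plus-linear dictionary via least squares; the particular dictionary is only an example of a possible choice of $\mathcal{H}$.

```python
# Example of the generalized node output: least-squares projection of the node's
# responses onto span{1, x_1, ..., x_p} (this dictionary H is illustrative).
import numpy as np

def node_output(X_node, y_node):
    """Return a function x -> y_hat_t(x) fitted by least squares within the node."""
    H = np.hstack([np.ones((X_node.shape[0], 1)), X_node])  # constant + linear terms
    coef, *_ = np.linalg.lstsq(H, y_node, rcond=None)
    return lambda x: np.concatenate(([1.0], x)) @ coef
```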
Figure 1: A maximal oblique decision tree with depth $K = 2$ in $p = 2$ dimensions. Splits occur along hyperplanes of the form $a_1 x_1 + a_2 x_2 = b$.
Figure 2: A maximal axis-aligned decision tree with depth $K = 2$ in $p = 2$ dimensions. Splits occur along individual covariates of the form $x_j = b$ for $j = 1, 2$.
Figure 3: A single hidden layer neural network with $K$ hidden nodes.
One of the main practical issues with oblique CART is that the computational complexity of minimizing the squared error in (1) in each node is extremely demanding (in fact, it is NP-hard). For example, if we desire to split a node $\mathrm{t}$ with $n(\mathrm{t})$ observations for axis-aligned CART, an exhaustive search would require at most $p \cdot n(\mathrm{t})$ evaluations, whereas oblique CART would require a prodigious $2^p\binom{n(\mathrm{t})}{p}$ evaluations (Murthy et al., 1994).
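To give a sense of scale for these two counts, the small calculation below compares them for a node with, say, $n(\mathrm{t}) = 100$ and $p = 5$ (numbers chosen only for illustration).

```python
# Rough size of the two search spaces for a node with n(t) = 100 and p = 5.
from math import comb

n_t, p = 100, 5
axis_aligned = p * n_t                  # at most p * n(t) candidate axis-aligned splits
oblique = 2**p * comb(n_t, p)           # 2^p * C(n(t), p) distinct oblique splits
print(axis_aligned, oblique)            # 500 versus 2,409,200,640
```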
To deal with these computational demands, Breiman et al. (1984) first suggested a method for inducing oblique decision trees. They use a fully deterministic hill-climbing algorithm to search for the best oblique split. A backward feature elimination process is also carried out to delete irrelevant features from the split. Heath et al. (1993) propose a simulated annealing optimization algorithm, which uses randomization to search for the best split to potentially avoid getting stuck in a local optimum. Murthy et al. (1994) use a combination of deterministic hill-climbing and random perturbations in an attempt to find a good hyperplane. See Brodley and Utgoff (1995) for additional variations on these algorithms. Other works employ statistical techniques like linear discriminant analysis (LDA) (López-Chau et al., 2013; Li et al., 2003; Loh and Shih, 1997), principal components analysis (PCA) (Menze et al., 2011; Rodriguez et al., 2006), and random projections (Tomita et al., 2020).
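By way of illustration, the sketch below implements one simple randomized strategy in the spirit of the random-projection and perturbation heuristics cited above; it is not a faithful implementation of any specific published algorithm.

```python
# A toy randomized search for a good oblique split: draw random directions,
# try a few thresholds along each, and keep the split with the smallest child SSE
# (equivalently, the largest decrease in (1)). Purely illustrative.
import numpy as np

def random_oblique_split(X_node, y_node, n_directions=200, seed=0):
    rng = np.random.default_rng(seed)
    best_sse, best_a, best_b = np.inf, None, None
    for _ in range(n_directions):
        a = rng.normal(size=X_node.shape[1])                # random split direction
        a /= np.linalg.norm(a)
        proj = X_node @ a
        for b in np.quantile(proj, [0.25, 0.5, 0.75]):      # a few candidate thresholds
            left = proj <= b
            if left.all() or not left.any():
                continue                                    # skip degenerate splits
            sse = (np.sum((y_node[left] - y_node[left].mean()) ** 2)
                   + np.sum((y_node[~left] - y_node[~left].mean()) ** 2))
            if sse < best_sse:
                best_sse, best_a, best_b = sse, a, b
    return best_a, best_b
```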
While not the focus of the present paper, regarding non-greedy training, other researchers have attempted to find globally optimal tree solutions using linear programming (Bennett, 1994) or mixed-integer linear programming (Bertsimas and Dunn, 2017; Bertsimas et al., 2021). It should be clear that all of our results hold verbatim for optimal trees, as greedy implementations belong to the same feasible set. While usually better than greedy trees in terms of predictive performance, scalability to large data sets is the most salient obstacle with globally optimal trees. Moreover, on a qualitative level, a globally optimal tree arguably detracts from the interpretability, as humans, in contrast, often exhibit bounded rationality and therefore make decisions in a more sequential (rather than anticipatory) manner (Hüllermeier et al., 2021, and references therein). Relatedly, another training technique is based on constructing deep neural networks that realize oblique decision trees (Lee and Jaakkola, 2020; Yang et al., 2018) and then utilizing tools designed for training neural networks.
While there has been a plethora of greedy algorithms over the past 30 years for training oblique decision trees, the literature is essentially silent on their statistical properties. For instance, assuming one can come close to optimizing (1), what types of regression functions can greedy oblique trees estimate, and how well?
1.2 Ridge Expansions
Many empirical studies reveal that oblique trees generally produce smaller trees with better accuracy compared to axis-aligned trees (Heath et al., 1993; Murthy et al., 1994) and can often be comparable, in terms of performance, to neural networks (Bertsimas et al., 2018; Bertsimas and Dunn, 2019; Bertsimas and Stellato, 2021). Intuitively, allowing a tree-building system to use both oblique and axis-aligned splits broadens its flexibility. To theoretically showcase these qualities and make comparisons with other procedures (such as neural networks and projection pursuit regression), we will consider modeling $\mu$ with finite linear combinations of ridge functions, i.e., the library
$$\mathcal{G} = \bigg\{ g(\mathbf{x}) = \sum_{k=1}^{M} g_k(\mathbf{a}_k^T\mathbf{x}),\; \mathbf{a}_k \in \mathbb{R}^p,\; g_k : \mathbb{R} \to \mathbb{R},\; k = 1, \ldots, M,\; M \ge 1,\; \|g\|_{L_1} < \infty \bigg\},$$
where $\|\cdot\|_{L_1}$ is a total variation norm that is defined in Section 2.1. This library encompasses the functions produced from projection pursuit regression, and, more specifically—by taking $g_k(z) = \phi(z - b_k)$, where $\phi$ is a fixed activation function, such as a sigmoid function or ReLU, and $b_k \in \mathbb{R}$ is a bias parameter—single hidden layer feed-forward neural networks. A graphical representation of such a neural network is provided in Figure 3. A neural network forms predictions according to distributed hierarchical representations of the data, whereas a decision tree uses hierarchical decision rules (cf. Figures 1 and 2).
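To make the connection explicit, the toy example below builds a member of $\mathcal{G}$ with $M = 3$ ridge functions of the form $g_k(z) = \phi(z - b_k)$ with $\phi$ the ReLU, i.e., a single hidden layer network as in Figure 3; the weights and biases are arbitrary illustrative values.

```python
# A toy member of the library G: g(x) = sum_k phi(a_k^T x - b_k) with phi = ReLU,
# which is exactly a single-hidden-layer feed-forward network (weights are arbitrary).
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

A = np.array([[1.0, -0.5, 0.0],     # a_1
              [0.3,  0.3, 0.3],     # a_2
              [0.0,  1.0, -1.0]])   # a_3   (M = 3 ridge directions in R^p with p = 3)
b = np.array([0.0, 0.5, -0.2])      # bias parameters b_k

def g(x):
    """Evaluate g(x) = sum_{k=1}^{M} relu(a_k^T x - b_k)."""
    return float(np.sum(relu(A @ x - b)))
```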
Since the first version of our manuscript was released on arXiv, several subsequent papers have employed our novel theoretical and methodological statistical framework to derive consistency results for decision trees and related methods. For example, Zhan et al. (2023) applies our core ideas and proof techniques to deduce a consistency result for oblique decision trees in low-dimensional settings (cf. Corollary 2 below), but under stronger assumptions on the target function class $\mathcal{G}$ and without accounting for the underlying optimization constraints (cf. our novel optimization framework in Section 2.2). Raymaekers et al. (2023) also applies our core ideas and proof techniques to deduce a consistency result for axis-aligned decision trees within an alternative computation framework, but under stronger assumptions on the target function class $\mathcal{G}$. Finally, Parhi and Nowak (2023) and DeVore et al. (2023), among others (see their references), study consistency of deep