Convergence Rates of Oblique Regression Trees for Flexible
Function Libraries*
Matias D. Cattaneo†, Rajita Chandak†, Jason M. Klusowski†
September 1, 2023
Abstract
We develop a theoretical framework for the analysis of oblique decision trees, where the
splits at each decision node occur at linear combinations of the covariates (as opposed to
conventional tree constructions that force axis-aligned splits involving only a single covariate).
While this methodology has garnered significant attention from the computer science and
optimization communities since the mid-80s, the advantages they offer over their axis-aligned
counterparts remain only empirically justified, and explanations for their success are largely
based on heuristics. Filling this long-standing gap between theory and practice, we show
that oblique regression trees (constructed by recursively minimizing squared error) satisfy a
type of oracle inequality and can adapt to a rich library of regression models consisting of
linear combinations of ridge functions and their limit points. This provides a quantitative
baseline to compare and contrast decision trees with other less interpretable methods, such as
projection pursuit regression and neural networks, which target similar model forms. Contrary
to popular belief, one need not always trade-off interpretability with accuracy. Specifically, we
show that, under suitable conditions, oblique decision trees achieve similar predictive accuracy
as neural networks for the same library of regression models. To address the combinatorial
complexity of finding the optimal splitting hyperplane at each decision node, our proposed
theoretical framework can accommodate many existing computational tools in the literature.
Our results rely on (arguably surprising) connections between recursive adaptive partitioning
and sequential greedy approximation algorithms for convex optimization problems (e.g.,
orthogonal greedy algorithms), which may be of independent theoretical interest. Using our
theory and methods, we also study oblique random forests.
1 Introduction
Decision trees and neural networks are conventionally seen as two contrasting approaches to
learning. The popular belief is that decision trees compromise accuracy for being easy to use and
understand, whereas neural networks are more accurate, but at the cost of being less transparent.
*The authors would like to thank Florentina Bunea, Sameer Deshpande, Jianqing Fan, Yingying Fan, Jonathan Siegel, Bartolomeo Stellato, and William Underwood for insightful discussions. The authors are particularly grateful to two anonymous reviewers whose comments improved the quality of the paper. MDC was supported in part by the National Science Foundation through SES-2019432 and SES-2241575. JMK was supported in part by the National Science Foundation through CAREER DMS-2239448, DMS-2054808, and HDR TRIPODS CCF-1934924.
†Department of Operations Research and Financial Engineering, Princeton University.
We challenge the status quo by showing that, under suitable conditions, oblique decision trees (also
known as multivariate decision trees) achieve similar predictive accuracy as neural networks on
the same library of regression models. Of course, while it is somewhat subjective as to what one
regards as being transparent, it is generally agreed upon that neural networks are less interpretable
than decision trees (Murdoch et al., 2019; Rudin, 2019). Indeed, trees are arguably more intuitive
in their construction, which makes it easier to understand how an output is assigned to a given
input, including which predictor variables were relevant in its determination. For example, in
clinical, legal, or business contexts, it may be desirable to build a predictive model that mimics
the way a human user thinks and reasons, especially if the results (of scientific or evidential value)
are to be communicated to a statistical lay audience. Even though it may be sensible to deploy
estimators that more directly target the functional form of the model, predictive accuracy is not the
only factor the modern researcher must consider when designing and building an automated system.
Facilitating human-machine interaction and engagement is also an essential part of this process.
To this end, the technique of knowledge distillation (Buciluǎ et al., 2006) is a quick and
easy way to enhance the fidelity of an interpretable model, without degrading the out-of-sample
performance too severely. In the context of decision trees and neural networks, one distills the
knowledge acquired by a neural network—which relies on nontransparent, distributed hierarchical
representations of the data—and expresses similar knowledge in a decision tree that consists of,
in contrast, easier-to-understand hierarchical decision rules (Frosst and Hinton, 2017). This is
accomplished by first training a neural network on the observed data, and then, in turn, training a
decision tree on data generated from the fitted neural network model.
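A minimal sketch of this two-step distillation pipeline is given below, assuming scikit-learn-style estimators. Note that scikit-learn's DecisionTreeRegressor uses axis-aligned splits, so the snippet only illustrates the workflow (not oblique CART itself), and the data-generating model and hyperparameters are purely illustrative.

```python
# A minimal sketch of the distillation workflow described above (assumes
# scikit-learn; the data-generating process and hyperparameters are illustrative).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 5))                      # observed covariates
y = np.sin(X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0])) + 0.1 * rng.normal(size=2000)

# Step 1: train the "teacher" neural network on the observed data.
teacher = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000).fit(X, y)

# Step 2: train the "student" decision tree on data generated from the fitted network.
X_synth = rng.uniform(-1, 1, size=(20000, 5))               # fresh inputs from the covariate domain
y_synth = teacher.predict(X_synth)                          # teacher predictions serve as labels
student = DecisionTreeRegressor(max_depth=4).fit(X_synth, y_synth)
```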
In this paper, we show that oblique regression trees (constructed by recursively minimizing squared
error) satisfy a type of oracle inequality and can adapt to a rich library of regression models
consisting of linear combinations of ridge functions. This provides a quantitative baseline to
compare and contrast decision trees with other less interpretable methods, such as projection pursuit
regression, neural networks, and boosting machines, which directly target similar model forms.
When neural network and decision tree models are used in tandem to enhance generalization and
interpretability, our theory allows one to measure the knowledge distilled from a neural network to
a decision tree. Using our theory and methods, we also study oblique random forests.
1.1 Background and Prior Work
Let $(y_1, \mathbf{x}_1^T), \ldots, (y_n, \mathbf{x}_n^T)$ be a random sample from a joint distribution $P_{(y,\mathbf{x})} = P_{y\mid\mathbf{x}} P_{\mathbf{x}}$ supported on $\mathcal{Y} \times \mathcal{X}$. Here $\mathbf{x} = (x_1, \ldots, x_p)^T$ is a vector of $p$ predictor variables supported on $\mathcal{X} \subseteq \mathbb{R}^p$ and $y$ is a real-valued outcome variable with range $\mathcal{Y} \subseteq \mathbb{R}$. Our objective is to compute an estimate of the conditional expectation, $\mu(\mathbf{x}) = \mathbb{E}[y \mid \mathbf{x}]$, a target which is optimal for predicting $y$ from some function of $\mathbf{x}$ in mean squared error. One estimation scheme can be constructed by dividing the input space $\mathcal{X}$ into subgroups based on shared characteristics of $y$—something decision trees can do well.
A decision tree is a hierarchically organized data structure constructed in a top down, greedy manner through recursive binary splitting. According to CART methodology (Breiman et al., 1984), a parent node $\mathrm{t}$ (i.e., a region in $\mathcal{X}$) in the tree is divided into two child nodes, $\mathrm{t}_L$ and $\mathrm{t}_R$, by maximizing the decrease in sum-of-squares error (SSE)
$$\widehat{\Delta}(b, \mathbf{a}, \mathrm{t}) \;=\; \frac{1}{n}\sum_{\mathbf{x}_i \in \mathrm{t}}\big(y_i - \bar{y}_{\mathrm{t}}\big)^2 \;-\; \frac{1}{n}\sum_{\mathbf{x}_i \in \mathrm{t}}\big(y_i - \bar{y}_{\mathrm{t}_L}\mathbf{1}(\mathbf{a}^T\mathbf{x}_i \le b) - \bar{y}_{\mathrm{t}_R}\mathbf{1}(\mathbf{a}^T\mathbf{x}_i > b)\big)^2, \tag{1}$$
with respect to $(b, \mathbf{a})$, with $\mathbf{1}(\cdot)$ denoting the indicator function and $\bar{y}_{\mathrm{t}}$ denoting the sample average of the $y_i$ data whose corresponding $\mathbf{x}_i$ data lies in the node $\mathrm{t}$. In the conventional axis-aligned (or, univariate) CART algorithm (Breiman et al., 1984, Section 2.2), splits occur along values of a single covariate, and so the search space for $\mathbf{a}$ is restricted to the set of standard basis vectors in $\mathbb{R}^p$. In this case, the induced partition of the input space $\mathcal{X}$ is a set of hyper-rectangles. On the other hand, the oblique CART algorithm (Breiman et al., 1984, Section 5.2) allows for linear combinations of covariates, extending the search space for $\mathbf{a}$ to be all of $\mathbb{R}^p$. Such a procedure generates regions in $\mathbb{R}^p$ that are convex polytopes.
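To make the criterion in (1) concrete, the following sketch evaluates the decrease in SSE for a single candidate pair $(b, \mathbf{a})$ at a node; the function and variable names are ours, chosen for illustration, and do not come from any reference implementation.

```python
# Illustrative evaluation of the impurity decrease in (1) for one candidate
# oblique split (b, a) of a node.
import numpy as np

def sse_decrease(X_node, y_node, a, b, n_total):
    """Decrease in SSE from splitting the node along the hyperplane a^T x = b."""
    parent_sse = np.sum((y_node - y_node.mean()) ** 2)
    left = X_node @ a <= b                                  # indicator 1(a^T x_i <= b)
    child_sse = 0.0
    for mask in (left, ~left):
        if mask.any():
            child_sse += np.sum((y_node[mask] - y_node[mask].mean()) ** 2)
    return (parent_sse - child_sse) / n_total               # normalized by the full sample size n
```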
The solution of (1) yields estimates $(\hat{b}, \hat{\mathbf{a}})$, and the refinement of $\mathrm{t}$ produces child nodes $\mathrm{t}_L = \{\mathbf{x} \in \mathrm{t} : \hat{\mathbf{a}}^T\mathbf{x} \le \hat{b}\}$ and $\mathrm{t}_R = \{\mathbf{x} \in \mathrm{t} : \hat{\mathbf{a}}^T\mathbf{x} > \hat{b}\}$. These child nodes become new parent nodes at the next level of the tree and can be further refined in the same manner until a desired depth is reached. To obtain a maximal decision tree $T_K$ of depth $K$, the procedure is iterated $K$ times or until either (i) the node contains a single data point $(y_i, \mathbf{x}_i^T)$ or (ii) all input values $\mathbf{x}_i$ and/or all response values $y_i$ within the node are the same. The maximal decision tree with maximum depth is denoted by $T_{\max}$. An illustration of a maximal oblique decision tree with depth $K = 2$ is shown in Figure 1. For contrast, in Figure 2, we show a maximal axis-aligned decision tree with depth $K = 2$.
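The recursion just described can be summarized schematically as follows; the stopping rules are simplified and `best_split` stands in for whichever strategy is used to (approximately) maximize (1), so this is a sketch rather than the paper's implementation.

```python
# Schematic recursive construction of a maximal tree of depth K (simplified
# stopping rules; `best_split` abstracts the search over (b, a) in (1)).
import numpy as np

def grow(X_node, y_node, depth, K, best_split):
    # Stop at the target depth, at a single data point, or when responses are constant.
    if depth == K or len(y_node) <= 1 or np.ptp(y_node) == 0:
        value = y_node.mean() if len(y_node) else 0.0       # node output as in (2)
        return {"leaf": True, "value": value}
    a, b = best_split(X_node, y_node)                       # (approximately) maximize (1)
    left = X_node @ a <= b
    return {"leaf": False, "a": a, "b": b,
            "left": grow(X_node[left], y_node[left], depth + 1, K, best_split),
            "right": grow(X_node[~left], y_node[~left], depth + 1, K, best_split)}
```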
In a conventional regression problem, where the goal is to estimate the conditional mean response $\mu(\mathbf{x})$, the canonical tree output for $\mathbf{x} \in \mathrm{t}$ is $\bar{y}_{\mathrm{t}}$, i.e., if $T$ is a decision tree, then
$$\hat{\mu}(T)(\mathbf{x}) = \bar{y}_{\mathrm{t}} = \frac{1}{n(\mathrm{t})}\sum_{\mathbf{x}_i \in \mathrm{t}} y_i, \tag{2}$$
where $n(\mathrm{t})$ denotes the number of observations in the node $\mathrm{t}$. However, one can aggregate the data in each node in a number of ways, depending on the form of the target estimand. In the most general setting, under weak assumptions, all of our forthcoming theory holds when the node output is the result of a least squares projection onto the linear span of a finite dictionary $\mathcal{H}$ that includes the constant function (e.g., polynomials, splines), that is, $\hat{y}_{\mathrm{t}} \in \operatorname{argmin}_{h \in \operatorname{span}(\mathcal{H})} \sum_{\mathbf{x}_i \in \mathrm{t}} (y_i - h(\mathbf{x}_i))^2$.
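As an illustration of this more general node output, the sketch below projects a node's responses onto the span of a constant-plus-linear dictionary via least squares; the particular dictionary is only an example of a possible choice of $\mathcal{H}$.

```python
# Example of the generalized node output: least-squares projection of the node's
# responses onto span{1, x_1, ..., x_p} (this dictionary H is illustrative).
import numpy as np

def node_output(X_node, y_node):
    """Return a function x -> y_hat_t(x) fitted by least squares within the node."""
    H = np.hstack([np.ones((X_node.shape[0], 1)), X_node])  # constant + linear terms
    coef, *_ = np.linalg.lstsq(H, y_node, rcond=None)
    return lambda x: np.concatenate(([1.0], x)) @ coef
```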
Figure 1: A maximal oblique decision tree with depth $K = 2$ in $p = 2$ dimensions. Splits occur along hyperplanes of the form $a_1 x_1 + a_2 x_2 = b$.
Figure 2: A maximal axis-aligned decision tree with depth $K = 2$ in $p = 2$ dimensions. Splits occur along individual covariates of the form $x_j = b$ for $j = 1, 2$.
Figure 3: A single hidden layer neural network with $K$ hidden nodes.
One of the main practical issues with oblique CART is that the computational complexity of minimizing the squared error in (1) in each node is extremely demanding (in fact, it is NP-hard). For example, if we desire to split a node $\mathrm{t}$ with $n(\mathrm{t})$ observations for axis-aligned CART, an exhaustive search would require at most $p \cdot n(\mathrm{t})$ evaluations, whereas oblique CART would require a prodigious $2^p\binom{n(\mathrm{t})}{p}$ evaluations (Murthy et al., 1994).
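To give a sense of scale for these two counts, the small calculation below compares them for a node with, say, $n(\mathrm{t}) = 100$ and $p = 5$ (numbers chosen only for illustration).

```python
# Rough size of the two search spaces for a node with n(t) = 100 and p = 5.
from math import comb

n_t, p = 100, 5
axis_aligned = p * n_t                  # at most p * n(t) candidate axis-aligned splits
oblique = 2**p * comb(n_t, p)           # 2^p * C(n(t), p) distinct oblique splits
print(axis_aligned, oblique)            # 500 versus 2,409,200,640
```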
To deal with these computational demands, Breiman et al. (1984) first suggested a method for inducing oblique decision trees. They use a fully deterministic hill-climbing algorithm to search for the best oblique split. A backward feature elimination process is also carried out to delete irrelevant features from the split. Heath et al. (1993) propose a simulated annealing optimization algorithm, which uses randomization to search for the best split to potentially avoid getting stuck in a local optimum. Murthy et al. (1994) use a combination of deterministic hill-climbing and random perturbations in an attempt to find a good hyperplane. See Brodley and Utgoff (1995) for additional variations on these algorithms. Other works employ statistical techniques like linear discriminant analysis (LDA) (López-Chau et al., 2013; Li et al., 2003; Loh and Shih, 1997), principal components analysis (PCA) (Menze et al., 2011; Rodriguez et al., 2006), and random projections (Tomita et al., 2020).
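By way of illustration, the sketch below implements one simple randomized strategy in the spirit of the random-projection and perturbation heuristics cited above; it is not a faithful implementation of any specific published algorithm.

```python
# A toy randomized search for a good oblique split: draw random directions,
# try a few thresholds along each, and keep the split with the smallest child SSE
# (equivalently, the largest decrease in (1)). Purely illustrative.
import numpy as np

def random_oblique_split(X_node, y_node, n_directions=200, seed=0):
    rng = np.random.default_rng(seed)
    best_sse, best_a, best_b = np.inf, None, None
    for _ in range(n_directions):
        a = rng.normal(size=X_node.shape[1])                # random split direction
        a /= np.linalg.norm(a)
        proj = X_node @ a
        for b in np.quantile(proj, [0.25, 0.5, 0.75]):      # a few candidate thresholds
            left = proj <= b
            if left.all() or not left.any():
                continue                                    # skip degenerate splits
            sse = (np.sum((y_node[left] - y_node[left].mean()) ** 2)
                   + np.sum((y_node[~left] - y_node[~left].mean()) ** 2))
            if sse < best_sse:
                best_sse, best_a, best_b = sse, a, b
    return best_a, best_b
```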
While not the focus of the present paper, regarding non-greedy training, other researchers have attempted to find globally optimal tree solutions using linear programming (Bennett, 1994) or mixed-integer linear programming (Bertsimas and Dunn, 2017; Bertsimas et al., 2021). It should be clear that all of our results hold verbatim for optimal trees, as greedy implementations belong to the same feasible set. While usually better than greedy trees in terms of predictive performance, scalability to large data sets is the most salient obstacle with globally optimal trees. Moreover, on a qualitative level, a globally optimal tree arguably detracts from the interpretability, as humans, in contrast, often exhibit bounded rationality and therefore make decisions in a more sequential (rather than anticipatory) manner (Hüllermeier et al., 2021, and references therein). Relatedly, another training technique is based on constructing deep neural networks that realize oblique decision trees (Lee and Jaakkola, 2020; Yang et al., 2018) and then utilizing tools designed for training neural networks.
While there has been a plethora of greedy algorithms over the past 30 years for training oblique decision trees, the literature is essentially silent on their statistical properties. For instance, assuming one can come close to optimizing (1), what types of regression functions can greedy oblique trees estimate, and how well?
1.2 Ridge Expansions
Many empirical studies reveal that oblique trees generally produce smaller trees with better accuracy compared to axis-aligned trees (Heath et al., 1993; Murthy et al., 1994) and can often be comparable, in terms of performance, to neural networks (Bertsimas et al., 2018; Bertsimas and Dunn, 2019; Bertsimas and Stellato, 2021). Intuitively, allowing a tree-building system to use both oblique and axis-aligned splits broadens its flexibility. To theoretically showcase these qualities and make comparisons with other procedures (such as neural networks and projection pursuit regression), we will consider modeling $\mu$ with finite linear combinations of ridge functions, i.e., the library
$$\mathcal{G} = \bigg\{ g(\mathbf{x}) = \sum_{k=1}^{M} g_k(\mathbf{a}_k^T\mathbf{x}),\; \mathbf{a}_k \in \mathbb{R}^p,\; g_k : \mathbb{R} \to \mathbb{R},\; k = 1, \ldots, M,\; M \ge 1,\; \|g\|_{L_1} < \infty \bigg\},$$
where $\|\cdot\|_{L_1}$ is a total variation norm that is defined in Section 2.1. This library encompasses the functions produced from projection pursuit regression, and, more specifically—by taking $g_k(z) = \phi(z - b_k)$, where $\phi$ is a fixed activation function, such as a sigmoid function or ReLU, and $b_k \in \mathbb{R}$ is a bias parameter—single hidden layer feed-forward neural networks. A graphical representation of such a neural network is provided in Figure 3. A neural network forms predictions according to distributed hierarchical representations of the data, whereas a decision tree uses hierarchical decision rules (cf. Figures 1 and 2).
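To make the connection explicit, the toy example below builds a member of $\mathcal{G}$ with $M = 3$ ridge functions of the form $g_k(z) = \phi(z - b_k)$ with $\phi$ the ReLU, i.e., a single hidden layer network as in Figure 3; the weights and biases are arbitrary illustrative values.

```python
# A toy member of the library G: g(x) = sum_k phi(a_k^T x - b_k) with phi = ReLU,
# which is exactly a single-hidden-layer feed-forward network (weights are arbitrary).
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

A = np.array([[1.0, -0.5, 0.0],     # a_1
              [0.3,  0.3, 0.3],     # a_2
              [0.0,  1.0, -1.0]])   # a_3   (M = 3 ridge directions in R^p with p = 3)
b = np.array([0.0, 0.5, -0.2])      # bias parameters b_k

def g(x):
    """Evaluate g(x) = sum_{k=1}^{M} relu(a_k^T x - b_k)."""
    return float(np.sum(relu(A @ x - b)))
```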
Since the first version of our manuscript was released on arXiv, several subsequent papers have employed our novel theoretical and methodological statistical framework to derive consistency results for decision trees and related methods. For example, Zhan et al. (2023) applies our core ideas and proof techniques to deduce a consistency result for oblique decision trees in low-dimensional settings (cf. Corollary 2 below), but under stronger assumptions on the target function class $\mathcal{G}$ and without accounting for the underlying optimization constraints (cf. our novel optimization framework in Section 2.2). Raymaekers et al. (2023) also applies our core ideas and proof techniques to deduce a consistency result for axis-aligned decision trees within an alternative computation framework, but under stronger assumptions on the target function class $\mathcal{G}$. Finally, Parhi and Nowak (2023) and DeVore et al. (2023), among others (see their references), study consistency of deep