LARF: Two-level Attention-based Random Forests with a Mixture
of Contamination Models
Andrei V. Konstantinov and Lev V. Utkin
Peter the Great St.Petersburg Polytechnic University
St.Petersburg, Russia
e-mail: andrue.konst@gmail.com, lev.utkin@gmail.com
Abstract
New models of the attention-based random forests called LARF (Leaf Attention-based Random
Forest) are proposed. The first idea behind the models is to introduce a two-level attention, where
one of the levels is the “leaf” attention and the attention mechanism is applied to every leaf of trees.
The second level is the tree attention depending on the “leaf” attention. The second idea is to replace
the softmax operation in the attention with a weighted sum of softmax operations with different parameters. It is implemented by applying a mixture of Huber’s contamination models and can be regarded as an analog of the multi-head attention with “heads” defined by selecting a value of the softmax parameter. Attention parameters are trained by solving a standard quadratic optimization problem. To simplify tuning of the models, it is proposed to make the contamination parameters trainable as well and to compute them by solving a quadratic optimization problem. Many numerical experiments with real datasets are performed to study LARFs. The code of the proposed algorithms can be found at https://github.com/andruekonst/leaf-attention-forest.
Keywords: attention mechanism, random forest, Nadaraya-Watson regression, quadratic program-
ming, contamination model
1 Introduction
Several crucial improvements of neural networks have been made in recent years. One of them is the
attention mechanism which has played an important role in many machine learning areas, including the
natural language processing, computer vision, etc. [1, 2, 3, 4, 5]. The idea behind the attention mechanism is to assign weights to features or examples in accordance with their importance and their impact on the model predictions. At the same time, the success of attention models as components of neural networks motivates extending this approach to other machine learning models beyond neural networks, for example, to random forests (RFs) [6]. Following this idea, a new model called the attention-based random forest (ABRF), which incorporates the attention mechanism into ensemble-based models such as RFs and the gradient boosting machine [7, 8], has been developed [9, 10]. The ABRF models stem from the interpretation [1, 11] of the attention mechanism through the Nadaraya-Watson kernel
regression model [12, 13]. According to [9, 10], attention weights in the Nadaraya-Watson regression are
assigned to decision trees in a RF depending on examples which fall into leaves of trees. Weights in ABRF
have trainable parameters and use Huber’s $\epsilon$-contamination model [14] to define the attention weights such that each attention weight consists of two parts: the softmax operation with the coefficient $1-\epsilon$ and a trainable bias with the coefficient $\epsilon$. One of the improvements of ABRF, which has been proposed in [15], is based on jointly incorporating the self-attention and attention mechanisms into the RF. The proposed models outperform ABRF, but the improvement is not sufficient: for several datasets the models provided inferior results. Therefore, we propose a set of models which can be regarded
as extensions of ABRF and are based on two main ideas.
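In schematic form, consistent with the description above (the exact parameterization is given in [10]), the ABRF attention weight of the $k$-th tree can be written as
$$\alpha_k(\mathbf{x}) = (1-\epsilon)\,\sigma\!\left(\frac{s_k(\mathbf{x})}{\tau}\right) + \epsilon\, w_k,$$
where $s_k(\mathbf{x})$ is a tree-specific score depending on the examples which fall into the same leaf as $\mathbf{x}$, $\sigma$ is the softmax over trees, $\tau$ is the temperature, $w_k$ is the trainable bias, and $\epsilon \in [0,1]$ is the contamination parameter.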
The first idea is to introduce a two-level attention, where one of the levels is the “leaf” attention, i.e.,
the attention mechanism is applied to every leaf of trees. As a result, we get the attention weights assigned
to leaves and the attention weights assigned to trees. Moreover, the attention weights of trees depend on the
corresponding weights of leaves belonging to these trees. In other words, the attention at the second level
depends on the attention at the first level, i.e., we get the attention of the attention. Due to the “leaf”
attention, the proposed model will be abbreviated as LARF (Leaf Attention-based Random Forest).
One of the peculiarities of LARFs is the use of a mixture of Huber’s $\epsilon$-contamination models instead of a single contamination model, as has been done in ABRF. This peculiarity stems from the second idea behind the model: to take into account softmax operations with different parameters simultaneously. In fact, we replace the standard softmax operation with a weighted sum of softmax operations with different parameters. With this idea, we achieve two goals. First of all, we partially solve the problem of tuning the parameters of the softmax operations which are a part of the attention operations. Each value of the tuning parameter from a predefined set (a predefined grid) is used in a separate softmax operation. Then the weights of the softmax operations in the sum are trained jointly with the other parameters.
This approach can also be interpreted as a linear approximation of the softmax operation with trainable weights and different values of the tuning parameter. However, a more interesting goal is that some
analog of the multi-head attention [16] is implemented by using the mixture of contamination models
where “heads” are defined by selecting a value of the corresponding softmax operation parameter.
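Schematically, the resulting attention weight of the $k$-th tree becomes a mixture of such contamination terms over a grid of temperature values $\tau_1, \dots, \tau_M$ (a sketch consistent with the description above; the precise form is derived in Section 5):
$$\alpha_k(\mathbf{x}) = \sum_{m=1}^{M} c_m \left[(1-\epsilon)\,\sigma\!\left(\frac{s_k(\mathbf{x})}{\tau_m}\right) + \epsilon\, w_k\right], \qquad c_m \ge 0, \quad \sum_{m=1}^{M} c_m = 1,$$
where the mixture weights $c_m$ are trained jointly with the other parameters, and each temperature $\tau_m$ plays the role of a separate “head”.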
Additionally, in contrast to ABRF [10], where the contamination parameter $\epsilon$ of Huber’s model was a tuning parameter, the LARF model considers this parameter as a trainable one. This allows us to significantly reduce the model tuning time and to avoid enumerating the parameter values over a grid. The same is implemented for the mixture of Huber’s models.
Different configurations of LARF produce a set of models which depend on trainable parameters of the
two-level attention and its implementation.
We investigate two types of RFs in experiments: original RFs and Extremely Randomized Trees (ERT)
[17]. According to [17], the ERT algorithm chooses a split point randomly for each feature at each node
and then selects the best split among these.
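For reference, both base ensembles are available in scikit-learn. A minimal sketch of how the two forest types could be constructed is given below (the dataset and hyperparameters are illustrative; the actual LARF code in the repository may configure the forests differently):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# Toy regression data (a stand-in for the real datasets used in the experiments).
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

# Original RF: the best split is searched over a random subset of features.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# ERT: split thresholds are drawn at random for each candidate feature,
# and the best of these random splits is selected.
ert = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)

print(rf.predict(X[:3]), ert.predict(X[:3]))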
Our contributions can be summarized as follows:
1. New two-level attention-based RF models are proposed, where the attention mechanism at the first level is applied to every leaf of trees, and the attention at the second level incorporates the “leaf” attention and is applied to trees. Training of the two-level attention is reduced to solving a standard quadratic optimization problem.
2. A mixture of Huber’s $\epsilon$-contamination models is used to implement the attention mechanism at the
second level. The mixture allows us to replace a set of tuning attention parameters (the temperature
parameters of the softmax operations) with trainable parameters whose optimal values are computed
by solving the quadratic optimization problem. Moreover, this approach can be regarded as an analog
of the multi-head attention.
3. An approach is proposed to make the tuning contamination parameters (the $\epsilon$ parameters) in the mixture of the $\epsilon$-contamination models trainable. Their optimal values are also computed by solving the
quadratic optimization problem.
4. Many numerical experiments with real datasets are performed to study LARFs. They demonstrate that some modifications of LARF provide outperforming results. The code of the proposed algorithms can
be found in https://github.com/andruekonst/leaf-attention-forest.
The paper is organized as follows. Related work can be found in Section 2. A brief introduction to the
attention mechanism as the Nadaraya-Watson kernel regression is given in Section 3. A general approach
to incorporating the two-level attention mechanism into the RF is provided in Section 4. Ways to implement the two-level attention mechanism and to construct several attention-based models by using the mixture of Huber’s $\epsilon$-contamination models are considered in Section 5. Numerical experiments
with real data illustrating properties of the proposed models are provided in Section 6. Concluding remarks
can be found in Section 7.
2 Related work
Attention mechanism. Due to the great efficiency of machine learning models with attention mechanisms, interest in different attention-based models has increased significantly in recent years. As a
result, many attention models have been proposed to improve the performance of machine learning algo-
rithms. The most comprehensive analysis and description of various attention-based models can be found
in interesting surveys [1, 2, 3, 4, 5, 18].
It is important to note that parametric attention models as parts of neural networks are mainly trained by applying gradient-based algorithms, which leads to computational problems when training is carried out through the softmax function. Many approaches have been proposed to cope with this problem. A large part of the approaches is based on some kind of linear approximation of the softmax attention [19, 20, 21, 22]. Another part of the approaches is based on random feature methods to approximate the
softmax function [18, 23].
Another improvement of the attention-based models is to use the self-attention which was proposed in
[16] as a crucial component of neural networks called Transformers. The self-attention models have also been studied in surveys [4, 24, 25, 26, 27, 28]. This is only a small part of all works devoted to the attention
and self-attention mechanisms.
It should be noted that the aforementioned models are implemented as neural networks, and they have
not been studied for application to other machine learning models, for example, to RFs. Attempts to
incorporate the attention and self-attention mechanisms into the RF and the gradient boosting machine
were made in [9, 10, 15]. Following these works, we extend the proposed models to improve the attention-based models. Moreover, we propose attention models which do not use gradient-based algorithms for computing optimal attention parameters. The training process of the models is based on solving
standard quadratic optimization problems.
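As an illustration of what such a training problem looks like, the following minimal sketch solves a simplex-constrained quadratic program with cvxpy (the matrix P, the variable names and the objective are hypothetical; the actual LARF objective is derived in Sections 4 and 5):

import numpy as np
import cvxpy as cp

# Hypothetical quadratic objective: fit weighted per-example contributions to targets.
# P (n_examples x n_params) collects contributions of each trainable attention
# parameter; y contains the training outputs.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 10))
y = rng.normal(size=200)

w = cp.Variable(10)                     # trainable attention parameters
objective = cp.Minimize(cp.sum_squares(P @ w - y))
constraints = [w >= 0, cp.sum(w) == 1]  # parameters lie on the unit simplex
cp.Problem(objective, constraints).solve()

print(w.value)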
Weighted RFs. Many approaches have been proposed in recent years to improve RFs. One of the
important approaches is based on assignment of weights to decision trees in the RF. This approach is
implemented in various algorithms [29, 30, 31, 32, 33, 34, 35]. However, most of these algorithms have a disadvantage. The weights are assigned to trees independently of examples, i.e., each weight characterizes a tree on average over all training examples and does not take into account a particular feature vector. Moreover, the weights do not have trainable parameters, which usually make a model more flexible and accurate.
Contamination model in attention mechanisms. There are several models which use imprecise
probabilities in order to model the lack of sufficient training data. One of the first models is the so-
called Credal Decision Tree, which is based on applying the imprecise probability theory to classification and was proposed in [36]. Following this work, a number of models based on imprecise probabilities were presented in [37, 38, 39, 40], where the imprecise Dirichlet model is used. This model can be regarded as a reparametrization of the imprecise $\epsilon$-contamination model which is applied in LARF. The imprecise $\epsilon$-contamination model has also been applied to machine learning methods, for example, to the support vector machine [41] or to the RF [42]. The attention-based RF applying the imprecise $\epsilon$-contamination model to the parametric attention mechanism was proposed in [10, 15]. However, there are no other works which use imprecise models to implement the attention mechanism.
3 Nadaraya-Watson regression and the attention mechanism
A basis of the attention mechanism can be considered in the framework of the Nadaraya-Watson kernel regression model [12, 13], which estimates a function $f$ as a locally weighted average using a kernel as a weighting function. Suppose that the dataset is represented by $n$ examples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)$, where $\mathbf{x}_i = (x_{i1}, \dots, x_{im}) \in \mathbb{R}^m$ is a feature vector consisting of $m$ features, and $y_i \in \mathbb{R}$ is a regression output. The regression task is to construct a regressor $f: \mathbb{R}^m \to \mathbb{R}$ which can predict the output value $\tilde{y}$ of a new observation $\mathbf{x}$, using the dataset.
The Nadaraya-Watson kernel regression estimates the regression output $\tilde{y}$ corresponding to a new input feature vector $\mathbf{x}$ as follows [12, 13]:
$$\tilde{y} = \sum_{i=1}^{n} \alpha(\mathbf{x}, \mathbf{x}_i)\, y_i, \qquad (1)$$
where the weight $\alpha(\mathbf{x}, \mathbf{x}_i)$ conforms with the relevance of the feature vector $\mathbf{x}_i$ to the vector $\mathbf{x}$.
It can be seen from the above that the Nadaraya-Watson regression model estimates $\tilde{y}$ as a weighted sum of the training outputs $y_i$ from the dataset such that their weights depend on the location of $\mathbf{x}_i$ relative to $\mathbf{x}$. This means that the closer $\mathbf{x}_i$ is to $\mathbf{x}$, the greater the weight assigned to $y_i$.

According to the Nadaraya-Watson kernel regression [12, 13], the weights can be defined by means of a kernel $K$ as a function of the distance between the vectors $\mathbf{x}_i$ and $\mathbf{x}$. The kernel estimates how close $\mathbf{x}_i$ is to $\mathbf{x}$. Then the weight is written as follows:
$$\alpha(\mathbf{x}, \mathbf{x}_i) = \frac{K(\mathbf{x}, \mathbf{x}_i)}{\sum_{j=1}^{n} K(\mathbf{x}, \mathbf{x}_j)}. \qquad (2)$$
One of the popular kernels is the Gaussian kernel. It produces weights of the form:
$$\alpha(\mathbf{x}, \mathbf{x}_i) = \sigma\!\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{\tau}\right), \qquad (3)$$
where $\tau$ is a tuning (temperature) parameter and $\sigma(\cdot)$ denotes the softmax operation.
In terms of the attention mechanism [43], the vector $\mathbf{x}$, the vectors $\mathbf{x}_i$, the outputs $y_i$ and the weight $\alpha(\mathbf{x}, \mathbf{x}_i)$ are called the query, keys, values and the attention weight, respectively. The weights $\alpha(\mathbf{x}, \mathbf{x}_i)$ can be extended by incorporating trainable parameters. In particular, the parameter $\tau$ can also be regarded as a trainable parameter.

Many forms of parametric attention weights, which also define the attention mechanisms, have been proposed, for example, the additive attention [43] and the multiplicative or dot-product attention [16, 44]. We will also consider attention weights based on the Gaussian kernel, i.e., producing the softmax operation. However, the parametric forms of the attention weights will be quite different from many popular attention operations.
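As a small illustration of Eqs. (1)-(3), a minimal NumPy sketch of the Nadaraya-Watson estimator with softmax (Gaussian-kernel) weights is given below; the function names are illustrative and not part of the LARF code:

import numpy as np

def softmax(z):
    # Numerically stable softmax.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def nadaraya_watson(x, X_train, y_train, tau=1.0):
    # Weights are a softmax of negative squared distances scaled by tau, Eq. (3);
    # the prediction is the weighted sum of training outputs, Eq. (1).
    sq_dist = np.sum((X_train - x) ** 2, axis=1)
    alpha = softmax(-sq_dist / tau)
    return np.dot(alpha, y_train)

# Toy usage: predict at a query point from a few training examples.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
print(nadaraya_watson(np.array([0.9, 1.1]), X, y, tau=0.5))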
4 Two-level attention-based random forest
One of the powerful machine learning models handling tabular data is the RF, which can be regarded as an ensemble of $T$ decision trees such that each tree is trained on a subset of examples randomly selected from the training set. In the original RF, the final prediction $\tilde{y}$ for a testing example $\mathbf{x}$ is determined by averaging the predictions $\tilde{y}_1, \dots, \tilde{y}_T$ obtained for all trees.
Let $J_k(\mathbf{x})$ be the index set of examples which fall into the same leaf of the $k$-th tree as $\mathbf{x}$. One of the ways to construct the attention-based RF is to introduce the mean vector $A_k(\mathbf{x})$ defined as the mean of the training vectors $\mathbf{x}_j$ which fall into the same leaf as $\mathbf{x}$. However, this simple definition can be extended by incorporating the Nadaraya-Watson regression into the leaf. In this case, we can write
$$A_k(\mathbf{x}) = \sum_{j \in J_k(\mathbf{x})} \mu(\mathbf{x}, \mathbf{x}_j)\, \mathbf{x}_j, \qquad (4)$$
where $\mu(\mathbf{x}, \mathbf{x}_j)$ is the attention weight in accordance with the Nadaraya-Watson kernel regression.
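A minimal sketch of how $A_k(\mathbf{x})$ in Eq. (4) could be computed for one fitted scikit-learn tree, using softmax weights inside the leaf, is given below (the helper name and kernel parameters are illustrative; in practice the training subset of the particular tree would be used):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def leaf_attention_vector(tree, x, X_train, tau=1.0):
    # A_k(x) of Eq. (4): a softmax-weighted mean of the training vectors
    # that fall into the same leaf of the tree as the query x.
    leaf_x = tree.apply(x.reshape(1, -1))[0]   # leaf index of the query
    leaf_train = tree.apply(X_train)           # leaf index of each training example
    J = np.where(leaf_train == leaf_x)[0]      # index set J_k(x)
    sq_dist = np.sum((X_train[J] - x) ** 2, axis=1)
    w = np.exp(-sq_dist / tau)
    mu = w / w.sum()                           # Nadaraya-Watson weights in the leaf
    return mu @ X_train[J]                     # A_k(x)

# Toy usage with a single tree standing in for one member of the forest.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(leaf_attention_vector(tree, X[0], X, tau=0.5))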