LARF: Two-level Attention-based Random Forests with a Mixture
of Contamination Models
Andrei V. Konstantinov and Lev V. Utkin
Peter the Great St.Petersburg Polytechnic University
St.Petersburg, Russia
e-mail: andrue.konst@gmail.com, lev.utkin@gmail.com
Abstract
New models of the attention-based random forests called LARF (Leaf Attention-based Random
Forest) are proposed. The first idea behind the models is to introduce a two-level attention, where
one of the levels is the “leaf” attention and the attention mechanism is applied to every leaf of trees.
The second level is the tree attention depending on the “leaf” attention. The second idea is to replace
the softmax operation in the attention with a weighted sum of softmax operations with different parameters. It is implemented by applying a mixture of Huber’s contamination models and can be regarded as an analog of the multi-head attention with “heads” defined by selecting a value of the softmax parameter. Attention parameters are trained by solving a standard quadratic optimization problem. To simplify tuning of the models, it is proposed to make the contamination parameters trainable as well and to compute them by solving a quadratic optimization problem. Many numerical experiments with real datasets are performed to study LARFs. The code of the proposed algorithms can be found at https://github.com/andruekonst/leaf-attention-forest.
Keywords: attention mechanism, random forest, Nadaraya-Watson regression, quadratic program-
ming, contamination model
1 Introduction
Several crucial improvements of neural networks have been made in recent years. One of them is the
attention mechanism which has played an important role in many machine learning areas, including the
natural language processing, computer vision, etc. [1, 2, 3, 4, 5]. The idea behind the attention mechanism is to assign weights to features or examples in accordance with their importance and their impact on the model predictions. At the same time, the success of attention models as components of neural networks motivates extending this approach to other machine learning models beyond neural networks, for example, to random forests (RFs) [6]. Following this idea, a new model called the attention-based random forest (ABRF), which incorporates the attention mechanism into ensemble-based models such as RFs and the gradient boosting machine [7, 8], has been developed [9, 10]. The ABRF models stem from the interpretation [1, 11] of the attention mechanism through the Nadaraya-Watson kernel
regression model [12, 13]. According to [9, 10], attention weights in the Nadaraya-Watson regression are
assigned to decision trees in a RF depending on examples which fall into leaves of trees. Weights in ABRF
have trainable parameters and use Huber’s $\epsilon$-contamination model [14] to define the attention weights such that each attention weight consists of two parts: the softmax operation with the coefficient $1-\epsilon$ and a trainable bias with the coefficient $\epsilon$. One of the improvements of ABRF, which has been proposed in [15], is based on jointly incorporating the self-attention and attention mechanisms into the RF. The proposed models outperform ABRF, but the improvement is not sufficient: for several datasets the models provided inferior results. Therefore, we propose a set of models which can be regarded
as extensions of ABRF and are based on two main ideas.
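In schematic form, consistent with the description above (the exact parameterization is given in [10]), the ABRF attention weight of the $k$-th tree can be written as
$$\alpha_k(\mathbf{x}) = (1-\epsilon)\,\sigma\!\left(\frac{s_k(\mathbf{x})}{\tau}\right) + \epsilon\, w_k,$$
where $s_k(\mathbf{x})$ is a tree-specific score depending on the examples which fall into the same leaf as $\mathbf{x}$, $\sigma$ is the softmax over trees, $\tau$ is the temperature, $w_k$ is the trainable bias, and $\epsilon \in [0,1]$ is the contamination parameter.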
The first idea is to introduce a two-level attention, where one of the levels is the “leaf” attention, i.e.,
the attention mechanism is applied to every leaf of trees. As a result, we get the attention weights assigned
to leaves and the attention weights assigned to trees. Moreover, the attention weights of trees depend on the
corresponding weights of leaves belonging to these trees. In other words, the attention at the second level
depends on the attention at the first level, i.e., we get the attention of the attention. Due to the “leaf”
attention, the proposed model will be abbreviated as LARF (Leaf Attention-based Random Forest).
One of the peculiarities of LARFs is the use of a mixture of Huber’s $\epsilon$-contamination models instead of a single contamination model, as has been done in ABRF. This peculiarity stems from the second idea behind the model: to take into account softmax operations with different parameters simultaneously. In fact, we replace the standard softmax operation with a weighted sum of softmax operations with different parameters. With this idea, we achieve two goals. First of all, we partially solve the problem of tuning the parameters of the softmax operations which are a part of the attention operations. Each value of the tuning parameter from a predefined set (a predefined grid) is used in a separate softmax operation. Then the weights of the softmax operations in the sum are trained jointly with the other parameters.
This approach can also be interpreted as a linear approximation of the softmax operation with trainable weights and different values of the tuning parameter. However, a more interesting goal is that some
analog of the multi-head attention [16] is implemented by using the mixture of contamination models
where “heads” are defined by selecting a value of the corresponding softmax operation parameter.
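Schematically, the resulting attention weight of the $k$-th tree becomes a mixture of such contamination terms over a grid of temperature values $\tau_1, \dots, \tau_M$ (a sketch consistent with the description above; the precise form is derived in Section 5):
$$\alpha_k(\mathbf{x}) = \sum_{m=1}^{M} c_m \left[(1-\epsilon)\,\sigma\!\left(\frac{s_k(\mathbf{x})}{\tau_m}\right) + \epsilon\, w_k\right], \qquad c_m \ge 0, \quad \sum_{m=1}^{M} c_m = 1,$$
where the mixture weights $c_m$ are trained jointly with the other parameters, and each temperature $\tau_m$ plays the role of a separate “head”.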
Additionally, in contrast to ABRF [10], where the contamination parameter $\epsilon$ of Huber’s model was a tuning parameter, the LARF model considers this parameter as a trainable one. This allows us to significantly reduce the model tuning time and to avoid enumerating the parameter values over a grid. The same is implemented for the mixture of Huber’s models.
Different configurations of LARF produce a set of models which depend on trainable parameters of the
two-level attention and its implementation.
We investigate two types of RFs in experiments: original RFs and Extremely Randomized Trees (ERT)
[17]. According to [17], the ERT algorithm chooses a split point randomly for each feature at each node
and then selects the best split among these.
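For reference, both base ensembles are available in scikit-learn. A minimal sketch of how the two forest types could be constructed is given below (the dataset and hyperparameters are illustrative; the actual LARF code in the repository may configure the forests differently):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# Toy regression data (a stand-in for the real datasets used in the experiments).
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

# Original RF: the best split is searched over a random subset of features.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# ERT: split thresholds are drawn at random for each candidate feature,
# and the best of these random splits is selected.
ert = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)

print(rf.predict(X[:3]), ert.predict(X[:3]))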
Our contributions can be summarized as follows:
1. New two-level attention-based RF models are proposed, where the attention mechanism at the first level is applied to every leaf of trees, and the attention at the second level incorporates the “leaf” attention and is applied to trees. Training of the two-level attention is reduced to solving a standard quadratic optimization problem.
2. A mixture of Huber’s $\epsilon$-contamination models is used to implement the attention mechanism at the
second level. The mixture allows us to replace a set of tuning attention parameters (the temperature
parameters of the softmax operations) with trainable parameters whose optimal values are computed
by solving the quadratic optimization problem. Moreover, this approach can be regarded as an analog
of the multi-head attention.
3. An approach is proposed to make the tuning contamination parameters (the $\epsilon$ parameters) in the mixture of the $\epsilon$-contamination models trainable. Their optimal values are also computed by solving the
quadratic optimization problem.
4. Many numerical experiments with real datasets are performed to study LARFs. They demonstrate that some modifications of LARF provide outperforming results. The code of the proposed algorithms can
be found in https://github.com/andruekonst/leaf-attention-forest.
The paper is organized as follows. Related work can be found in Section 2. A brief introduction to the
attention mechanism as the Nadaraya-Watson kernel regression is given in Section 3. A general approach
to incorporating the two-level attention mechanism into the RF is provided in Section 4. Ways to implement the two-level attention mechanism and to construct several attention-based models by using the mixture of Huber’s $\epsilon$-contamination models are considered in Section 5. Numerical experiments
with real data illustrating properties of the proposed models are provided in Section 6. Concluding remarks
can be found in Section 7.
2 Related work
Attention mechanism. Due to the great efficiency of machine learning models with attention mechanisms, interest in different attention-based models has increased significantly in recent years. As a
result, many attention models have been proposed to improve the performance of machine learning algo-
rithms. The most comprehensive analysis and description of various attention-based models can be found
in interesting surveys [1, 2, 3, 4, 5, 18].
It is important to note that parametric attention models as parts of neural networks are mainly trained by applying gradient-based algorithms, which leads to computational problems when training is carried out through the softmax function. Many approaches have been proposed to cope with this problem. A large part of the approaches is based on some kind of linear approximation of the softmax attention [19, 20, 21, 22]. Another part of the approaches is based on random feature methods to approximate the
softmax function [18, 23].
Another improvement of the attention-based models is to use the self-attention which was proposed in
[16] as a crucial component of neural networks called Transformers. The self-attention models have also been studied in surveys [4, 24, 25, 26, 27, 28]. This is only a small part of all works devoted to the attention
and self-attention mechanisms.
It should be noted that the aforementioned models are implemented as neural networks, and they have
not been studied for application to other machine learning models, for example, to RFs. Attempts to
incorporate the attention and self-attention mechanisms into the RF and the gradient boosting machine
were made in [9, 10, 15]. Following these works, we extend the proposed models to improve the attention-based models. Moreover, we propose attention models which do not use gradient-based algorithms for computing optimal attention parameters. The training process of the models is based on solving
standard quadratic optimization problems.
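As an illustration of what such a training problem looks like, the following minimal sketch solves a simplex-constrained quadratic program with cvxpy (the matrix P, the variable names and the objective are hypothetical; the actual LARF objective is derived in Sections 4 and 5):

import numpy as np
import cvxpy as cp

# Hypothetical quadratic objective: fit weighted per-example contributions to targets.
# P (n_examples x n_params) collects contributions of each trainable attention
# parameter; y contains the training outputs.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 10))
y = rng.normal(size=200)

w = cp.Variable(10)                     # trainable attention parameters
objective = cp.Minimize(cp.sum_squares(P @ w - y))
constraints = [w >= 0, cp.sum(w) == 1]  # parameters lie on the unit simplex
cp.Problem(objective, constraints).solve()

print(w.value)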
Weighted RFs. Many approaches have been proposed in recent years to improve RFs. One of the
important approaches is based on assignment of weights to decision trees in the RF. This approach is
implemented in various algorithms [29, 30, 31, 32, 33, 34, 35]. However, most of these algorithms have a disadvantage. The weights are assigned to trees independently of examples, i.e., each weight characterizes a tree on average over all training examples and does not take into account a particular feature vector. Moreover, the weights do not have trainable parameters, which usually make a model more flexible and accurate.
Contamination model in attention mechanisms. There are several models which use imprecise
probabilities in order to model the lack of sufficient training data. One of the first models is the so-
called Credal Decision Tree, which is based on applying the imprecise probability theory to classification and was proposed in [36]. Following this work, a number of models based on imprecise probabilities were presented in [37, 38, 39, 40], where the imprecise Dirichlet model is used. This model can be regarded as a reparametrization of the imprecise $\epsilon$-contamination model which is applied in LARF. The imprecise $\epsilon$-contamination model has also been applied to machine learning methods, for example, to the support vector machine [41] or to the RF [42]. The attention-based RF applying the imprecise $\epsilon$-contamination model to the parametric attention mechanism was proposed in [10, 15]. However, there are no other works which use imprecise models to implement the attention mechanism.
3 Nadaraya-Watson regression and the attention mechanism
A basis of the attention mechanism can be considered in the framework of the Nadaraya-Watson kernel regression model [12, 13], which estimates a function $f$ as a locally weighted average using a kernel as a weighting function. Suppose that the dataset is represented by $n$ examples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)$, where $\mathbf{x}_i = (x_{i1}, \dots, x_{im}) \in \mathbb{R}^m$ is a feature vector consisting of $m$ features, and $y_i \in \mathbb{R}$ is a regression output. The regression task is to construct a regressor $f: \mathbb{R}^m \to \mathbb{R}$ which can predict the output value $\tilde{y}$ of a new observation $\mathbf{x}$, using the dataset.
The Nadaraya-Watson kernel regression estimates the regression output $\tilde{y}$ corresponding to a new input feature vector $\mathbf{x}$ as follows [12, 13]:
$$\tilde{y} = \sum_{i=1}^{n} \alpha(\mathbf{x}, \mathbf{x}_i)\, y_i, \qquad (1)$$
where the weight $\alpha(\mathbf{x}, \mathbf{x}_i)$ conforms with the relevance of the feature vector $\mathbf{x}_i$ to the vector $\mathbf{x}$.
It can be seen from the above that the Nadaraya-Watson regression model estimates $\tilde{y}$ as a weighted sum of the training outputs $y_i$ from the dataset such that their weights depend on the location of $\mathbf{x}_i$ relative to $\mathbf{x}$. This means that the closer $\mathbf{x}_i$ is to $\mathbf{x}$, the greater the weight assigned to $y_i$.

According to the Nadaraya-Watson kernel regression [12, 13], the weights can be defined by means of a kernel $K$ as a function of the distance between the vectors $\mathbf{x}_i$ and $\mathbf{x}$. The kernel estimates how close $\mathbf{x}_i$ is to $\mathbf{x}$. Then the weight is written as follows:
$$\alpha(\mathbf{x}, \mathbf{x}_i) = \frac{K(\mathbf{x}, \mathbf{x}_i)}{\sum_{j=1}^{n} K(\mathbf{x}, \mathbf{x}_j)}. \qquad (2)$$
One of the popular kernels is the Gaussian kernel. It produces weights of the form:
$$\alpha(\mathbf{x}, \mathbf{x}_i) = \sigma\!\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{\tau}\right), \qquad (3)$$
where $\tau$ is a tuning (temperature) parameter and $\sigma(\cdot)$ denotes the softmax operation.
In terms of the attention mechanism [43], the vector $\mathbf{x}$, the vectors $\mathbf{x}_i$, the outputs $y_i$ and the weight $\alpha(\mathbf{x}, \mathbf{x}_i)$ are called the query, keys, values and the attention weight, respectively. The weights $\alpha(\mathbf{x}, \mathbf{x}_i)$ can be extended by incorporating trainable parameters. In particular, the parameter $\tau$ can also be regarded as a trainable parameter.

Many forms of parametric attention weights, which also define the attention mechanisms, have been proposed, for example, the additive attention [43] and the multiplicative or dot-product attention [16, 44]. We will also consider attention weights based on the Gaussian kernel, i.e., producing the softmax operation. However, the parametric forms of the attention weights will be quite different from many popular attention operations.
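As a small illustration of Eqs. (1)-(3), a minimal NumPy sketch of the Nadaraya-Watson estimator with softmax (Gaussian-kernel) weights is given below; the function names are illustrative and not part of the LARF code:

import numpy as np

def softmax(z):
    # Numerically stable softmax.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def nadaraya_watson(x, X_train, y_train, tau=1.0):
    # Weights are a softmax of negative squared distances scaled by tau, Eq. (3);
    # the prediction is the weighted sum of training outputs, Eq. (1).
    sq_dist = np.sum((X_train - x) ** 2, axis=1)
    alpha = softmax(-sq_dist / tau)
    return np.dot(alpha, y_train)

# Toy usage: predict at a query point from a few training examples.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
print(nadaraya_watson(np.array([0.9, 1.1]), X, y, tau=0.5))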
4 Two-level attention-based random forest
One of the powerful machine learning models handling tabular data is the RF, which can be regarded as an ensemble of $T$ decision trees such that each tree is trained on a subset of examples randomly selected from the training set. In the original RF, the final prediction $\tilde{y}$ for a testing example $\mathbf{x}$ is determined by averaging the predictions $\tilde{y}_1, \dots, \tilde{y}_T$ obtained for all trees.
Let $J_k(\mathbf{x})$ be the index set of examples which fall into the same leaf of the $k$-th tree as $\mathbf{x}$. One of the ways to construct the attention-based RF is to introduce the mean vector $A_k(\mathbf{x})$ defined as the mean of the training vectors $\mathbf{x}_j$ which fall into the same leaf as $\mathbf{x}$. However, this simple definition can be extended by incorporating the Nadaraya-Watson regression into the leaf. In this case, we can write
$$A_k(\mathbf{x}) = \sum_{j \in J_k(\mathbf{x})} \mu(\mathbf{x}, \mathbf{x}_j)\, \mathbf{x}_j, \qquad (4)$$
where $\mu(\mathbf{x}, \mathbf{x}_j)$ is the attention weight in accordance with the Nadaraya-Watson kernel regression.
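A minimal sketch of how $A_k(\mathbf{x})$ in Eq. (4) could be computed for one fitted scikit-learn tree, using softmax weights inside the leaf, is given below (the helper name and kernel parameters are illustrative; in practice the training subset of the particular tree would be used):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def leaf_attention_vector(tree, x, X_train, tau=1.0):
    # A_k(x) of Eq. (4): a softmax-weighted mean of the training vectors
    # that fall into the same leaf of the tree as the query x.
    leaf_x = tree.apply(x.reshape(1, -1))[0]   # leaf index of the query
    leaf_train = tree.apply(X_train)           # leaf index of each training example
    J = np.where(leaf_train == leaf_x)[0]      # index set J_k(x)
    sq_dist = np.sum((X_train[J] - x) ** 2, axis=1)
    w = np.exp(-sq_dist / tau)
    mu = w / w.sum()                           # Nadaraya-Watson weights in the leaf
    return mu @ X_train[J]                     # A_k(x)

# Toy usage with a single tree standing in for one member of the forest.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(leaf_attention_vector(tree, X[0], X, tau=0.5))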