the mixture of Huber's ε-contamination models is considered in Section 5. Numerical experiments
with real data illustrating properties of the proposed models are provided in Section 6. Concluding remarks
can be found in Section 7.
2 Related work
Attention mechanism. Due to the great efficiency of machine learning models with attention mech-
anisms, interest in attention-based models has increased significantly in recent years. As a
result, many attention models have been proposed to improve the performance of machine learning algo-
rithms. Comprehensive analyses and descriptions of various attention-based models can be found
in the surveys [1, 2, 3, 4, 5, 18].
It is important to note that parametric attention models, as parts of neural networks, are mainly trained
by applying gradient-based algorithms, which leads to computational problems when training is carried
out through the softmax function. Many approaches have been proposed to cope with this problem.
A large part of them is based on some kind of linear approximation of the softmax attention
[19, 20, 21, 22]. Another part of the approaches is based on random feature methods to approximate the
softmax function [18, 23].
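To make the distinction concrete, the following sketch contrasts standard softmax attention with a kernelized linear approximation in the spirit of the cited works. The feature map phi is an illustrative choice, not the specific approximation proposed in [19, 20, 21, 22] or the random-feature construction of [18, 23].

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard softmax attention: cost is quadratic in the number of keys.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized (linear) attention: replace exp(q k^T) with phi(q) phi(k)^T,
    # so the key-value summary phi(K)^T V is computed once, giving linear cost.
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                       # (d, d_v) summary of keys and values
    normalizer = Qp @ Kp.sum(axis=0)    # per-query normalizing constants
    return (Qp @ kv) / normalizer[:, None]
```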
Another improvement of attention-based models is self-attention, which was proposed in
[16] as a crucial component of neural networks called Transformers. Self-attention models have
also been studied in the surveys [4, 24, 25, 26, 27, 28]. These works constitute only a small part of those
devoted to the attention and self-attention mechanisms.
It should be noted that the aforementioned models are implemented as neural networks, and they have
not been studied for application to other machine learning models, for example, to RFs. Attempts to
incorporate the attention and self-attention mechanisms into the RF and the gradient boosting machine
were made in [9, 10, 15]. Following these works, we extend the models proposed there in order to improve
attention-based models. Moreover, we propose attention models that do not use gradient-based algorithms
for computing optimal attention parameters; the training process of these models is based on solving
standard quadratic optimization problems, as sketched below.
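As an illustration of the kind of optimization involved, the sketch below solves a quadratic program over the unit simplex. The matrix P, the targets y, and the squared-error objective are assumed for illustration only; they are not the specific formulation derived later in the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: P stacks candidate predictions column-wise, y holds the
# targets; we search for attention weights w on the unit simplex.
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 20))
y = rng.normal(size=100)

objective = lambda w: np.sum((P @ w - y) ** 2)          # quadratic in w
simplex = {"type": "eq", "fun": lambda w: w.sum() - 1.0}
res = minimize(objective, np.full(20, 1 / 20), method="SLSQP",
               bounds=[(0.0, None)] * 20, constraints=[simplex])
w_opt = res.x  # optimal weights from a standard constrained QP solver
```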
Weighted RFs. Many approaches have been proposed in recent years to improve RFs. One of the
important approaches is based on assigning weights to the decision trees in the RF. This approach is
implemented in various algorithms [29, 30, 31, 32, 33, 34, 35]. However, most of these algorithms have a
common disadvantage: the weights are assigned to trees independently of the examples, i.e., each weight
characterizes a tree on average over all training examples and does not take a particular feature vector into
account. Moreover, the weights have no trainable parameters, which usually make a model more flexible
and accurate. A minimal sketch of such a scheme is given below.
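In the sketch, the per-tree weights are fitted by least squares on a validation split; this fitting rule is an assumption for illustration, as the cited algorithms obtain their weights in different ways. Note that the resulting weight vector w is the same for every feature vector, which is exactly the limitation discussed above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=1.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Per-tree predictions on the validation set, shape (n_val, n_trees).
P = np.column_stack([tree.predict(X_val) for tree in rf.estimators_])

# One global weight per tree; the weights ignore the individual feature vector.
w, *_ = np.linalg.lstsq(P, y_val, rcond=None)

def weighted_rf_predict(X_new):
    return np.column_stack([tree.predict(X_new) for tree in rf.estimators_]) @ w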
Contamination model in attention mechanisms. There are several models which use imprecise
probabilities in order to model the lack of sufficient training data. One of the first is the so-
called Credal Decision Tree, which applies imprecise probability theory to classification
and was proposed in [36]. Following this work, a number of models based on imprecise probabilities were
presented in [37, 38, 39, 40], where the imprecise Dirichlet model is used. This model can be regarded
as a reparametrization of the imprecise ε-contamination model which is applied to LARF. The imprecise
ε-contamination model has also been applied to machine learning methods, for example, to the support
vector machine [41] or to the RF [42]. The attention-based RF applying the imprecise ε-contamination
model to the parametric attention mechanism was proposed in [10, 15]. However, to the best of our
knowledge, there are no other works which use imprecise models in order to implement the attention mechanism.
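For reference, Huber's ε-contamination model defines a set of probability distributions around an elicited distribution P. The sketch below states the standard definition together with one way it can parameterize attention weights; this parameterization is an illustrative reading, with τ a hypothetical softmax temperature, and is not claimed to be the exact construction of [10, 15].

```latex
% Huber's epsilon-contamination model: all mixtures of an elicited
% distribution P with an arbitrary distribution Q, for a fixed epsilon.
\[
  \mathcal{M}(\epsilon, P) = \{(1-\epsilon)P + \epsilon Q : Q \in \mathcal{Q}\},
  \qquad \epsilon \in [0, 1].
\]
% Applied to attention, the k-th weight can be taken as a mixture of a fixed
% softmax weight and a trainable probability w_k in the unit simplex, so the
% weights depend linearly on the trainable parameters w:
\[
  \alpha_k(\mathbf{x}, \mathbf{w}) = (1-\epsilon)\,
  \mathrm{softmax}\!\left(-\tfrac{\|\mathbf{x}-\mathbf{x}_k\|^2}{\tau}\right)
  + \epsilon\, w_k, \qquad w_k \ge 0, \quad \sum_{k} w_k = 1.
\]
```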
3 Nadaraya-Watson regression and the attention mechanism
A basis of the attention mechanism can be considered in the framework of the Nadaraya-Watson kernel
regression model [12, 13], which estimates a function f as a locally weighted average using a kernel as