
problems without further computation. In comparison, QSVMs would require one-against-all (OAA) or one-against-one (OAO) [22] strategies, which would respectively require constructing $|\mathcal{C}|$ and $|\mathcal{C}|(|\mathcal{C}|-1)/2$ separate QSVMs, where $\mathcal{C}$ is the set of classes in the problem.
The expressive advantage of the QRF is predicated on the distinguishability of data when mapped to a higher-dimensional feature space. This proposition is supported by Cover's theorem [23], which states that data is more likely to be linearly separable when projected into a sparsely populated higher-dimensional space. Cover's result is also a principal motivator for developing both quantum and classical kernel methods. However, it is often the case that generalisation suffers as a model becomes more expressive – a phenomenon commonly described by the bias-variance trade-off in statistical learning. To counter the expressivity of the (already quite expressive) quantum model, we make a randomised low-rank Nyström approximation [24] of the kernel matrix supplied to the SVM, limiting the ability of the model to overfit. We refer to this strategy as Nyström-QKE (NQKE). We show that the inclusion of this approximation further allows for a reduced circuit sampling complexity.
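To make the role of this approximation concrete, the sketch below illustrates a generic randomised Nyström approximation of a kernel matrix built from a small set of landmark columns. It is a minimal NumPy illustration, not the implementation used in this work: the function name, the random landmark-selection rule, and the classically computed RBF kernel are assumptions made for the example (in the NQKE setting, only the sampled kernel entries would be estimated from quantum circuit measurements).

```python
import numpy as np

def nystrom_approximation(K, num_landmarks, rng=None):
    """Low-rank Nystrom approximation of a symmetric kernel matrix K.

    K is the full N x N kernel matrix (shown here only for illustration;
    in practice only the sampled columns need to be estimated).
    Returns K_approx ~= C @ pinv(W) @ C.T of rank at most `num_landmarks`.
    """
    rng = np.random.default_rng(rng)
    n = K.shape[0]
    landmarks = rng.choice(n, size=num_landmarks, replace=False)
    C = K[:, landmarks]                   # N x L block of sampled columns
    W = K[np.ix_(landmarks, landmarks)]   # L x L landmark sub-matrix
    return C @ np.linalg.pinv(W) @ C.T

# Example: approximate a classical RBF kernel matrix with 16 landmarks.
X = np.random.default_rng(0).normal(size=(200, 4))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)
K_low_rank = nystrom_approximation(K, num_landmarks=16, rng=1)
```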
The paper is structured as follows: in Section II we address the exact form of the QRF model, before providing theoretical results in Section III. This is followed by numerical results and discussion in Section IV. Background on many of the ML techniques used is presented in Appendix A.
II. THE QUANTUM RANDOM FOREST MODEL
The Quantum Random Forest (QRF) is composed of a set of $T$ distinct quantum decision trees (QDTs), $\{Q_t\}_{t=1}^{T}$, which form the weak independent classifiers of the QRF ensemble. This is illustrated in Figure 1. As with all supervised ML methods, the QDTs are given the set of $N$ training instances, $\vec{x}_i^{\,\text{train}} \in \mathbb{R}^D$, and their associated $n_c$ class labels, $y_i^{\text{train}} \in \mathcal{C} = \{0, 1, \ldots, n_c - 1\}$, to learn from the annotated data set, $\mathcal{S} = \{(\vec{x}_i^{\,\text{train}}, y_i^{\text{train}})\}_{i=1}^{N}$
sampled from some underlying distribution $\mathcal{D}$. Each tree is trained using $N_p \leq N$ instances – a quantity we will refer to as the partition size – sampled from the original dataset $\mathcal{S}$. This acts both as a regularisation strategy and as a way to reduce the time complexity of the model. Such a technique is known as bagging [25], or bootstrap aggregation, and it greatly reduces the probability of overfitting.
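As a minimal illustration of this bagging step, the sketch below draws one subsample of $N_p$ instances per tree. The helper name and the choice of sampling without replacement are assumptions made for the example, not a specification of the QRF implementation.

```python
import numpy as np

def sample_partitions(X, y, num_trees, partition_size, rng=None):
    """Draw one subsample of size `partition_size` per tree (bagging).

    X: (N, D) array of training instances; y: (N,) array of class labels.
    Returns a list of (X_t, y_t) pairs, one per quantum decision tree.
    Sampling without replacement within each partition is an illustrative
    choice; sampling with replacement is also common in bagging.
    """
    rng = np.random.default_rng(rng)
    partitions = []
    for _ in range(num_trees):
        idx = rng.choice(len(X), size=partition_size, replace=False)
        partitions.append((X[idx], y[idx]))
    return partitions
```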
Once each classifier is trained, the model is tested with a separate set, $\mathcal{S}_{\text{test}}$, of size $N'$, from which we compare the predicted class distribution with the real one. To obtain the overall prediction of the QRF model, the predictions of each QDT are averaged across the ensemble. Specifically, each QDT returns a probability distribution, $Q_t(\vec{x}; c) = \Pr(c \mid \vec{x})$ for $c \in \mathcal{C}$, with the overall prediction given by the class with the greatest mean across the ensemble. Hence, given an input, $\vec{x}$, we have a prediction of the form,
$$\tilde{y}(\vec{x}) = \underset{c \in \mathcal{C}}{\arg\max} \left\{ \frac{1}{T} \sum_{t=1}^{T} Q_t(\vec{x}; c) \right\}. \qquad (1)$$
The accuracy of the model can be obtained by comparing predictions with the real labels, $\frac{1}{N'} \sum_{(\vec{x}, y) \in \mathcal{S}_{\text{test}}} \delta_{y, \tilde{y}(\vec{x})}$, where $\delta_{i,j}$ is the Kronecker delta.
The distinction between training and testing data is critical to ensure that the model does not overfit to the training dataset and hence generalises to previously unseen data points. We now elaborate on the structure of the QDT model.
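The sketch below shows how the ensemble prediction of Eq. (1) and the test accuracy above might be computed classically, assuming (purely for illustration) that each trained QDT is available as a callable returning the probability vector $Q_t(\vec{x}; c)$ over classes.

```python
import numpy as np

def qrf_predict(trees, x, num_classes):
    """Ensemble prediction of Eq. (1).

    `trees` is a list of callables Q_t(x), each returning a length-n_c
    array of class probabilities Pr(c | x); this interface is assumed
    for illustration. The predicted label is the class with the greatest
    mean probability across the ensemble.
    """
    probs = np.zeros(num_classes)
    for tree in trees:
        probs += np.asarray(tree(x))
    probs /= len(trees)
    return int(np.argmax(probs))

def test_accuracy(trees, test_set, num_classes):
    """Fraction of test instances (x, y) whose prediction matches y."""
    correct = sum(qrf_predict(trees, x, num_classes) == y
                  for x, y in test_set)
    return correct / len(test_set)
```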
A. A Quantum Decision Tree
The Quantum Decision Tree (QDT) proposed in this work is an independent classification algorithm that has a directed graph structure identical to that of a binary decision tree – illustrated in Figure 1 and elaborated further in Appendix A 1. Each vertex, also referred to as a node, can either be a split or leaf node, distinguished by colour in Figure 1. A leaf node determines the classification output of the QDT, while a split node separates inserted data points into two partitions that subsequently continue down the tree. The split effectiveness is measured by the reduction in entropy, referred to as the information gain (IG). Given a labelled data set, $\mathcal{S}$, that is split into partitions $\mathcal{S}_L$ and $\mathcal{S}_R$ by the split function, we have,
$$\mathrm{IG}(\mathcal{S}; \mathcal{S}_L, \mathcal{S}_R) = H(\mathcal{S}) - \sum_{i \in \{L, R\}} \frac{|\mathcal{S}_i|}{|\mathcal{S}|} H(\mathcal{S}_i), \qquad (2)$$
where $H$ is the entropy over the class distribution, defined as $H(\mathcal{S}) = -\sum_{c \in \mathcal{C}} \Pr(c) \log_2 \Pr(c)$, where class $c \in \mathcal{C}$ occurs with probability $\Pr(c)$ in the set $\mathcal{S}$. Clearly, the IG increases as the splitting more distinctly separates instances of different classes. Using information gain is especially advantageous when more than two classes are involved and a purely accuracy-based measure of split effectiveness fails.
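For concreteness, a minimal implementation of the entropy and information-gain expressions above is sketched below; the label arrays and function names are illustrative assumptions.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(S) over the empirical class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_labels, right_labels):
    """Information gain of Eq. (2) for a binary split S -> (S_L, S_R)."""
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part)
                   for part in (left_labels, right_labels))
    return entropy(labels) - weighted

# Example: a split that cleanly separates two classes attains the
# maximal information gain of 1 bit.
y = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(y, y[:3], y[3:]))  # -> 1.0
```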
Given a partitioned training set at the root node, the QDT carries out this splitting process down to the leaves of the tree, where a prediction of the class distribution is made from the proportions of the remaining data points in each class. Mathematically, with a subset of data points $\mathcal{S}^{(l)}$ supplied at training to leaf $\ell^{(l)}$, indexed by $l$, we set its prediction as the probability distribution,
$$\ell^{(l)}(\mathcal{S}^{(l)}; c) = \frac{1}{|\mathcal{S}^{(l)}|} \sum_{(\vec{x}, y) \in \mathcal{S}^{(l)}} [y = c], \qquad (3)$$
where $[p]$ is the Iverson bracket, which returns 1 when the proposition $p$ is true and 0 otherwise. When training the model, a node is defined as a leaf if any of the following conditions are met: (i) the training instances supplied to the
node are of a single class – in which case further splitting
is unnecessary, (ii) when the number of data points left