a neural network can be found for either fully connected or convolutional neural networks, which may include skip layers and normalizations as well. Besides the interpretability aspect, we show that the induced tree is also advantageous over the corresponding neural network in terms of computational complexity, at the expense of increased storage memory.
While writing this paper, we noticed that the following works overlap with ours [28], [3], [15], [21], especially for feedforward ReLU networks. We extend the findings in these works to any activation function and also to recurrent neural networks.
2. Decision Tree Analysis of Neural Networks
The derivations in this section will first be made for feedforward neural networks with piece-wise linear activation functions such as ReLU, Leaky ReLU, etc. Next, the analysis will be extended to any neural network with any activation function.
2.1. Fully Connected Networks
Let $W_i$ be the weight matrix of a network's $i$th layer. Let $\sigma$ be any piece-wise linear activation function, and let $x_0$ be the input to the neural network. Then, the output and an intermediate feature of a feedforward neural network can be represented as in Eq. 1.
$$NN(x_0) = W_{n-1}^T \sigma(W_{n-2}^T \sigma(\dots W_1^T \sigma(W_0^T x_0)))$$
$$x_i = \sigma(W_{i-1}^T \sigma(\dots W_1^T \sigma(W_0^T x_0))) \tag{1}$$
Note that in Eq. 1, we omit any final activation (e.g. softmax) and we ignore the bias term, as it can simply be included by concatenating a $1$ value to each $x_i$. The activation function $\sigma$ acts as an element-wise scalar multiplication, hence the following can be written.
$$W_i^T \sigma(W_{i-1}^T x_{i-1}) = W_i^T \left(a_{i-1} \odot (W_{i-1}^T x_{i-1})\right) \tag{2}$$
In Eq. 2, $a_{i-1}$ is a vector indicating the slopes of the activations in the corresponding linear regions that $W_{i-1}^T x_{i-1}$ falls into, and $\odot$ denotes element-wise multiplication. Note that $a_{i-1}$ can directly be interpreted as a categorization result, since it consists of indicators (slopes) of the linear regions of the activation function. Eq. 2 can be re-organized as follows.
$$W_i^T \sigma(W_{i-1}^T x_{i-1}) = (W_i \odot a_{i-1})^T\, W_{i-1}^T x_{i-1} \tag{3}$$
In Eq. 3, we use $\odot$ as a column-wise element-wise multiplication on $W_i$. This corresponds to element-wise multiplication by a matrix obtained by repeating the column vector $a_{i-1}$ to match the size of $W_i$. Using Eq. 3, Eq. 1 can be rewritten as follows.
$$NN(x_0) = (W_{n-1} \odot a_{n-2})^T (W_{n-2} \odot a_{n-3})^T \dots (W_1 \odot a_0)^T\, W_0^T x_0 \tag{4}$$
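As a quick sanity check on Eq. 4, the sketch below compares a plain ReLU forward pass with the gated-product form for a small random fully connected network. It is only an illustration: the layer sizes and the names `W`, `x0`, `a`, `pre_acts`, `out_forward` are arbitrary choices, not part of the paper; for ReLU, the slope vector $a_i$ is simply the 0/1 indicator of positive pre-activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-layer fully connected ReLU network (biases omitted, as in Eq. 1).
W = [rng.standard_normal((4, 5)),   # W_0
     rng.standard_normal((5, 3)),   # W_1
     rng.standard_normal((3, 2))]   # W_2
x0 = rng.standard_normal(4)

# Standard forward pass: NN(x0) = W_2^T ReLU(W_1^T ReLU(W_0^T x0)).
z, pre_acts = x0, []
for Wi in W[:-1]:
    pre_acts.append(Wi.T @ z)
    z = np.maximum(pre_acts[-1], 0.0)              # ReLU
out_forward = W[-1].T @ z

# Gated form of Eq. 4: fold the ReLU slopes a_i (0/1 here) into the next layer's
# weights, i.e. apply (W_{i+1} ⊙ a_i)^T instead of W_{i+1}^T after the activation.
a = [(p > 0).astype(x0.dtype) for p in pre_acts]
out_gated = W[0].T @ x0
for Wi, ai in zip(W[1:], a):
    out_gated = (Wi * ai[:, None]).T @ out_gated   # column-wise element-wise product

print(np.allclose(out_forward, out_gated))         # expected: True
```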
From Eq. 4, one can define an effective weight matrix $\hat{W}_i^T$ of a layer $i$ to be applied directly on the input $x_0$ as follows:
$${}^{c_{i-1}}\hat{W}_i^T = (W_i \odot a_{i-1})^T \dots (W_1 \odot a_0)^T\, W_0^T$$
$${}^{c_{i-1}}\hat{W}_i^T\, x_0 = W_i^T x_i \tag{5}$$
In Eq. 5, the categorization vector until layer $i$ is defined as follows: $c_{i-1} = a_0 \,\|\, a_1 \,\|\, \dots \,\|\, a_{i-1}$, where $\|$ is the concatenation operator.
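A minimal sketch of Eq. 5, reusing the hypothetical `W`, `a`, `x0`, `pre_acts` and `out_forward` names from the previous snippet, might build the effective matrices explicitly; the point it illustrates is that each effective matrix is fully determined by the earlier categorization vectors $a_0, \dots, a_{i-1}$ and acts directly on $x_0$.

```python
def effective_matrices(W, a):
    """Effective weight matrices of Eq. 5 for one input's categorizations a_0..a_{n-2}.

    Given layer weights W_0..W_{n-1} and the 0/1 ReLU slope vectors a_0..a_{n-2}
    observed for some input, returns matrices such that W_hat[i] @ x0 equals the
    pre-activation W_i^T x_i of layer i for any input with the same categorizations.
    """
    W_hat, acc = [W[0].T], W[0].T                  # layer 0 sees the raw input
    for Wi, ai in zip(W[1:], a):
        acc = (Wi * ai[:, None]).T @ acc           # fold (W_i ⊙ a_{i-1})^T into the product
        W_hat.append(acc)
    return W_hat

# Check against the forward pass above (numpy imported there): every layer's
# pre-activation, and the final output, is recovered by applying the effective
# matrix directly to x0.
W_hat = effective_matrices(W, a)
targets = pre_acts + [out_forward]
print(all(np.allclose(Wh @ x0, t) for Wh, t in zip(W_hat, targets)))   # expected: True
```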
One can directly observe from Eq. 5 that the effective matrix of layer $i$ depends only on the categorization vectors from the previous layers. This indicates that, in each layer, a new effective filter (to be applied directly on the network input) is selected based on the previous categorizations/decisions. This directly shows that a fully connected neural network can be represented as a single decision tree, where the effective matrices act as categorization rules. In each layer $i$, the response of the effective matrix ${}^{c_{i-1}}\hat{W}_i^T$ is categorized into an $a_i$ vector, and based on this categorization result, the next layer's effective matrix ${}^{c_i}\hat{W}_{i+1}^T$ is determined. A layer $i$ is thus represented as a $k^{m_i}$-way categorization, where $m_i$ is the number of filters in layer $i$ and $k$ is the total number of linear regions in the activation. The categorization in a layer $i$ can thus be represented by a tree of depth $m_i$, where a node at any depth is branched into $k$ categorizations.
In order to better illustrate the equivalent decision tree of a neural network, in Algorithm 1 we rewrite Eq. 5 for the entire network as an algorithm. For the sake of simplicity and without loss of generality, we provide the algorithm for the ReLU activation function, where $a \in \{0, 1\}$. It can be clearly observed that lines 5-9 in Algorithm 1 correspond to a node in the decision tree, where a simple yes/no decision is made.
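Since Algorithm 1 is not reproduced in this excerpt, the sketch below gives one possible reading of the same node-level decisions for the ReLU case: the sign of each filter response is a yes/no decision, and the accumulated decisions select the effective matrix used by the next layer. The function name and structure are illustrative assumptions, not the paper's exact pseudocode; it reuses the toy `W`, `x0` and `out_forward` from the earlier snippets.

```python
def tree_style_inference(W, x0):
    """Run a ReLU network as a chain of yes/no decisions (a ∈ {0, 1}).

    Returns the network output and the concatenated decision (categorization)
    vector identifying which leaf of the equivalent tree the input reaches.
    """
    acc = W[0].T                                   # effective matrix of layer 0
    decisions = []
    for Wi in W[1:]:
        response = acc @ x0                        # response of the current effective matrix
        ai = (response > 0).astype(x0.dtype)       # one yes/no decision per filter
        decisions.append(ai)
        acc = (Wi * ai[:, None]).T @ acc           # decisions pick the next effective matrix
    return acc @ x0, np.concatenate(decisions)

y, path = tree_style_inference(W, x0)
print(np.allclose(y, out_forward), len(path))      # expected: True, len(path) = sum of m_i
```

For the toy weights above, `len(path)` equals $d = \sum_{i=0}^{n-2} m_i$, and the 0/1 pattern in `path` identifies one of the $2^d$ leaves discussed below.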
The decision tree equivalent of a neural network can thus be constructed as in Algorithm 2. Using this algorithm, we share a tree representation obtained for a neural network with three layers, having 2, 1 and 1 filters in layers 1, 2 and 3, respectively. The network has ReLU activations between layers and no activation after the last layer. It can be observed from Algorithm 2 and Fig. 1 that the depth of a NN-equivalent tree is $d = \sum_{i=0}^{n-2} m_i$, and the total number of categories in the last branch is $2^d$. At first glance, the number of categories seems huge. For example, if the first layer of a neural network contains 64 filters, there would exist at least $2^{64}$ branches in a tree, which is already intractable. But, there