(A)
We assume that the average of the reciprocals of the non-zero bias updates in an iteration is bounded by a polynomial in the maximum number $N$ of neurons in a single layer. Moreover, every neuron with zero output (dead neuron) is assumed to remain so after the iteration. Both conditions together essentially require the gradient of every bias to be either zero or not too small for most neurons. We verify this assumption numerically in Section 4.
(B)
For all layers $j \in \mathbb{N}$, the derivative of each coordinate of the preactivations, i.e., $A_j x^{(j-1)} + b_j$, with respect to the input of the neural network is bounded by a uniform constant.
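To fix ideas, both assumptions can be sketched schematically as follows; the notation $\Delta b_i$ for the bias update of neuron $i$, the index set $S = \{i : \Delta b_i \neq 0\}$, the polynomial $p$, and the constant $C$ are introduced here for illustration only, and the precise formulations are given in Section 2:
\[
\text{(A)}\quad \frac{1}{|S|} \sum_{i \in S} \frac{1}{|\Delta b_i|} \;\le\; p(N),
\qquad\qquad
\text{(B)}\quad \sup_{x}\, \big\| \nabla_x \big(A_j x^{(j-1)} + b_j\big)_k \big\| \;\le\; C \quad \text{for all } j \text{ and } k,
\]
where $N$ is the maximal number of neurons in a single layer and $x$ denotes the input of the neural network.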
Under these assumptions, we prove Theorem 3.4, which can be intuitively phrased as follows: the expected number of affine pieces that the neural network has after a single perturbed gradient descent update is bounded from above by a quantity that is polynomial (in practice quadratic) in the number of neurons, linear in the number of layers, and inversely proportional to the level of perturbation.
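Read schematically, and writing $N$ for the number of neurons, $L$ for the number of layers, $\delta$ for the level of perturbation, and $C$ for a constant (all placeholders for illustration; the exact quantities and exponents are those stated in Theorem 3.4), the bound has the shape
\[
\mathbb{E}\big[\#\,\text{affine pieces after one perturbed update}\big] \;\le\; C\,\frac{N^{2} L}{\delta}.
\]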
The result is based on the following insight of [14]. Consider as a prototypical neuron a function of the form $\varrho(h-b)$, where $h$ is a piecewise affine function of bounded variation. Then, for $\varrho(h-b)$ to have more affine pieces than $h$, it is necessary that $h(x) = b$ for some $x$ in the domain of $h$. Intuitively, for many such $x$ to exist, $h-b$ has to change sign many times. If $b$ is a random variable, then, for the event $h(x) = b$ to happen with high probability, $h$ needs to repeatedly cross a substantial part of the region where $b$ has most of its mass. This, however, can only happen to a limited degree, since the variation of $h$ was assumed to be bounded. The resulting estimate of the expected number of newly generated affine pieces in one neuron under random biases is formalised in Lemma 3.1. Since the floating-point operations are assumed to introduce noise into the bias updates in a way quantified in Definition 2.3, and since Assumption (B) implies bounded variation of the outputs of each neuron, we deduce that with high probability the number of affine pieces in a neural network is bounded.
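As a toy numerical illustration of this crossing argument (our own sketch, not the construction behind Lemma 3.1), the following Python snippet builds a piecewise affine function $h$ of bounded variation, draws random biases $b$, and counts the sign changes of $h-b$ on a fine grid; each such sign change marks a level crossing at which $\varrho(h-b)$ can gain a breakpoint that $h$ does not have:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy piecewise affine h on [0, 1]: linear interpolation between a few
# random knot values, so h has bounded variation by construction.
knots_x = np.linspace(0.0, 1.0, 6)
knots_y = rng.uniform(-1.0, 1.0, size=knots_x.size)

def h(x):
    return np.interp(x, knots_x, knots_y)

def level_crossings(b, grid):
    """Count sign changes of h - b on the grid: the number of points where
    relu(h - b) can acquire a breakpoint that h does not have."""
    s = np.sign(h(grid) - b)
    s = s[s != 0]                      # ignore exact zeros on the grid
    return int(np.sum(s[:-1] != s[1:]))

grid = np.linspace(0.0, 1.0, 2001)

# Random biases: the wider their distribution relative to the range that h
# actually sweeps, the fewer crossings (new pieces) we expect on average.
biases = rng.normal(loc=0.0, scale=0.5, size=10_000)
counts = np.array([level_crossings(b, grid) for b in biases])
print("expected number of new breakpoints:", counts.mean())
```

Because $h$ has bounded variation, the number of level crossings, integrated over all levels $b$, is controlled by the total variation of $h$; if the distribution of $b$ is not overly concentrated, the expected count printed above therefore stays small, which is the phenomenon Lemma 3.1 quantifies.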
In a gradient descent iteration, the noise in each gradient update step also depends on the current empirical risk. Notably, if the neural network is initialised with zero training error, then no update happens. However, under standard assumptions on the dynamics of the loss, we can still prove that the number of affine pieces does not increase quickly during the iterations. This is the content of Theorem 3.6, which we state below in a simplified and informal version:
Theorem 1.2 (Informal version of Theorem 3.6). During gradient descent, where the gradient is computed by backpropagation in floating-point arithmetic, it holds for each iteration with probability $1/2$ that the number of affine pieces of a neural network along a given line grows at most polynomially (almost linearly) in the number of iterations, polynomially in the number of neurons, and linearly in the number of layers.
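For intuition about the quantity appearing in the theorem, the following sketch counts the affine pieces of a small ReLU network along a line by tracking changes of the ReLU activation pattern on a fine grid. The architecture, widths, and segment endpoints are illustrative choices of ours, and this is not the counting procedure used in the experiments of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# A small random ReLU network; the widths are illustrative only.
widths = [2, 16, 16, 1]
params = [(rng.standard_normal((m, n)) / np.sqrt(n), rng.standard_normal(m))
          for n, m in zip(widths[:-1], widths[1:])]

def activation_pattern(x):
    """Concatenated on/off pattern of all hidden ReLUs at input x."""
    pattern = []
    h = x
    for A, b in params[:-1]:           # the output layer is affine anyway
        h = A @ h + b
        pattern.append(h > 0)
        h = np.maximum(h, 0.0)
    return np.concatenate(pattern)

def pieces_along_line(p, q, samples=5001):
    """Count affine pieces on the segment from p to q by counting changes of
    the activation pattern on a fine grid (a lower bound that becomes exact
    as the grid is refined, barring degenerate configurations)."""
    ts = np.linspace(0.0, 1.0, samples)
    patterns = [activation_pattern((1.0 - t) * p + t * q) for t in ts]
    changes = sum(not np.array_equal(a, c) for a, c in zip(patterns, patterns[1:]))
    return changes + 1

p, q = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
print("affine pieces along the given line:", pieces_along_line(p, q))
```

Within each activation region the network is affine, so counting pattern changes along the segment recovers the number of affine pieces once the grid is fine enough to separate neighbouring region boundaries.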
Assumptions (A) and (B) will be discussed in detail after their formal statement in Section 2. Since
Assumption (A) relates to the distribution of updates, we numerically study in Section 4 whether it is satisfied in practice.
Finally, in Section 5, we numerically analyse the influence of the magnitude of round-off errors. We observe that, while lower numerical accuracy does suppress the generation of affine pieces, as claimed by our main theorem, the effect is not as pronounced as a direct application of the theorem might suggest. The reason is that, for the approximations involved, the number of affine pieces is already significantly lower than our upper bounds. Hence we conclude that numerical stability is one reason preventing the learning of functions with exponentially many affine pieces, but there may be many more reasons that keep the number of affine pieces from reaching the theoretically established threshold. If the number of affine pieces of the initial neural network is large, it decreases in the course of training; in accordance with our theoretical results, this reduction occurs faster for lower computational accuracies.
1.4 Related work
This work represents a counterpoint to many approximation-theoretic approaches used in the analysis of deep learning. Here, we study to what extent piecewise affine functions whose number of affine pieces is exponential in the number of layers can be learned by deep neural networks using gradient descent algorithms. Functions with a large number of affine pieces are necessary for the approximation of the square function [47], which forms the basis for many approximation results in the literature, e.g., [48, 35, 30, 37, 40, 11].