Limitations of neural network training due to numerical instability
of backpropagation
Clemens Karner, Vladimir Kazeev, Philipp Christian Petersen
University of Vienna, Faculty of Mathematics, Kolingasse 14–16, 1090 Vienna, Austria
University of Vienna, Research Network Data Science, Kolingasse 14–16, 1090 Vienna, Austria
{clemens.karner,vladimir.kazeev,philipp.petersen}@univie.ac.at
VK was partially supported by the Austrian Science Fund (FWF) under project F65 “Taming Complexity in Partial Differential Systems”.
November 16, 2023
Abstract
We study the training of deep neural networks by gradient descent where floating-point arithmetic
is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate
that it is highly unlikely for gradient descent to find ReLU neural networks that maintain, in the course of
training, superlinearly many affine pieces with respect to their number of layers. Virtually all
approximation-theoretic arguments that yield high-order polynomial approximation rates rely on sequences
of ReLU neural networks whose numbers of affine pieces grow exponentially with their numbers of layers.
As a consequence, we conclude that approximating sequences of ReLU neural
networks resulting from gradient descent in practice differ substantially from theoretically constructed
sequences. The assumptions and the theoretical results are compared to a numerical study, which
yields concurring results.
Keywords: Round-off errors, deep neural networks, gradient descent, numerical stability.
Mathematics Subject Classification: 35A18, 65T60, 68T10.
1 Introduction
Deep learning is a machine learning technique based on artificial neural networks which are trained by
gradient-based methods and which have a large number of layers. This technique has been tremendously
successful in a wide range of applications [26, 24, 44, 41]. Of particular interest for applied mathematicians
are recent developments in which deep neural networks are applied to tasks of numerical analysis such
as the numerical solution of inverse problems [1, 34, 27, 20, 38] or of (parametric) partial differential
equations [7, 12, 39, 9, 40, 25, 29, 3].
The appeal of deep neural networks for these applications is due to their exceptional efficiency in
representing functions from several approximation classes that underlie well-established numerical methods.
In terms of approximation accuracy with respect to the number of approximation parameters, deep neural
networks have been theoretically proven to achieve approximation rates that are at least as good as
those of finite elements [15, 35, 30], local Taylor polynomials or splines [47, 11], wavelets [42] and, more
generally, affine systems [5].
In the sequel, we consider neural networks with the rectified-linear-unit (ReLU) activation function,
which is standard in most applications. In this case, the neural-network approximations are piecewise-
affine functions. We point out that all state-of-the-art results on the rates of approximation with deep
ReLU neural networks that achieve higher order polynomial approximation rates are based on explicit
constructions with the number of affine pieces growing exponentially with respect to the number of layers;
see, e.g., [47, 46].
In this work, we argue that this central building block, functions with exponentially many affine pieces,
cannot be learned with the state-of-the-art techniques. We show theoretically that training in floating-point
arithmetic is hindered by a bound on the number of affine pieces created during an iterative learning
process. This bound is polynomial with respect to the number of approximation parameters and also
with respect to the number of iterations. Notably, this non-computability of functions with exponentially
many affine pieces is not derived from an abstract result of computability theory, as are the related results
established in [4, 6]; instead, we identify a concrete reason why gradient descent based on backpropagation
cannot produce these functions in floating-point arithmetic. The effect of numerical instability is also
demonstrated in numerical examples. We stress that our results do not imply that theoretically derived
approximation rates cannot be realized with learned functions. Instead, we can merely conclude that the
approximating sequences of neural networks found in practice must be fundamentally different from the
theoretically derived ones. We will give a detailed description of our findings in Subsection 1.3.
This work is strongly influenced by [14], which shows that functions with exponentially many affine
pieces with respect to the number of layers typically do not appear in randomly initialised neural networks.
The ideas presented in that work form the basis of our analysis.
Before presenting our main results, we recapitulate the notion of floating-point arithmetic and furnish
an example illustrating how the finite precision of floating-point arithmetic can undermine learning due
to the phenomenon known in numerical analysis as catastrophic cancellation.
1.1 Floating-point arithmetic
Computations are performed almost exclusively in binary floating-point arithmetic, which consists in
restricting the arithmetic of real numbers to a discrete set of the form
    M = { ±2^e · Σ_{k=0}^{p} 2^{−k} c_k : c_0 = 1, c_1, ..., c_p ∈ {0, 1}, e ∈ {e_min, ..., e_max} },    (1.1)

extended with a few special elements, such as zero, infinity and NaN elements. Here, the radix of the
floating-point arithmetic is two, the parameter p ∈ N is the precision and e_min, e_max ∈ Z are the minimum
and maximum exponents. Naturally, M is not closed under basic arithmetic operations such as addition
and multiplication, which necessitates rounding. The round-to-nearest addition and multiplication are
defined so that

    |a +_M b − (a + b)| = min_{s ∈ M} |s − (a + b)|,    |a ×_M b − a·b| = min_{s ∈ M} |s − a·b|    (1.2)

for all a, b ∈ M. The precision p and the above rounding together define the so-called machine epsilon
ϵ = (1/2)·2^{1−p}, which is an accurate upper bound for the relative accuracy of the floating-point addition
and multiplication:

    |a +_M b − (a + b)| ≤ ϵ |a + b|    and    |a ×_M b − a·b| ≤ ϵ |a·b|    (1.3)

for all a, b ∈ M. For the double- and single-precision IEEE 754 standards, we have ϵ = 2^{−53} ≈ 1.11·10^{−16}
and ϵ = 2^{−24} ≈ 5.96·10^{−8}, respectively. Note that computation in floating-point arithmetic amounts to
introducing a relative error in each computation.
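The relative-error model (1.3) is easy to check directly on a computer. The following short script is our own
illustration (not part of the original text) using NumPy; all names and parameter choices are ours. It prints the
machine epsilons of double and single precision and verifies the bounds (1.3) empirically for random operands.

    import numpy as np

    # Machine epsilon as defined above: eps = (1/2) * 2^(1-p) = 2^(-p).
    # NumPy's finfo(...).eps is the spacing of floats near 1, i.e. 2^(1-p), so we halve it.
    for dtype, p in [(np.float64, 53), (np.float32, 24)]:
        eps = np.finfo(dtype).eps / 2
        print(f"{np.dtype(dtype).name}: eps = {eps} = 2^-{p}")

    # Empirical check of the bounds (1.3) for single precision,
    # with double precision standing in for exact arithmetic.
    rng = np.random.default_rng(0)
    a = rng.uniform(0.1, 10.0, 10_000).astype(np.float32)
    b = rng.uniform(0.1, 10.0, 10_000).astype(np.float32)
    eps32 = np.finfo(np.float32).eps / 2
    exact_sum = a.astype(np.float64) + b.astype(np.float64)
    exact_prod = a.astype(np.float64) * b.astype(np.float64)
    rel_add = np.abs((a + b).astype(np.float64) - exact_sum) / np.abs(exact_sum)
    rel_mul = np.abs((a * b).astype(np.float64) - exact_prod) / np.abs(exact_prod)
    print("max relative error of addition:      ", rel_add.max(), "within eps:", rel_add.max() <= eps32)
    print("max relative error of multiplication:", rel_mul.max(), "within eps:", rel_mul.max() <= eps32)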
1.2 Instability of deep neural networks due to catastrophic cancellation
The bounds (1.3) allow for floating-point operations to be modelled as perturbations of the respective
infinite-precision operations of the real arithmetic.
Consider a feed-forward ReLU neural network with L ∈ N layers, d = N_0 ∈ N real inputs and N_j ∈ N
neurons in each layer j ∈ {1, ..., L}. The evaluation ϕ : R^d → R^{N_L} of such a neural network at a point
x^{(0)} ∈ R^d consists of iteratively applying the transformations

    x^{(j)} = ϱ(A_j x^{(j−1)} + b_j)    (1.4)

sequentially for j = 1, ..., L, which produces the corresponding output ϕ(x^{(0)}) = x^{(L)} ∈ R^{N_L}. Here,
ϱ : R → R given by ϱ(x) = max{x, 0} for all x ∈ R is the ReLU activation function, which is applied
componentwise to the elements of R^{N_j} for each j ∈ {1, ..., L}, whereas A_j ∈ R^{N_j × N_{j−1}} and b_j ∈ R^{N_j} with
j ∈ {1, ..., L} are the weight matrices and bias vectors.
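Written out in code, the recursion (1.4) is only a few lines. The sketch below is our own illustration (not taken
from the paper) and follows (1.4) literally, i.e. it also applies the activation in the last layer; the function and
variable names are ours.

    import numpy as np

    def relu(x):
        # The ReLU activation rho(x) = max{x, 0}, applied componentwise.
        return np.maximum(x, 0.0)

    def evaluate(weights, biases, x):
        # Apply x^(j) = rho(A_j x^(j-1) + b_j) for j = 1, ..., L.
        # weights[j] has shape (N_j, N_{j-1}) and biases[j] has shape (N_j,).
        for A, b in zip(weights, biases):
            x = relu(A @ x + b)
        return x

    # A tiny example mapping R^3 -> R^2 -> R^1 with random weights.
    rng = np.random.default_rng(1)
    As = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
    bs = [rng.standard_normal(2), rng.standard_normal(1)]
    print(evaluate(As, bs, rng.standard_normal(3)))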
For simplicity, let us consider the case of b_j = 0 for all j ∈ {1, ..., L} and assume that x = x^{(0)} ∈ R^d
and A_1, ..., A_L are such that all entries of x^{(j)} are nonnegative for each j ∈ {1, ..., L}. Then

    ϕ(x) = A_L ··· A_1 x,    (1.5)

i.e., the evaluation of the network reduces to the multiplication of the input by a matrix product consisting
of L factors.
The instability of such products for large L was demonstrated in the context of tensor decompositions
in [2, Example 3]: the first component of the tensor considered therein yields the matrix product (1.5)
with N_0 = N_L = 1, N_1 = ··· = N_{L−1} = 2 and with weight matrices

    A_1 = ((1 + ε_1)a, a)^T,    A_L = ((1 + ε_L)(1 + a^{−L})a, −a),    A_j = diag((1 + ε_j)a, a)    (1.6)

for each j = 2, ..., L − 1, where a > 1 is a fixed parameter and ε_1, ..., ε_L ≥ 0 are perturbation
parameters. For all j ∈ {1, ..., L}, the case of ε_1 = ··· = ε_{j−1} = ε_{j+1} = ··· = ε_L = 0 and ε_j ≥ 0 gives
x^{(L)}(ε) = ϕ(x^{(0)}) = (1 + ε_j (a^L + 1)) x^{(0)} for any x^{(0)} ∈ R^d. Considering ε as a perturbation parameter,
which may be at the level of the machine epsilon ϵ, we observe that a relative perturbation of magnitude
ε in the j-th weight matrix leads to the error

    x^{(L)}(ε) − x^{(L)}(0) = ε (a^L + 1) x^{(L)}(0),

which implies the total loss of accuracy for any ε > 0 whenever a^L ≥ ε^{−1}. This effect is an immediate
consequence of the subtraction encoded by A_L, in which the two terms may be perturbed individually
(just as the first is perturbed by a multiplicative factor of 1 + ε and the second is not perturbed at all in
the above example). This is an archetypal example of the so-called catastrophic cancellation; see, e.g., [17,
Section 1.7].
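The collapse described above can be reproduced in a few lines. The sketch below is our own illustration and
uses the unperturbed matrices of (1.6) as reconstructed above (ε_1 = ··· = ε_L = 0, a = 2, x = 1); the exact value
of the product (1.5) is then 1, but once a^{−L} drops below the machine epsilon of the working precision, the
computed value collapses to exactly 0. The chosen values of a and L are illustrative only.

    import numpy as np

    def product_network(a, L, dtype):
        # Evaluate A_L * diag(a, a)^(L-2) * A_1 * x for x = 1 with all eps_j = 0;
        # the exact result is 1 for every a > 1 and L >= 3.
        a = dtype(a)
        y = np.array([a, a], dtype=dtype)                      # A_1 x with x = 1
        for _ in range(L - 2):                                 # middle layers diag(a, a)
            y = a * y
        A_L = np.array([(dtype(1) + a ** (-dtype(L))) * a, -a], dtype=dtype)
        return A_L @ y

    for L in (20, 30, 60):
        print(f"L = {L:2d}: float32 -> {product_network(2.0, L, np.float32):.6f}, "
              f"float64 -> {product_network(2.0, L, np.float64):.6f}")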
This example, based on inaccurate matrix multiplication, can be generalised into the following
statement for deep neural networks. The precise statement is given as Proposition A.3 in the Appendix.
Proposition 1.1. Let L ∈ N and N ∈ N. Let a = 2^r ≥ 1 for r ∈ Z and let M be as in (1.1), with
e_min ≤ 0, e_max > 53 + r, p = 53 and hence ϵ = 2^{−53}. If

    (L − 3) log_2(N − 1) + (L − 1) log_2(a^2) ≥ 1 − log_2(ϵ) = 54,

then there is a neural network Φ_{a,N,L} with L layers, N neurons per layer, and all weights bounded in
absolute value below by 1 and above by a such that

    |Φ̃_{a,N,L}(x) − Φ_{a,N,L}(x)| = |Φ_{a,N,L}(x)| = |x|

for all x ≥ 0, where Φ̃_{a,N,L}(x) denotes the evaluation of the neural network Φ_{a,N,L} in floating-point
arithmetic with the machine epsilon ϵ.
In the finite-precision setting, we observe that the evaluation of very small neural networks can
already lead to a relative error of 1 compared to the neural network with arbitrary precision. Admissible
parameters such that the conclusion of Proposition 1.1 holds are, for example, N = 65, a = 10, and L = 8.
Proposition 1.1 considers only the forward application of a neural network. However, the construction
of this neural network is based on the example (1.6) and is such that the neural network is entirely linear.
By the backpropagation rule, recalled in Definition 2.3, the reversal of the neural network constructed in
Proposition 1.1 yields a neural network such that the accurate computation of the gradient of the output
of the neural network with respect to the weights is infeasible in floating-point arithmetic.
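The same cancellation affects the reverse pass. Backpropagation accumulates products of the layer Jacobians in
reverse order, and for the two-neuron example (1.6) the derivative of the output with respect to the input is
exactly such a product: its exact value is 1, but the floating-point evaluation collapses to 0 in the same
parameter regime as before. The sketch below is again our own illustration of that effect (not the construction
of Proposition A.3); all names and parameters are ours.

    import numpy as np

    def input_gradient_backprop(a, L, dtype):
        # d(phi)/dx accumulated in reverse (backpropagation) order; the exact value is 1.
        a = dtype(a)
        v = np.array([(dtype(1) + a ** (-dtype(L))) * a, -a], dtype=dtype)  # Jacobian of the last layer
        for _ in range(L - 2):                                              # middle Jacobians diag(a, a)
            v = a * v
        A_1 = np.array([a, a], dtype=dtype)                                 # Jacobian of the first layer
        return v @ A_1

    for L in (20, 30, 60):
        print(f"L = {L:2d}: float32 -> {input_gradient_backprop(2.0, L, np.float32):.6f}, "
              f"float64 -> {input_gradient_backprop(2.0, L, np.float64):.6f}")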
1.3 Contribution
In this work, we analyse the effect of inaccurate computations stemming from floating-point arithmetic
in a more moderate regime than that of Proposition 1.1. In contrast to the error amplification that was
the basis of the example (1.6), we assume in the following that the errors stem mostly from the accumulation
of many inaccurate operations. We assume that, in the computation of the gradient of a neural network,
in each layer, a small and controlled relative error will appear. Under this assumption, we observe that
neural networks trained with gradient descent will, with high probability, not exhibit exponentially many
affine pieces with respect to the number of layers.
We describe a framework for learning in floating-point arithmetic in Section 2. There, we define a
gradient descent procedure where the gradient is computed with noisy updates.
To facilitate our main results, we make two assumptions on the gradient descent procedure (Assump-
tions 2.6 and 2.7), which can be intuitively formulated as follows.
(A) We assume that the average of the reciprocals of the non-zero bias updates in an iteration is bounded
by a polynomial in the maximum number N of neurons in a single layer. Moreover, every neuron
with zero output (dead neuron) is assumed to remain so after the iteration. Both assumptions
together essentially require the gradient of every bias to be either zero or not too small for most
neurons. We verify this assumption numerically in Section 4; a minimal numerical sketch of the
quantity in question is also given after item (B) below.
(B) For all layers j ∈ {1, ..., L}, the derivative of each coordinate of the preactivations, i.e., of A_j x^{(j−1)} + b_j,
with respect to the input of the neural network is bounded by a uniform constant.
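Assumption (A) refers to quantities that are easy to monitor in an experiment. The sketch below is our own
illustration (the paper's formal definitions follow in Section 2 and its numerical verification is in Section 4; the
network, loss and all parameter choices here are ours): it performs one backpropagation pass for a small ReLU
network under a least-squares loss and reports the average of the reciprocals of the non-zero bias-gradient
magnitudes, which, up to the learning rate and the round-off perturbation, is the quantity that Assumption (A)
requires to be bounded polynomially in the width N.

    import numpy as np

    rng = np.random.default_rng(0)
    d, N, L = 4, 32, 3                                  # input dimension, width, number of layers
    sizes = [d] + [N] * (L - 1) + [1]
    As = [rng.standard_normal((m, n)) / np.sqrt(n) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [0.1 * rng.standard_normal(m) for m in sizes[1:]]

    X = rng.standard_normal((256, d))                   # a batch of inputs
    Y = np.sin(X[:, :1])                                # some target values

    # Forward pass, keeping the preactivations for the backward pass.
    acts, pre = [X], []
    for A, b in zip(As, bs):
        z = acts[-1] @ A.T + b
        pre.append(z)
        acts.append(np.maximum(z, 0.0))

    # Backward pass for the loss 0.5 * mean((output - Y)^2); collect the bias gradients.
    delta = (acts[-1] - Y) / len(X)
    bias_grads = []
    for j in reversed(range(L)):
        delta = delta * (pre[j] > 0)                    # derivative of the ReLU
        bias_grads.append(delta.sum(axis=0))
        delta = delta @ As[j]
    g = np.abs(np.concatenate(bias_grads))

    # The quantity appearing in Assumption (A): average reciprocal of the non-zero bias gradients.
    print("average reciprocal of non-zero bias gradients:", np.mean(1.0 / g[g > 0]))
    print("reference polynomial scale N^2 =", N * N)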
Under these assumptions, we prove Theorem 3.4, which can be intuitively phrased as follows: the
expected number of affine pieces that the neural network has after a single perturbed gradient descent
update has an upper bound that is polynomial (in practice quadratic) in the number of neurons, linear in
the number of layers and inversely proportional to the level of perturbation.
The result is based on the following insight of [14]. Consider as a prototypical neuron a function of
the form ϱ(h − b), where h is a piecewise affine function of bounded variation. Then, for ϱ(h − b) to have
more affine pieces than h it is necessary that h(x) = b for some x in the domain of h. Intuitively, for
many such x to exist, h − b has to change sign many times. If b is a random variable, then, for the event
h(x) = b to happen with high probability, h needs to repeatedly cross a substantial part of the region
where b has most of its mass. This, however, can only happen to some degree since the variation of h
was assumed to be bounded. The resulting estimate of the expected number of newly generated affine pieces
in one neuron under random biases is formalised in Lemma 3.1. Since the floating-point operations are assumed
to introduce noise into the bias updates in a way quantified in Definition 2.3, and since Assumption
(B) implies bounded variation of the outputs of each neuron, we deduce that with high probability the
number of affine pieces in a neural network is bounded.
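The quantity at the centre of this argument, the number of affine pieces of the realisation along a line, can also
be estimated numerically. The following crude sketch is our own illustration (the paper's exact piece count is
fixed in Definition 2.2, which is not reproduced here, and all names and parameters below are ours): it samples
the network densely on a segment and counts the points at which the local slope of the scalar output changes.

    import numpy as np

    def count_pieces_along_line(As, bs, x0, x1, samples=200_000, tol=1e-8):
        # Crude estimate of the number of affine pieces of the realisation on the segment [x0, x1]:
        # count slope changes of the output on a fine grid. A fine enough grid resolves the
        # breakpoints, but the result remains an estimate, not an exact count.
        t = np.linspace(0.0, 1.0, samples)
        X = x0[None, :] + t[:, None] * (x1 - x0)[None, :]
        out = X
        for A, b in zip(As, bs):
            out = np.maximum(out @ A.T + b, 0.0)
        slopes = np.diff(out[:, 0]) / np.diff(t)
        jumps = np.abs(np.diff(slopes)) > tol * (1.0 + np.abs(slopes[:-1]))
        # A breakpoint inside a grid cell shows up as up to two consecutive slope changes,
        # so merge consecutive flagged cells into a single breakpoint.
        breaks = int(jumps[0]) + int(np.sum(jumps[1:] & ~jumps[:-1]))
        return 1 + breaks

    rng = np.random.default_rng(0)
    d, N, L = 2, 16, 4
    sizes = [d] + [N] * (L - 1) + [1]
    As = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [rng.standard_normal(m) for m in sizes[1:]]
    print(count_pieces_along_line(As, bs, np.zeros(2), 10.0 * np.ones(2)))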
In a gradient descent iteration, the noise in each gradient update step also depends on the current
empirical risk. Notably, if the neural network is initialised with zero training error, no update happens.
However, under standard assumptions on the dynamics of the loss, we can still prove that the number of
affine pieces during the iteration does not increase quickly. This is the content of Theorem 3.6, which we
state below in a simplified and informal version:
Theorem 1.2 (Informal version of Theorem 3.6). During gradient descent, where the gradient is computed
by backpropagation in floating-point arithmetic, it holds for each iteration with probability 1/2 that the
number of affine pieces of a neural network along a given line grows at most polynomially (almost linearly)
with respect to the number of iterations, polynomially in the number of neurons, and linearly in the number
of layers.
Assumptions (A) and (B) will be discussed in detail after their formal statement in Section 2. Since
Assumption (A) relates to the distribution of updates, we numerically study whether it is satisfied in practice in
Section 4.
Finally, in Section 5, we numerically analyse the influence of the magnitude of round-off errors. We
observe that, while lower numerical accuracy influences the generation of affine pieces negatively, as
claimed by our main theorem, the effect is not as pronounced as the direct application of the theorem
may suggest. The reason is that, for the approximations involved, the number of affine pieces is already
significantly lower than our upper bounds. Hence we conclude that numerical instability is one reason that
prevents the learning of functions with exponentially many affine pieces, but there may be many more
such reasons that prevent the number of affine pieces from reaching the theoretically established threshold.
If the number of affine pieces is large for the initial neural network, it reduces in the course of training; in
accordance with our theoretical results, that reduction occurs faster for lower computational accuracies.
1.4 Related work
This work represents a counter-point to many approximation theoretical approaches used in the
analysis of deep learning. Here, we study to what extent piecewise-affine functions with a number of affine
pieces that is exponential with respect to the number of layers can be learned by deep neural networks
using gradient descent algorithms. Functions with high numbers of affine pieces are necessary in the
approximation of the square function [47], which forms the basis for many approximation results in the
literature, e.g., [48, 35, 30, 37, 40, 11].
The primary motivation for the approach that we follow is taken from [14]. There it is shown that
randomly initialised neural networks typically have few affine pieces. The main idea of that work is to
show that to generate many affine pieces, the bias vectors in neural networks need to be very accurately
chosen. This is unlikely to be achieved if the biases are initialised randomly. This very idea is also the
basis of our analysis. We cannot, however, simply apply the main results of [14] since analysing the
whole iterative procedure of gradient descent requires keeping track of the interdependence of the random
variables modelling the weights and biases of the network and estimating their variance throughout the
process.
Floating-point arithmetic and its effect on iterated computations have been studied in the literature
before. The example given by (1.5)–(1.6) is derived from the study of the stability of long tensor
factorisations in [2]. The explicit constructions of low-rank tensor approximations presented in [23, 21,
31, 22] all, depending on the specific implementation, may suffer from an instability similar to that
demonstrated in (1.5)–(1.6).
In [28], an empirical study is presented that finds that the accuracy affects the overall performance of a
neural network classifier. In this context, an effect of the number of layers is also measured, showing that
deeper neural networks are more sensitive to low accuracies. Another analysis is given in [45], where it is
shown that numerical inaccuracies introduce stochasticity into the trajectories of trained neural networks,
which is similar to noise in stochastic gradient descent. In the context of fixed-point arithmetic, [10] showed
that in many classification problems, one can restrict the computation accuracy significantly without
harming the classification accuracy. There is also a vast body of literature studying how neural networks
can be efficiently quantised to speed up computations or minimise storage requirements [49, 13, 19, 18].
Typically it is found that moderate quantisation or specifically designed quantisation does not harm
classification accuracies. To the best of our knowledge, the effect on the generation of many affine pieces
has not been studied.
There have also been more abstract studies showing that on digital hardware or in finite precision
computations, there are learning scenarios that cannot be solved to any reasonable accuracy; see [6, 4].
2 Framework
In this section, we introduce the framework for this article. We start by formally introducing neural
networks in Definition 2.1. Thereafter, we fix the notion of number of affine pieces of a neural network in
Definition 2.2. To study the effect of numerical accuracy on the convergence and performance of gradient
descent based learning, we define the perturbed gradient descent update in Definition 2.3 and the full
perturbed gradient descent iteration in Definition 2.5.
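To fix ideas before the formal definitions, the kind of update we have in mind can be sketched as follows. The
snippet is our own schematic paraphrase, not the paper's definition: the precise perturbed update and its noise
model are the content of Definitions 2.3 and 2.5 and are not reproduced here. Each exactly computed gradient
entry is replaced by a relatively perturbed one, with the perturbation level u playing the role of the
accumulated round-off error of backpropagation.

    import numpy as np

    def perturbed_gradient_step(param, grad, lr, u, rng):
        # Schematic noisy update: each gradient entry is scaled by (1 + delta)
        # with |delta| <= u, mimicking a relative round-off perturbation.
        delta = rng.uniform(-u, u, size=grad.shape)
        return param - lr * grad * (1.0 + delta)

    rng = np.random.default_rng(0)
    b = np.zeros(8)                      # a bias vector
    g = rng.standard_normal(8)           # its (exactly computed) gradient
    print(perturbed_gradient_step(b, g, lr=0.1, u=1e-7, rng=rng))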
To facilitate the results in the sequel, we make Assumptions 2.6 and 2.7 on the gradient descent
dynamics. Assumption 2.6 requires the absolute value of the derivative of the objective function with
respect to each of the biases to either vanish or be bounded from below. In the event that the derivative
vanishes, we assume that the associated neuron is dead, i.e., its output is constant on the whole domain.
Assumption 2.7 stipulates that the output of each neuron should have a bounded derivative with respect
to the input variable. We will discuss the sensibility of these assumptions at the end of this section.
We start by defining a neural network. Here we focus only on the most widely used activation function,
the ReLU, ϱ(x) := max{0, x}, for x ∈ R.
Definition 2.1 ([37, 35]). Let d, L ∈ N. A neural network (NN) with input dimension d and L layers is
a sequence of matrix-vector tuples

    Φ := ((A_1, b_1), (A_2, b_2), ..., (A_L, b_L)),

where N_0 := d and N_1, ..., N_L ∈ N, and where A_j ∈ R^{N_j × N_{j−1}} and b_j ∈ R^{N_j} for j = 1, ..., L. The number
N_L is referred to as the output dimension.
For j ∈ {1, ..., L − 1}, the subnetwork from layer 1 to layer j of Φ is the sequence of matrix-vector
tuples

    Φ_j := ((A_1, b_1), (A_2, b_2), ..., (A_j, b_j)).

For a NN Φ and a domain Ω ⊆ R^d, we define the associated realisation of the NN Φ as

    R(Φ) : Ω → R^{N_L} : x ↦ x^{(L)} =: R(Φ)(x),