Limitations of neural network training due to numerical instability
of backpropagation
Clemens Karner, Vladimir Kazeev, Philipp Christian Petersen
University of Vienna, Faculty of Mathematics, Kolingasse 14–16, 1090 Vienna, Austria
University of Vienna, Research Network Data Science, Kolingasse 14–16, 1090 Vienna, Austria
{clemens.karner,vladimir.kazeev,philipp.petersen}@univie.ac.at
VK was partially supported by the Austrian Science Fund (FWF) under project F65 “Taming Complexity in Partial Differential Systems”.
November 16, 2023
Abstract
We study the training of deep neural networks by gradient descent where floating-point arithmetic
is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate
that it is highly unlikely for gradient descent to find ReLU neural networks that maintain, in the course of
training, superlinearly many affine pieces with respect to their number of layers. Virtually all
approximation-theoretic arguments that yield high-order polynomial approximation rates rely on sequences
of ReLU neural networks whose numbers of affine pieces grow exponentially with their numbers of layers.
As a consequence, we conclude that approximating sequences of ReLU neural
networks resulting from gradient descent in practice differ substantially from theoretically constructed
sequences. The assumptions and the theoretical results are compared to a numerical study, which
yields concurring results.
Keywords: Round-off errors, deep neural networks, gradient descent, numerical stability.
Mathematics Subject Classification: 35A18, 65T60, 68T10.
1 Introduction
Deep learning is a machine learning technique based on artificial neural networks which are trained by
gradient-based methods and which have a large number of layers. This technique has been tremendously
successful in a wide range of applications [26, 24, 44, 41]. Of particular interest for applied mathematicians
are recent developments in which deep neural networks are applied to tasks of numerical analysis such
as the numerical solution of inverse problems [1, 34, 27, 20, 38] or of (parametric) partial differential
equations [7, 12, 39, 9, 40, 25, 29, 3].
The appeal of deep neural networks for these applications is due to their exceptional efficiency in
representing functions from several approximation classes that underlie well-established numerical methods.
In terms of approximation accuracy with respect to the number of approximation parameters, deep neural
networks have been theoretically proven to achieve approximation rates that are at least as good as
those of finite elements [15, 35, 30], local Taylor polynomials or splines [47, 11], wavelets [42] and, more
generally, affine systems [5].
In the sequel, we consider neural networks with the rectified-linear-unit (ReLU) activation function,
which is standard in most applications. In this case, the neural-network approximations are piecewise-
affine functions. We point out that all state-of-the-art results on the rates of approximation with deep
ReLU neural networks that achieve higher order polynomial approximation rates are based on explicit
constructions with the number of affine pieces growing exponentially with respect to the number of layers;
see, e.g., [47, 46].
In this work, we argue that this central building block, functions with exponentially many affine pieces,
cannot be learned with the state-of-the-art techniques. We show theoretically that training in floating-point
arithmetic is hindered by a bound on the number of affine pieces created during an iterative learning
process. This bound is polynomial with respect to the number of approximation parameters and also
with respect to the number of iterations. Notably, this non-computability of functions with exponentially
many affine pieces is not derived from an abstract result of computability theory, as are the related results
established in [4, 6]; instead, we identify a concrete reason why gradient descent based on backpropagation
cannot produce these functions in floating-point arithmetic. The effect of numerical instability is also
demonstrated in numerical examples. We stress that our results do not imply that theoretically derived
approximation rates cannot be realized with learned functions. Instead, we can merely conclude that the
approximating sequences of neural networks found in practice must be fundamentally different from the
theoretically derived ones. We will give a detailed description of our findings in Subsection 1.3.
This work is strongly influenced by [14], which shows that functions with exponentially many affine
pieces with respect to the number of layers typically do not appear in randomly initialised neural networks.
The ideas presented in that work form the basis of our analysis.
Before presenting our main results, we recapitulate the notion of floating-point arithmetic and furnish
an example illustrating how the finite precision of floating-point arithmetic can undermine learning due
to the phenomenon known in numerical analysis as catastrophic cancellation.
1.1 Floating-point arithmetic
Computations are performed almost exclusively in binary floating-point arithmetic, which consists in
restricting the arithmetic of real numbers to a discrete set of the form
    M = { ±2^e · Σ_{k=0}^{p} 2^{−k} c_k : c_0 = 1, c_1, ..., c_p ∈ {0, 1}, e ∈ {e_min, ..., e_max} },    (1.1)

extended with a few special elements, such as zero, infinity and NaN elements. Here, the radix of the
floating-point arithmetic is two, the parameter p ∈ N is the precision and e_min, e_max ∈ Z are the minimum
and maximum exponents. Naturally, M is not closed under basic arithmetic operations such as addition
and multiplication, which necessitates rounding. The round-to-nearest addition and multiplication are
defined so that

    |a +_M b − (a + b)| = min_{s ∈ M} |s − (a + b)|,    |a ×_M b − a·b| = min_{s ∈ M} |s − a·b|    (1.2)

for all a, b ∈ M. The precision p and the above rounding together define the so-called machine epsilon
ϵ = (1/2)·2^{1−p}, which is an accurate upper bound for the relative accuracy of the floating-point addition
and multiplication:

    |a +_M b − (a + b)| ≤ ϵ |a + b|    and    |a ×_M b − a·b| ≤ ϵ |a·b|    (1.3)

for all a, b ∈ M. For the double- and single-precision IEEE 754 standards, we have ϵ = 2^{−53} ≈ 1.11·10^{−16}
and ϵ = 2^{−24} ≈ 5.96·10^{−8}, respectively. Note that computation in floating-point arithmetic amounts to
introducing a relative error in each computation.
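The relative-error model (1.3) is easy to check directly on a computer. The following short script is our own
illustration (not part of the original text) using NumPy; all names and parameter choices are ours. It prints the
machine epsilons of double and single precision and verifies the bounds (1.3) empirically for random operands.

    import numpy as np

    # Machine epsilon as defined above: eps = (1/2) * 2^(1-p) = 2^(-p).
    # NumPy's finfo(...).eps is the spacing of floats near 1, i.e. 2^(1-p), so we halve it.
    for dtype, p in [(np.float64, 53), (np.float32, 24)]:
        eps = np.finfo(dtype).eps / 2
        print(f"{np.dtype(dtype).name}: eps = {eps} = 2^-{p}")

    # Empirical check of the bounds (1.3) for single precision,
    # with double precision standing in for exact arithmetic.
    rng = np.random.default_rng(0)
    a = rng.uniform(0.1, 10.0, 10_000).astype(np.float32)
    b = rng.uniform(0.1, 10.0, 10_000).astype(np.float32)
    eps32 = np.finfo(np.float32).eps / 2
    exact_sum = a.astype(np.float64) + b.astype(np.float64)
    exact_prod = a.astype(np.float64) * b.astype(np.float64)
    rel_add = np.abs((a + b).astype(np.float64) - exact_sum) / np.abs(exact_sum)
    rel_mul = np.abs((a * b).astype(np.float64) - exact_prod) / np.abs(exact_prod)
    print("max relative error of addition:      ", rel_add.max(), "within eps:", rel_add.max() <= eps32)
    print("max relative error of multiplication:", rel_mul.max(), "within eps:", rel_mul.max() <= eps32)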
1.2 Instability of deep neural networks due to catastrophic cancellation
The bounds (1.3) allow for floating-point operations to be modelled as perturbations of the respective
infinite-precision operations of the real arithmetic.
Consider a feed-forward ReLU neural network with L ∈ N layers, d = N_0 ∈ N real inputs and N_j ∈ N
neurons in each layer j ∈ {1, ..., L}. The evaluation ϕ : R^d → R^{N_L} of such a neural network at a point
x^{(0)} ∈ R^d consists of iteratively applying the transformations

    x^{(j)} = ϱ(A_j x^{(j−1)} + b_j)    (1.4)

sequentially for j = 1, ..., L, which produces the corresponding output ϕ(x^{(0)}) = x^{(L)} ∈ R^{N_L}. Here,
ϱ : R → R given by ϱ(x) = max{x, 0} for all x ∈ R is the ReLU activation function, which is applied
componentwise to the elements of R^{N_j} for each j ∈ {1, ..., L}, whereas A_j ∈ R^{N_j × N_{j−1}} and b_j ∈ R^{N_j} with
j ∈ {1, ..., L} are the weight matrices and bias vectors.
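Written out in code, the recursion (1.4) is only a few lines. The sketch below is our own illustration (not taken
from the paper) and follows (1.4) literally, i.e. it also applies the activation in the last layer; the function and
variable names are ours.

    import numpy as np

    def relu(x):
        # The ReLU activation rho(x) = max{x, 0}, applied componentwise.
        return np.maximum(x, 0.0)

    def evaluate(weights, biases, x):
        # Apply x^(j) = rho(A_j x^(j-1) + b_j) for j = 1, ..., L.
        # weights[j] has shape (N_j, N_{j-1}) and biases[j] has shape (N_j,).
        for A, b in zip(weights, biases):
            x = relu(A @ x + b)
        return x

    # A tiny example mapping R^3 -> R^2 -> R^1 with random weights.
    rng = np.random.default_rng(1)
    As = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
    bs = [rng.standard_normal(2), rng.standard_normal(1)]
    print(evaluate(As, bs, rng.standard_normal(3)))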
For simplicity, let us consider the case of b_j = 0 for all j ∈ {1, ..., L} and assume that x = x^{(0)} ∈ R^d
and A_1, ..., A_L are such that all entries of x^{(j)} are nonnegative for each j ∈ {1, ..., L}. Then

    ϕ(x) = A_L ··· A_1 x,    (1.5)

i.e., the evaluation of the network reduces to the multiplication of the input by a matrix product consisting
of L factors.
The instability of such products for large L was demonstrated in the context of tensor decompositions
in [2, Example 3]: the first component of the tensor considered therein yields the matrix product (1.5)
with N_0 = N_L = 1, N_1 = ··· = N_{L−1} = 2 and with weight matrices

    A_1 = ((1 + ε_1)a, a)^T,    A_L = ((1 + ε_L)(1 + a^{−L})a, −a),    A_j = diag((1 + ε_j)a, a)    (1.6)

for each j = 2, ..., L − 1, where a > 1 is a fixed parameter and ε_1, ..., ε_L ≥ 0 are perturbation
parameters. For all j ∈ {1, ..., L}, the case of ε_1 = ··· = ε_{j−1} = ε_{j+1} = ··· = ε_L = 0 and ε_j ≥ 0 gives
x^{(L)}(ε) = ϕ(x^{(0)}) = (1 + ε_j (a^L + 1)) x^{(0)} for any x^{(0)} ∈ R^d. Considering ε as a perturbation parameter,
which may be at the level of the machine epsilon ϵ, we observe that a relative perturbation of magnitude
ε in the j-th weight matrix leads to the error

    x^{(L)}(ε) − x^{(L)}(0) = ε (a^L + 1) x^{(L)}(0),

which implies the total loss of accuracy for any ε > 0 whenever a^L ≥ ε^{−1}. This effect is an immediate
consequence of the subtraction encoded by A_L, in which the two terms may be perturbed individually
(just as the first is perturbed by a multiplicative factor of 1 + ε and the second is not perturbed at all in
the above example). This is an archetypal example of the so-called catastrophic cancellation; see, e.g., [17,
Section 1.7].
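The collapse described above can be reproduced in a few lines. The sketch below is our own illustration and
uses the unperturbed matrices of (1.6) as reconstructed above (ε_1 = ··· = ε_L = 0, a = 2, x = 1); the exact value
of the product (1.5) is then 1, but once a^{−L} drops below the machine epsilon of the working precision, the
computed value collapses to exactly 0. The chosen values of a and L are illustrative only.

    import numpy as np

    def product_network(a, L, dtype):
        # Evaluate A_L * diag(a, a)^(L-2) * A_1 * x for x = 1 with all eps_j = 0;
        # the exact result is 1 for every a > 1 and L >= 3.
        a = dtype(a)
        y = np.array([a, a], dtype=dtype)                      # A_1 x with x = 1
        for _ in range(L - 2):                                 # middle layers diag(a, a)
            y = a * y
        A_L = np.array([(dtype(1) + a ** (-dtype(L))) * a, -a], dtype=dtype)
        return A_L @ y

    for L in (20, 30, 60):
        print(f"L = {L:2d}: float32 -> {product_network(2.0, L, np.float32):.6f}, "
              f"float64 -> {product_network(2.0, L, np.float64):.6f}")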
This example, based on inaccurate matrix multiplication, can be generalised into the following
statement for deep neural networks. The precise statement is given as Proposition A.3 in the Appendix.
Proposition 1.1. Let L ∈ N and N ∈ N. Let a = 2^r ≥ 1 for r ∈ Z and let M be as in (1.1), with
e_min ≤ 0, e_max > 53 + r, p = 53 and hence ϵ = 2^{−53}. If

    (L − 3) log_2(N − 1) + (L − 1) log_2(a^2) ≥ 1 − log_2(ϵ) = 54,

then there is a neural network Φ_{a,N,L} with L layers, N neurons per layer, and all weights bounded in
absolute value below by 1 and above by a such that

    |Φ̃_{a,N,L}(x) − Φ_{a,N,L}(x)| = |Φ_{a,N,L}(x)| = |x|

for all x ≥ 0, where Φ̃_{a,N,L}(x) denotes the evaluation of the neural network Φ_{a,N,L} in floating-point
arithmetic with the machine epsilon ϵ.
In the finite-precision setting, we observe that the evaluation of very small neural networks can
already lead to a relative error of 1 compared to the neural network with arbitrary precision. Admissible
parameters such that the conclusion of Proposition 1.1 holds are, for example, N = 65, a = 10, and L = 8.
Proposition 1.1 considers only the forward application of a neural network. However, the construction
of this neural network is based on the example (1.6) and is such that the neural network is entirely linear.
By the backpropagation rule, recalled in Definition 2.3, the reversal of the neural network constructed in
Proposition 1.1 yields a neural network such that the accurate computation of the gradient of the output
of the neural network with respect to the weights is infeasible in floating-point arithmetic.
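The same cancellation affects the reverse pass. Backpropagation accumulates products of the layer Jacobians in
reverse order, and for the two-neuron example (1.6) the derivative of the output with respect to the input is
exactly such a product: its exact value is 1, but the floating-point evaluation collapses to 0 in the same
parameter regime as before. The sketch below is again our own illustration of that effect (not the construction
of Proposition A.3); all names and parameters are ours.

    import numpy as np

    def input_gradient_backprop(a, L, dtype):
        # d(phi)/dx accumulated in reverse (backpropagation) order; the exact value is 1.
        a = dtype(a)
        v = np.array([(dtype(1) + a ** (-dtype(L))) * a, -a], dtype=dtype)  # Jacobian of the last layer
        for _ in range(L - 2):                                              # middle Jacobians diag(a, a)
            v = a * v
        A_1 = np.array([a, a], dtype=dtype)                                 # Jacobian of the first layer
        return v @ A_1

    for L in (20, 30, 60):
        print(f"L = {L:2d}: float32 -> {input_gradient_backprop(2.0, L, np.float32):.6f}, "
              f"float64 -> {input_gradient_backprop(2.0, L, np.float64):.6f}")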
1.3 Contribution
In this work, we analyse the effect of inaccurate computations stemming from floating-point arithmetic
in a more moderate regime than that of Proposition 1.1. In contrast to the error amplification that was
the basis of the example (1.6), we assume in the following that the errors stem mostly from the accumulation
of many inaccurate operations. We assume that, in the computation of the gradient of a neural network,
in each layer, a small and controlled relative error will appear. Under this assumption, we observe that
neural networks trained with gradient descent will, with high probability, not exhibit exponentially many
affine pieces with respect to the number of layers.
We describe a framework for learning in floating-point arithmetic in Section 2. There, we define a
gradient descent procedure where the gradient is computed with noisy updates.
To facilitate our main results, we make two assumptions on the gradient descent procedure (Assump-
tions 2.6 and 2.7), which can be intuitively formulated as follows.
(A) We assume that the average of the reciprocals of the non-zero bias updates in an iteration is bounded
by a polynomial in the maximum number N of neurons in a single layer. Moreover, every neuron
with zero output (dead neuron) is assumed to remain so after the iteration. Both assumptions
together essentially require the gradient of every bias to be either zero or not too small for most
neurons. We verify this assumption numerically in Section 4; a minimal numerical sketch of the
quantity in question is also given after item (B) below.
(B) For all layers j ∈ {1, ..., L}, the derivative of each coordinate of the preactivations, i.e., of A_j x^{(j−1)} + b_j,
with respect to the input of the neural network is bounded by a uniform constant.
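Assumption (A) refers to quantities that are easy to monitor in an experiment. The sketch below is our own
illustration (the paper's formal definitions follow in Section 2 and its numerical verification is in Section 4; the
network, loss and all parameter choices here are ours): it performs one backpropagation pass for a small ReLU
network under a least-squares loss and reports the average of the reciprocals of the non-zero bias-gradient
magnitudes, which, up to the learning rate and the round-off perturbation, is the quantity that Assumption (A)
requires to be bounded polynomially in the width N.

    import numpy as np

    rng = np.random.default_rng(0)
    d, N, L = 4, 32, 3                                  # input dimension, width, number of layers
    sizes = [d] + [N] * (L - 1) + [1]
    As = [rng.standard_normal((m, n)) / np.sqrt(n) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [0.1 * rng.standard_normal(m) for m in sizes[1:]]

    X = rng.standard_normal((256, d))                   # a batch of inputs
    Y = np.sin(X[:, :1])                                # some target values

    # Forward pass, keeping the preactivations for the backward pass.
    acts, pre = [X], []
    for A, b in zip(As, bs):
        z = acts[-1] @ A.T + b
        pre.append(z)
        acts.append(np.maximum(z, 0.0))

    # Backward pass for the loss 0.5 * mean((output - Y)^2); collect the bias gradients.
    delta = (acts[-1] - Y) / len(X)
    bias_grads = []
    for j in reversed(range(L)):
        delta = delta * (pre[j] > 0)                    # derivative of the ReLU
        bias_grads.append(delta.sum(axis=0))
        delta = delta @ As[j]
    g = np.abs(np.concatenate(bias_grads))

    # The quantity appearing in Assumption (A): average reciprocal of the non-zero bias gradients.
    print("average reciprocal of non-zero bias gradients:", np.mean(1.0 / g[g > 0]))
    print("reference polynomial scale N^2 =", N * N)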
Under these assumptions, we prove Theorem 3.4, which can be intuitively phrased as follows: the
expected number of affine pieces that the neural network has after a single perturbed gradient descent
update has an upper bound that is polynomial (in practice quadratic) in the number of neurons, linear in
the number of layers and inversely proportional to the level of perturbation.
The result is based on the following insight of [14]. Consider as a prototypical neuron a function of
the form ϱ(h − b), where h is a piecewise affine function of bounded variation. Then, for ϱ(h − b) to have
more affine pieces than h it is necessary that h(x) = b for some x in the domain of h. Intuitively, for
many such x to exist, h − b has to change sign many times. If b is a random variable, then, for the event
h(x) = b to happen with high probability, h needs to repeatedly cross a substantial part of the region
where b has most of its mass. This, however, can only happen to some degree since the variation of h
was assumed to be bounded. The resulting estimate of the expected number of newly generated affine pieces
in one neuron under random biases is formalised in Lemma 3.1. Since the floating-point operations are assumed
to introduce noise into the bias updates in a way quantified in Definition 2.3, and since Assumption
(B) implies bounded variation of the outputs of each neuron, we deduce that with high probability the
number of affine pieces in a neural network is bounded.
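The quantity at the centre of this argument, the number of affine pieces of the realisation along a line, can also
be estimated numerically. The following crude sketch is our own illustration (the paper's exact piece count is
fixed in Definition 2.2, which is not reproduced here, and all names and parameters below are ours): it samples
the network densely on a segment and counts the points at which the local slope of the scalar output changes.

    import numpy as np

    def count_pieces_along_line(As, bs, x0, x1, samples=200_000, tol=1e-8):
        # Crude estimate of the number of affine pieces of the realisation on the segment [x0, x1]:
        # count slope changes of the output on a fine grid. A fine enough grid resolves the
        # breakpoints, but the result remains an estimate, not an exact count.
        t = np.linspace(0.0, 1.0, samples)
        X = x0[None, :] + t[:, None] * (x1 - x0)[None, :]
        out = X
        for A, b in zip(As, bs):
            out = np.maximum(out @ A.T + b, 0.0)
        slopes = np.diff(out[:, 0]) / np.diff(t)
        jumps = np.abs(np.diff(slopes)) > tol * (1.0 + np.abs(slopes[:-1]))
        # A breakpoint inside a grid cell shows up as up to two consecutive slope changes,
        # so merge consecutive flagged cells into a single breakpoint.
        breaks = int(jumps[0]) + int(np.sum(jumps[1:] & ~jumps[:-1]))
        return 1 + breaks

    rng = np.random.default_rng(0)
    d, N, L = 2, 16, 4
    sizes = [d] + [N] * (L - 1) + [1]
    As = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [rng.standard_normal(m) for m in sizes[1:]]
    print(count_pieces_along_line(As, bs, np.zeros(2), 10.0 * np.ones(2)))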
In a gradient descent iteration, the noise in each gradient update step also depends on the current
empirical risk. Notably, if the neural network is initialised with zero training error, no update happens.
However, under standard assumptions on the dynamics of the loss, we can still prove that the number of
affine pieces during the iteration does not increase quickly. This is the content of Theorem 3.6, which we
state below in a simplified and informal version:
Theorem 1.2 (Informal version of Theorem 3.6). During gradient descent, where the gradient is computed
by backpropagation in floating-point arithmetic, it holds for each iteration with probability 1/2 that the
number of affine pieces of a neural network along a given line grows at most polynomially (almost linearly)
with respect to the number of iterations, polynomially in the number of neurons, and linearly in the number
of layers.
Assumptions (A) and (B) will be discussed in detail after their formal statement in Section 2. Since
Assumption (A) relates to the distribution of updates, we numerically study whether it is satisfied in practice in
Section 4.
Finally, in Section 5, we numerically analyse the influence of the magnitude of round-off errors. We
observe that, while lower numerical accuracy influences the generation of affine pieces negatively, as
claimed by our main theorem, the effect is not as pronounced as the direct application of the theorem
may suggest. The reason is that, for the approximations involved, the number of affine pieces is already
significantly lower than our upper bounds. Hence we conclude that numerical instability is one reason that
prevents the learning of functions with exponentially many affine pieces, but there may be many more
such reasons that prevent the number of affine pieces from reaching the theoretically established threshold.
If the number of affine pieces is large for the initial neural network, it reduces in the course of training; in
accordance with our theoretical results, that reduction occurs faster for lower computational accuracies.
1.4 Related work
This work represents a counter-point to many approximation theoretical approaches used in the
analysis of deep learning. Here, we study to what extent piecewise-affine functions with a number of affine
pieces that is exponential with respect to the number of layers can be learned by deep neural networks
using gradient descent algorithms. Functions with high numbers of affine pieces are necessary in the
approximation of the square function [47], which forms the basis for many approximation results in the
literature, e.g., [48, 35, 30, 37, 40, 11].
The primary motivation for the approach that we follow is taken from [14]. There it is shown that
randomly initialised neural networks typically have few affine pieces. The main idea of that work is to
show that to generate many affine pieces, the bias vectors in neural networks need to be very accurately
chosen. This is unlikely to be achieved if the biases are initialised randomly. This very idea is also the
basis of our analysis. We cannot, however, simply apply the main results of [14] since analysing the
whole iterative procedure of gradient descent requires keeping track of the interdependence of the random
variables modelling the weights and biases of the network and estimating their variance throughout the
process.
Floating-point arithmetic and its effect on iterated computations have been studied in the literature
before. The example given by (1.5)–(1.6) is derived from the study of the stability of long tensor
factorisations in [2]. The explicit constructions of low-rank tensor approximations presented in [23, 21,
31, 22] all, depending on the specific implementation, may suffer from an instability similar to that
demonstrated in (1.5)–(1.6).
In [28], an empirical study is presented that finds that the accuracy affects the overall performance of a
neural network classifier. In this context, an effect of the number of layers is also measured, showing that
deeper neural networks are more sensitive to low accuracies. Another analysis is given in [45], where it is
shown that numerical inaccuracies introduce stochasticity into the trajectories of trained neural networks,
which is similar to noise in stochastic gradient descent. In the context of fixed-point arithmetic, [10] showed
that in many classification problems, one can restrict the computation accuracy significantly without
harming the classification accuracy. There is also a vast body of literature studying how neural networks
can be efficiently quantised to speed up computations or minimise storage requirements [49, 13, 19, 18].
Typically it is found that moderate quantisation or specifically designed quantisation does not harm
classification accuracies. To the best of our knowledge, the effect on the generation of many affine pieces
has not been studied.
There have also been more abstract studies showing that on digital hardware or in finite precision
computations, there are learning scenarios that cannot be solved to any reasonable accuracy; see [6, 4].
2 Framework
In this section, we introduce the framework for this article. We start by formally introducing neural
networks in Definition 2.1. Thereafter, we fix the notion of number of affine pieces of a neural network in
Definition 2.2. To study the effect of numerical accuracy on the convergence and performance of gradient
descent based learning, we define the perturbed gradient descent update in Definition 2.3 and the full
perturbed gradient descent iteration in Definition 2.5.
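To fix ideas before the formal definitions, the kind of update we have in mind can be sketched as follows. The
snippet is our own schematic paraphrase, not the paper's definition: the precise perturbed update and its noise
model are the content of Definitions 2.3 and 2.5 and are not reproduced here. Each exactly computed gradient
entry is replaced by a relatively perturbed one, with the perturbation level u playing the role of the
accumulated round-off error of backpropagation.

    import numpy as np

    def perturbed_gradient_step(param, grad, lr, u, rng):
        # Schematic noisy update: each gradient entry is scaled by (1 + delta)
        # with |delta| <= u, mimicking a relative round-off perturbation.
        delta = rng.uniform(-u, u, size=grad.shape)
        return param - lr * grad * (1.0 + delta)

    rng = np.random.default_rng(0)
    b = np.zeros(8)                      # a bias vector
    g = rng.standard_normal(8)           # its (exactly computed) gradient
    print(perturbed_gradient_step(b, g, lr=0.1, u=1e-7, rng=rng))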
To facilitate the results in the sequel, we make Assumptions 2.6 and 2.7 on the gradient descent
dynamics. Assumption 2.6 requires the absolute value of the derivative of the objective function with
respect to each of the biases to either vanish or be bounded from below. In the event that the derivative
vanishes, we assume that the associated neuron is dead, i.e., its output is constant on the whole domain.
Assumption 2.7 stipulates that the output of each neuron should have a bounded derivative with respect
to the input variable. We will discuss the sensibility of these assumptions at the end of this section.
We start by defining a neural network. Here we focus only on the most widely used activation function,
the ReLU, ϱ(x) := max{0, x}, for x ∈ R.
Definition 2.1 ([37, 35]). Let d, L ∈ N. A neural network (NN) with input dimension d and L layers is
a sequence of matrix-vector tuples

    Φ := ((A_1, b_1), (A_2, b_2), ..., (A_L, b_L)),

where N_0 := d and N_1, ..., N_L ∈ N, and where A_j ∈ R^{N_j × N_{j−1}} and b_j ∈ R^{N_j} for j = 1, ..., L. The number
N_L is referred to as the output dimension.
For j ∈ {1, ..., L − 1}, the subnetwork from layer 1 to layer j of Φ is the sequence of matrix-vector
tuples

    Φ_j := ((A_1, b_1), (A_2, b_2), ..., (A_j, b_j)).

For a NN Φ and a domain Ω ⊆ R^d, we define the associated realisation of the NN Φ as

    R(Φ) : Ω → R^{N_L} : x ↦ x^{(L)} =: R(Φ)(x),