DYNAMICAL SYSTEMS' BASED NEURAL NETWORKS
ELENA CELLEDONI∗, DAVIDE MURARI∗, BRYNJULF OWREN∗, CAROLA-BIBIANE SCHÖNLIEB†, AND FERDIA SHERRY†
∗Department of Mathematical Sciences, NTNU, N-7491 Trondheim, Norway (elena.celledoni@ntnu.no, davide.murari@ntnu.no, brynjulf.owren@ntnu.no)
†Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK (cbs31@cam.ac.uk, fs436@cam.ac.uk)
Abstract. Neural networks have gained much interest because of their effectiveness in many
applications. However, their mathematical properties are generally not well understood. If there is
some underlying geometric structure inherent to the data or to the function to approximate, it is often
desirable to take this into account in the design of the neural network. In this work, we start with
a non-autonomous ODE and build neural networks using a suitable, structure-preserving, numerical
time-discretisation. The structure of the neural network is then inferred from the properties of the
ODE vector field. Besides injecting more structure into the network architectures, this modelling
procedure allows a better theoretical understanding of their behaviour. We present two universal
approximation results and demonstrate how to impose some particular properties on the neural
networks. A particular focus is on 1-Lipschitz architectures including layers that are not 1-Lipschitz.
These networks are expressive and robust against adversarial attacks, as shown for the CIFAR-10
and CIFAR-100 datasets.
Key words. Neural networks, dynamical systems, Lipschitz networks, structure-preserving deep learning, universal approximation theorem.
MSC codes. 65L05, 65L06, 37M15
1. Introduction. Neural networks have been employed to accurately solve many
different tasks (see, e.g., [4,12,54,35]). Indeed, because of their excellent approxima-
tion properties, ability to generalise to unseen data, and efficiency, neural networks are
one of the preferred techniques for the approximation of functions in high-dimensional
spaces.
In spite of this popularity, a substantial number of results and success stories in
deep learning still rely on empirical evidence and more theoretical insight is needed.
Recently, a number of scientific papers on the mathematical foundations of neural
networks have appeared in the literature, [9,71,60,61,66,33]. In a similar spirit,
many authors consider the design of deep learning architectures taking into account
specific mathematical properties such as stability, symmetries, or constraints on the
Lipschitz constant [36,31,26,63,22,27,67,34,69,73]. Even so, the imposition of structure on neural networks is often done in an ad hoc manner, making the resulting input-to-output mapping $F : \mathcal{X} \to \mathcal{Y}$ hard to analyse. In this paper, we describe
a general and systematic way to impose desired mathematical structure on neural
networks leading to an easier approach to their analysis.
There have been multiple attempts to formulate unifying principles for the design of neural networks. Among these we mention Geometric Deep Learning (see e.g. [13,12]),
Neural ODEs (see e.g. [21,47,57,72]), the continuous-in-time interpretation of Re-
current Neural Networks (see e.g. [58,20]) and of Residual Neural Networks (see
e.g. [71,44,18,59,1]). In this work, we focus on Residual Neural Networks (ResNets)
and build upon their continuous interpretation.
Neural networks are compositions of parametric maps, i.e. we can characterise a neural network as a map $\mathcal{N} = f_{\theta_k} \circ \cdots \circ f_{\theta_1} : \mathbb{R}^n \to \mathbb{R}^m$, with $f_{\theta_i} : \mathbb{R}^{n_i} \to \mathbb{R}^{n_{i+1}}$ being the network layers. For ResNets the most important parametric maps are of the form

(1.1)    $x \mapsto f_{\theta_i}(x) = x + h\,\Lambda(\theta_i, x).$
The continuous-in-time interpretation of ResNets arises from the observation that if $n_i = n_{i+1}$, $f_{\theta_i}$ coincides with one step of size $h$ of the explicit Euler method applied to the non-autonomous ODE $\dot{x}(t) = \Lambda(\theta(t), x(t))$. In this work, we consider piecewise-autonomous systems, i.e. we focus on time-switching systems of the form

(1.2)    $\dot{x}(t) = f_{s(t)}(x(t)), \quad s : [0, T] \to \{1, \dots, N\}, \quad f_i \in \mathcal{F},$

with $s$ being piecewise constant (see e.g. [57,41]), and $\mathcal{F}$ a family of parametric vector fields. This simplification is not restrictive and can help analyse and design neural networks, as we will clarify throughout the paper.
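To make this correspondence concrete, the following sketch (an illustration of ours in NumPy; the $\tanh$-based choice of $\Lambda$ and the random weights are assumptions made for the example, not prescribed by the text) builds a residual layer as one explicit Euler step of a parametric vector field and stacks such layers as in (1.2).

```python
import numpy as np

def Lambda(theta, x):
    """A parametric vector field Lambda(theta, x); the tanh(Ax + b) form is an
    arbitrary illustrative choice."""
    A, b = theta
    return np.tanh(A @ x + b)

def resnet_layer(theta, x, h):
    """One residual layer (1.1): a single explicit Euler step of x' = Lambda(theta, x)."""
    return x + h * Lambda(theta, x)

def switching_resnet(thetas, x, h):
    """Compose the layers: on each sub-interval the switching law s(t) in (1.2)
    selects one autonomous vector field, here one per layer."""
    for theta in thetas:
        x = resnet_layer(theta, x, h)
    return x

# Example with random parameters and arbitrary dimensions.
rng = np.random.default_rng(0)
n, N, h = 4, 10, 0.1
thetas = [(rng.standard_normal((n, n)), rng.standard_normal(n)) for _ in range(N)]
print(switching_resnet(thetas, rng.standard_normal(n), h))
```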
This interpretation of ResNets gives the skeleton of our reasoning. Indeed, we
replace the explicit Euler method in (1.1) with suitable (structure-preserving) numer-
ical flows of appropriate vector fields. We call the groups of layers obtained with
these numerical flows “dynamical blocks”. The choice of the vector field is closely
related to the structure to impose. For example, to derive symplectic neural net-
works we would apply symplectic time integrators to Hamiltonian vector fields. This
approach enables us to derive new structured networks systematically and to place other existing architectures in a more general setting, making their analysis easier.
For instance, section 2 presents a strategy to study the approximation capabilities of
some structured networks. Finally, we highlight the flexibility and the benefits of this
framework in section 3, where we show that to obtain expressive and robust neural
networks, one can also include layers that are not 1-Lipschitz.
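For example, a symplectic layer can be obtained by applying the symplectic Euler method to a separable Hamiltonian vector field. The sketch below is a minimal illustration of ours (the quadratic kinetic term and the log-cosh potential are assumed choices, not the architectures studied later in the paper); since each step is symplectic, a dynamical block composed of such layers is symplectic as well.

```python
import numpy as np

def symplectic_euler_layer(x, h, grad_V, grad_K):
    """One step of the symplectic Euler method for the separable Hamiltonian
    H(q, p) = V(q) + K(p):  p <- p - h grad_V(q),  q <- q + h grad_K(p).
    For separable H this variant is explicit, and the resulting map is symplectic."""
    q, p = np.split(x, 2)
    p = p - h * grad_V(q)
    q = q + h * grad_K(p)
    return np.concatenate([q, p])

# Hypothetical parametric potential: V(q) = sum(log cosh(Wq + b)),
# so that grad_V(q) = W^T tanh(Wq + b).
rng = np.random.default_rng(1)
d = 3
W, b = rng.standard_normal((d, d)), rng.standard_normal(d)
grad_V = lambda q: W.T @ np.tanh(W @ q + b)
grad_K = lambda p: p            # K(p) = ||p||^2 / 2

x = rng.standard_normal(2 * d)  # state (q, p)
print(symplectic_euler_layer(x, 0.1, grad_V, grad_K))
```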
There are multiple situations where one could be interested in networks with some prescribed property. We report three of them here, where we refer to $F$ as the function to approximate:
1. When $F$ has some known characterising property, e.g. $F$ is known to be symplectic; see section 4.
2. When the data we process has a particular structure, e.g. vectors whose entries sum to one, as we present in section 4.
3. When we can approximate $F$ to sufficiently high accuracy with functions in $\mathcal{G}$, a space that is well suited to model the layers of a network. An example is using the space $\mathcal{G}$ of 1-Lipschitz functions to define a classifier robust to adversarial attacks; see section 3.
Thus, there are various applications where having neural networks structured in a
particular way is desirable. We will delve deeper into some of them in the following
sections. To be precise, we remark that all the properties we focus on are preserved
under composition, such as being 1-Lipschitz or symplectic.
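The 1-Lipschitz case, for instance, follows from the elementary estimate below; the symplectic case follows analogously, since the Jacobian of a composition is the product of the Jacobians and products of symplectic matrices are symplectic.

```latex
\|(f \circ g)(x) - (f \circ g)(y)\|
\le \operatorname{Lip}(f)\,\|g(x) - g(y)\|
\le \operatorname{Lip}(f)\operatorname{Lip}(g)\,\|x - y\|
\le \|x - y\|,
\qquad \text{if } \operatorname{Lip}(f),\ \operatorname{Lip}(g) \le 1.
```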
The paper is structured in five sections. First, in section 2 we investigate the universal
approximation capabilities of some neural networks, thanks to vector field decompo-
sitions, splitting methods and an embedding of the dynamics into larger dimensional
spaces. We then move, in section 3, to a neural network that has the property of being
1-Lipschitz. After the mathematical derivation of the architecture, we present some
numerical experiments on adversarial robustness for the CIFAR-10 and CIFAR-100
image classification problems. We devote a significant part of the experimental side
2
of this paper to examples in the well-established field of adversarial robustness, but
we furthermore provide examples of other desirable structural properties that can be
imposed on neural networks using connections to dynamical systems. In section 4, we
introduce such neural networks with specific designs. This last section aims to pres-
ent in a systematic way how one can impose certain properties on the architecture.
We finally conclude the paper in section 5, mentioning some promising directions for
further work.
Before moving on, we report a numerical experiment that motivates our investigation of structured neural networks. The results highlight that imposing structure need not considerably degrade the quality of the approximation. Furthermore, the experiment suggests that not all constraining strategies perform similarly, as we also highlight in section 3. A systematic process to impose structure is therefore essential, since it allows changing the architecture in a guided manner while preserving the property of interest.
1.1. Classification of points in the plane. We present a numerical experiment for the classification problem of the dataset in Figure 1b. We consider neural networks that are 1-Lipschitz, as in section 3. We define the network layers alternatingly as contractive flow maps, whose vector fields belong to $\mathcal{F}_c = \{-A^T\Sigma(Ax + b) : A^TA = I\}$, and as flows of other Lipschitz continuous vector fields in $\mathcal{F} = \{\Sigma(Ax + b) : A^TA = I\}$, with $\Sigma(z) = [\sigma(z_1), \dots, \sigma(z_n)]$ and $\sigma(s) = \max\{s, s/2\}$.¹ In section 3 we expand on the choice of this activation function $\sigma$, which is called LeakyReLU and was introduced in [45]. The time steps for each vector field are network parameters, together with the matrices $A$ and the vectors $b$. We constrain the time steps to get a 1-Lipschitz network; see section 3. We report the results in Figure 1a and Table 1.
The combination of average classification test accuracy and final integration time improves when alternating $\mathcal{F}_c$ with $\mathcal{F}$ instead of considering $\mathcal{F}_c$ alone. In particular, the final integration time $T$ obtained with $\mathcal{F}_c \cup \mathcal{F}$ is the smallest without significantly compromising the accuracy. The parameter $T$ quantifies how much the network layers transform the points: the larger the time step, the further a layer is from the identity map. Hence, alternating the vector fields yields a more natural and efficient solution. In section 2 we reinforce this empirical observation with theoretical approximation guarantees, showing that one can obtain neural networks with prescribed properties without compromising their approximation capabilities. A schematic sketch of the alternating layers used here is given after Table 1.
Adopted family of vector fields    Median accuracy    Median of $T$
$\mathcal{F} \cup \mathcal{F}_c$   98.0%              1.84
$\mathcal{F}$                      99.0%              7.53
$\mathcal{F}_c$                    97.3%              19.82

Table 1: We perform 100 experiments alternating vector fields in $\mathcal{F}_c$ with those in $\mathcal{F}$, 100 using just vector fields in $\mathcal{F}_c$, and 100 with only those in $\mathcal{F}$. Throughout the experiments we work with networks with ten residual layers. In the table we report the median final time $T$ and the median test accuracy for the three sets of experiments analysed.
¹To impose the weight orthogonality, we set $A = \operatorname{expm}(W - W^T)$, with expm being the matrix exponential and $W$ a trainable matrix.
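The following NumPy sketch illustrates the two families of layers used above, with the orthogonal weights built as in the footnote. It is our own schematic reconstruction: the explicit Euler steps, the fixed step size, and the random parameters are assumptions, and the step-size constraints that guarantee the 1-Lipschitz property (discussed in section 3) are omitted.

```python
import numpy as np
from scipy.linalg import expm

def orthogonal(W):
    """A = expm(W - W^T) is orthogonal since W - W^T is skew-symmetric."""
    return expm(W - W.T)

def leaky_relu(z):
    """sigma(s) = max{s, s/2}, applied entrywise (Sigma in the text)."""
    return np.maximum(z, 0.5 * z)

def layer_F(x, W, b, h):
    """Explicit Euler step of a vector field Sigma(Ax + b) from the family F."""
    A = orthogonal(W)
    return x + h * leaky_relu(A @ x + b)

def layer_Fc(x, W, b, h):
    """Explicit Euler step of the contractive vector field -A^T Sigma(Ax + b) from F_c."""
    A = orthogonal(W)
    return x - h * A.T @ leaky_relu(A @ x + b)

# Alternate the two dynamical regimes over ten residual layers (random parameters).
rng = np.random.default_rng(0)
x = rng.standard_normal(2)
for i in range(10):
    W, b, h = rng.standard_normal((2, 2)), rng.standard_normal(2), 0.2
    x = layer_F(x, W, b, h) if i % 2 == 0 else layer_Fc(x, W, b, h)
print(x)
```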
Fig. 1: Results from the experiments alternating the vector fields of $\mathcal{F}$ and those of $\mathcal{F}_c$, together with the dataset of interest. (a) Mean length of the time intervals along which the two dynamical regimes are active (median test accuracy: 98.0%). (b) The dataset studied for the classification problem.
2. Universal approximation properties. As introduced before, neural net-
work layers can be modelled by discretising ordinary differential equations. In par-
ticular, this ODE-based approach can also be beneficial for imposing some structure
on neural networks and providing a better theoretical understanding of their proper-
ties. In this section, we follow this principle and prove the universal approximation
capabilities of two neural network architectures. Starting with the continuous-in-
time interpretation of neural networks, many approaches are possible to prove such
approximation properties, often based on merging the broad range of results from dy-
namical systems theory and numerical analysis. One can proceed, for example, in a
constructive way as done in [57], where the authors investigate the dynamics of some
neural networks and explicitly construct solutions to the problem of approximating
a target function. Another possibility is to study the approximation capabilities of
compositions of flow maps, as done in [40]. In this section, we focus on two solutions
that, to the best of our knowledge, are new. The first result is based on a vector
field decomposition, while the second is based on embedding vector fields into larger
dimensional spaces.
The approximation results that we cover rely on approximating vector fields arbitrarily well. Consequently, this allows us to approximate their flow maps accurately. This is based on the fact that, for a sufficiently regular vector field $X \in \operatorname{Lip}(\mathbb{R}^n, \mathbb{R}^n)$, if $\tilde{X} : \mathbb{R}^n \to \mathbb{R}^n$ is such that for every $x \in \mathbb{R}^n$

(2.1)    $\|X(x) - \tilde{X}(x)\| < \varepsilon,$

then also their flow maps are close to one another on finite time intervals. We formalise this reasoning in Proposition 2.1. In the proposition and throughout the paper, we denote by $\Phi^t_X(z)$ the time-$t$ flow of the vector field $X$, applied to $z$.
Proposition 2.1. Let $X \in \operatorname{Lip}(\mathbb{R}^n, \mathbb{R}^n)$ and $\tilde{X} \in \operatorname{Lip}(\mathbb{R}^n, \mathbb{R}^n)$ be as in (2.1). Then
$\|\Phi^t_X(x) - \Phi^t_{\tilde{X}}(x)\| \le \varepsilon t \exp(\operatorname{Lip}(X)\,t),$
where $\operatorname{Lip}(X)$ is the Lipschitz constant of $X$.
Proof. We consider the integral equations associated with the ODEs $\dot{x}(t) = X(x(t))$ and $\dot{\tilde{x}}(t) = \tilde{X}(\tilde{x}(t))$ and study the difference of their solutions, both with the same initial condition $x \in \mathbb{R}^n$:
$$
\begin{aligned}
\|\Phi^t_X(x) - \Phi^t_{\tilde{X}}(x)\|
&= \Big\| x + \int_0^t X(\Phi^s_X(x))\,ds - x - \int_0^t \tilde{X}(\Phi^s_{\tilde{X}}(x))\,ds \Big\| \\
&\le \int_0^t \big\| X(\Phi^s_X(x)) - \tilde{X}(\Phi^s_{\tilde{X}}(x)) \big\|\,ds \\
&= \int_0^t \big\| X(\Phi^s_X(x)) - X(\Phi^s_{\tilde{X}}(x)) + X(\Phi^s_{\tilde{X}}(x)) - \tilde{X}(\Phi^s_{\tilde{X}}(x)) \big\|\,ds \\
&\le \operatorname{Lip}(X) \int_0^t \|\Phi^s_X(x) - \Phi^s_{\tilde{X}}(x)\|\,ds + \varepsilon t.
\end{aligned}
$$
Applying Gronwall's inequality, we conclude that
$\|\Phi^t_X(x) - \Phi^t_{\tilde{X}}(x)\| \le \varepsilon t \exp(\operatorname{Lip}(X)\,t).$
A particular consequence of this proposition is that if for every $\varepsilon > 0$ there is an $\tilde{X} \in \mathcal{F}$ making (2.1) true, then we can approximate the time-$T$ flow map of $X$ arbitrarily well using elements of $\mathcal{F}$:
$\|\Phi^T_X(x) - \Phi^T_{\tilde{X}}(x)\| \le \varepsilon T \exp(\operatorname{Lip}(X)\,T) = c\varepsilon.$
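This bound can also be observed numerically. The sketch below is our illustration: the vector field $X(x) = \tanh(Ax)$, the constant perturbation of norm below $\varepsilon$, and the use of a high-accuracy Runge-Kutta integrator as a stand-in for the exact flows are all assumptions made for the example. It compares the gap between the two flows at time $T$ with $\varepsilon T \exp(\operatorname{Lip}(X)\,T)$.

```python
import numpy as np
from scipy.integrate import solve_ivp

# X(x) = tanh(Ax) is Lipschitz with Lip(X) <= ||A||_2; X_tilde is a perturbation
# satisfying ||X(x) - X_tilde(x)|| < eps for every x (constant shift of norm < eps).
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
eps = 1e-3
delta = rng.standard_normal(3)
delta *= 0.9 * eps / np.linalg.norm(delta)

X = lambda t, x: np.tanh(A @ x)
X_tilde = lambda t, x: np.tanh(A @ x) + delta

x0 = rng.standard_normal(3)
T = 2.0
sol = solve_ivp(X, (0.0, T), x0, rtol=1e-10, atol=1e-12)
sol_tilde = solve_ivp(X_tilde, (0.0, T), x0, rtol=1e-10, atol=1e-12)

gap = np.linalg.norm(sol.y[:, -1] - sol_tilde.y[:, -1])
lip = np.linalg.norm(A, 2)                 # an upper bound on Lip(X)
bound = eps * T * np.exp(lip * T)
print(f"observed gap {gap:.2e} <= bound {bound:.2e}")
```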
Building on this result, we now derive two approximation results for neural networks by working at the level of the modelled vector fields.
2.1. Approximation based on a vector field decomposition. We now aim to show that, for a particularly designed neural network, we can approximate arbitrarily well any continuous function in the $L^p$ norm and any differentiable invertible function in the supremum norm on compact sets. We also mention how to extend this last result to generic continuous functions.
Theorem 2.2. Let $F : \Omega \subset \mathbb{R}^n \to \mathbb{R}^n$ be a continuous function, with $\Omega \subset \mathbb{R}^n$ a compact set. Suppose that it can be approximated, with respect to some norm $\|\cdot\|$, by a composition of flow maps of $C^1(\Omega, \mathbb{R}^n)$ vector fields, i.e. for any $\varepsilon > 0$ there exist $f_1, \dots, f_k \in C^1(\Omega, \mathbb{R}^n)$ such that

(2.2)    $\|F - \Phi^{h_k}_{f_k} \circ \dots \circ \Phi^{h_1}_{f_1}\| < \varepsilon.$

Then $F$ can be approximated arbitrarily well by composing flow maps of gradient and sphere-preserving vector fields, i.e.
$\|F - \Phi^{h_k}_{\nabla U_k} \circ \Phi^{h_k}_{X^S_k} \circ \dots \circ \Phi^{h_1}_{\nabla U_1} \circ \Phi^{h_1}_{X^S_1}\| < \varepsilon.$
By sphere-preserving vector field, we mean a vector field $X^S$ having $z^Tz$ as a first integral, i.e. such that $z^TX^S(z) = 0$ for any $z \in \mathbb{R}^n$.
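As a concrete illustration (our own example, not part of the theorem), the free rigid body field $X^S(z) = z \times Dz$ satisfies $z^TX^S(z) = 0$ and is therefore sphere-preserving. Integrating it with the implicit midpoint rule, a method that preserves quadratic first integrals such as $z^Tz$, keeps $\|z\|$ constant up to the tolerance of the fixed-point solver.

```python
import numpy as np

D = np.diag([1.0, 2.0, 3.0])   # arbitrary positive parameters (an assumed example)

def X_S(z):
    """Sphere-preserving field: z^T X_S(z) = z . (z x Dz) = 0, so z^T z is a first integral."""
    return np.cross(z, D @ z)

def implicit_midpoint_step(z, h, f, iters=50):
    """One step z1 = z0 + h f((z0 + z1)/2), solved by fixed-point iteration.
    The implicit midpoint rule preserves quadratic invariants."""
    z_new = z + h * f(z)                    # explicit Euler predictor
    for _ in range(iters):
        z_new = z + h * f(0.5 * (z + z_new))
    return z_new

z = np.array([1.0, 0.3, -0.5])
norms = [np.linalg.norm(z)]
for _ in range(200):
    z = implicit_midpoint_step(z, 0.05, X_S)
    norms.append(np.linalg.norm(z))
print(f"max drift of ||z||: {max(norms) - min(norms):.2e}")   # close to machine precision
```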
The norm $\|\cdot\|$ in (2.2) can be any norm that is well defined for functions in $C^1(\Omega, \mathbb{R}^n)$. Two typical choices in the literature are $L^p$ norms and the supremum norm

(2.3)    $\|F - \Phi^{h_k}_{f_k} \circ \dots \circ \Phi^{h_1}_{f_1}\| := \sup_{x \in \Omega} \|F(x) - \Phi^{h_k}_{f_k} \circ \dots \circ \Phi^{h_1}_{f_1}(x)\|.$