DYNAMICAL SYSTEMS' BASED NEURAL NETWORKS
ELENA CELLEDONI∗, DAVIDE MURARI∗, BRYNJULF OWREN∗, CAROLA-BIBIANE SCHÖNLIEB†, AND FERDIA SHERRY†
∗Department of Mathematical Sciences, NTNU, N-7491 Trondheim, Norway (elena.celledoni@ntnu.no, davide.murari@ntnu.no, brynjulf.owren@ntnu.no)
†Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK (cbs31@cam.ac.uk, fs436@cam.ac.uk)
Abstract. Neural networks have gained much interest because of their effectiveness in many
applications. However, their mathematical properties are generally not well understood. If there is
some underlying geometric structure inherent to the data or to the function to approximate, it is often
desirable to take this into account in the design of the neural network. In this work, we start with
a non-autonomous ODE and build neural networks using a suitable, structure-preserving, numerical
time-discretisation. The structure of the neural network is then inferred from the properties of the
ODE vector field. Besides injecting more structure into the network architectures, this modelling
procedure allows a better theoretical understanding of their behaviour. We present two universal
approximation results and demonstrate how to impose some particular properties on the neural
networks. A particular focus is on 1-Lipschitz architectures including layers that are not 1-Lipschitz.
These networks are expressive and robust against adversarial attacks, as shown for the CIFAR-10
and CIFAR-100 datasets.
Key words. Neural networks, dynamical systems, Lipschitz networks, structure-preserving deep learning, universal approximation theorem.
MSC codes. 65L05, 65L06, 37M15
1. Introduction. Neural networks have been employed to accurately solve many
different tasks (see, e.g., [4,12,54,35]). Indeed, because of their excellent approxima-
tion properties, ability to generalise to unseen data, and efficiency, neural networks are
one of the preferred techniques for the approximation of functions in high-dimensional
spaces.
In spite of this popularity, a substantial number of results and success stories in
deep learning still rely on empirical evidence and more theoretical insight is needed.
Recently, a number of scientific papers on the mathematical foundations of neural
networks have appeared in the literature, [9,71,60,61,66,33]. In a similar spirit,
many authors consider the design of deep learning architectures taking into account
specific mathematical properties such as stability, symmetries, or constraints on the
Lipschitz constant [36,31,26,63,22,27,67,34,69,73]. Even so, the imposition of structure on neural networks is often done in an ad hoc manner, making the resulting input-to-output mapping $F : \mathcal{X} \to \mathcal{Y}$ hard to analyse. In this paper, we describe
a general and systematic way to impose desired mathematical structure on neural
networks leading to an easier approach to their analysis.
There have been multiple attempts to formulate unifying principles for the design of neural networks. Among these we mention Geometric Deep Learning (see e.g. [13,12]),
Neural ODEs (see e.g. [21,47,57,72]), the continuous-in-time interpretation of Re-
current Neural Networks (see e.g. [58,20]) and of Residual Neural Networks (see
e.g. [71,44,18,59,1]). In this work, we focus on Residual Neural Networks (ResNets)
and build upon their continuous interpretation.
Neural networks are compositions of parametric maps, i.e. we can characterise a neural network as a map $\mathcal{N} = f_{\theta_k} \circ \cdots \circ f_{\theta_1} : \mathbb{R}^n \to \mathbb{R}^m$, with $f_{\theta_i} : \mathbb{R}^{n_i} \to \mathbb{R}^{n_{i+1}}$ being the network layers. For ResNets the most important parametric maps are of the form

(1.1)    $x \mapsto f_{\theta_i}(x) = x + h\,\Lambda(\theta_i, x).$
The continuous-in-time interpretation of ResNets arises from the observation that if $n_i = n_{i+1}$, $f_{\theta_i}$ coincides with one step of size $h$ of the explicit Euler method applied to the non-autonomous ODE $\dot{x}(t) = \Lambda(\theta(t), x(t))$. In this work, we consider piecewise-autonomous systems, i.e. we focus on time-switching systems of the form

(1.2)    $\dot{x}(t) = f_{s(t)}(x(t)), \quad s : [0, T] \to \{1, \dots, N\}, \quad f_i \in \mathcal{F},$

with $s$ being piecewise constant (see e.g. [57,41]), and $\mathcal{F}$ a family of parametric vector fields. This simplification is not restrictive and can help analyse and design neural networks, as we will clarify throughout the paper.
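To make this correspondence concrete, the following sketch (an illustration of ours in NumPy; the $\tanh$-based choice of $\Lambda$ and the random weights are assumptions made for the example, not prescribed by the text) builds a residual layer as one explicit Euler step of a parametric vector field and stacks such layers as in (1.2).

```python
import numpy as np

def Lambda(theta, x):
    """A parametric vector field Lambda(theta, x); the tanh(Ax + b) form is an
    arbitrary illustrative choice."""
    A, b = theta
    return np.tanh(A @ x + b)

def resnet_layer(theta, x, h):
    """One residual layer (1.1): a single explicit Euler step of x' = Lambda(theta, x)."""
    return x + h * Lambda(theta, x)

def switching_resnet(thetas, x, h):
    """Compose the layers: on each sub-interval the switching law s(t) in (1.2)
    selects one autonomous vector field, here one per layer."""
    for theta in thetas:
        x = resnet_layer(theta, x, h)
    return x

# Example with random parameters and arbitrary dimensions.
rng = np.random.default_rng(0)
n, N, h = 4, 10, 0.1
thetas = [(rng.standard_normal((n, n)), rng.standard_normal(n)) for _ in range(N)]
print(switching_resnet(thetas, rng.standard_normal(n), h))
```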
This interpretation of ResNets gives the skeleton of our reasoning. Indeed, we
replace the explicit Euler method in (1.1) with suitable (structure-preserving) numer-
ical flows of appropriate vector fields. We call the groups of layers obtained with
these numerical flows “dynamical blocks”. The choice of the vector field is closely
related to the structure to impose. For example, to derive symplectic neural net-
works we would apply symplectic time integrators to Hamiltonian vector fields. This
approach enables us to derive new structured networks systematically and to place other existing architectures in a more general setting, making their analysis easier.
For instance, section 2 presents a strategy to study the approximation capabilities of
some structured networks. Finally, we highlight the flexibility and the benefits of this
framework in section 3, where we show that to obtain expressive and robust neural
networks, one can also include layers that are not 1-Lipschitz.
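For example, a symplectic layer can be obtained by applying the symplectic Euler method to a separable Hamiltonian vector field. The sketch below is a minimal illustration of ours (the quadratic kinetic term and the log-cosh potential are assumed choices, not the architectures studied later in the paper); since each step is symplectic, a dynamical block composed of such layers is symplectic as well.

```python
import numpy as np

def symplectic_euler_layer(x, h, grad_V, grad_K):
    """One step of the symplectic Euler method for the separable Hamiltonian
    H(q, p) = V(q) + K(p):  p <- p - h grad_V(q),  q <- q + h grad_K(p).
    For separable H this variant is explicit, and the resulting map is symplectic."""
    q, p = np.split(x, 2)
    p = p - h * grad_V(q)
    q = q + h * grad_K(p)
    return np.concatenate([q, p])

# Hypothetical parametric potential: V(q) = sum(log cosh(Wq + b)),
# so that grad_V(q) = W^T tanh(Wq + b).
rng = np.random.default_rng(1)
d = 3
W, b = rng.standard_normal((d, d)), rng.standard_normal(d)
grad_V = lambda q: W.T @ np.tanh(W @ q + b)
grad_K = lambda p: p            # K(p) = ||p||^2 / 2

x = rng.standard_normal(2 * d)  # state (q, p)
print(symplectic_euler_layer(x, 0.1, grad_V, grad_K))
```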
There are multiple situations where one could be interested in networks with some prescribed property. We report three of them here, where we refer to $F$ as the function to approximate:
1. When $F$ has some known characterising property, e.g. $F$ is known to be symplectic; see section 4.
2. When the data we process has a particular structure, e.g. vectors whose entries sum to one, as we present in section 4.
3. When we can approximate $F$ to sufficiently high accuracy with functions in $\mathcal{G}$, a space that is well suited to model the layers of a network. An example is using the space $\mathcal{G}$ of 1-Lipschitz functions to define a classifier robust to adversarial attacks; see section 3.
Thus, there are various applications where having neural networks structured in a
particular way is desirable. We will delve deeper into some of them in the following
sections. To be precise, we remark that all the properties we focus on are preserved
under composition, such as being 1-Lipschitz or symplectic.
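The 1-Lipschitz case, for instance, follows from the elementary estimate below; the symplectic case follows analogously, since the Jacobian of a composition is the product of the Jacobians and products of symplectic matrices are symplectic.

```latex
\|(f \circ g)(x) - (f \circ g)(y)\|
\le \operatorname{Lip}(f)\,\|g(x) - g(y)\|
\le \operatorname{Lip}(f)\operatorname{Lip}(g)\,\|x - y\|
\le \|x - y\|,
\qquad \text{if } \operatorname{Lip}(f),\ \operatorname{Lip}(g) \le 1.
```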
The paper is structured in five sections. First, in section 2 we investigate the universal
approximation capabilities of some neural networks, thanks to vector field decompo-
sitions, splitting methods and an embedding of the dynamics into larger dimensional
spaces. We then move, in section 3, to a neural network that has the property of being
1-Lipschitz. After the mathematical derivation of the architecture, we present some
numerical experiments on adversarial robustness for the CIFAR-10 and CIFAR-100
image classification problems. We devote a significant part of the experimental side
2
of this paper to examples in the well-established field of adversarial robustness, but
we furthermore provide examples of other desirable structural properties that can be
imposed on neural networks using connections to dynamical systems. In section 4, we
introduce such neural networks with specific designs. This last section aims to pres-
ent in a systematic way how one can impose certain properties on the architecture.
We finally conclude the paper in section 5, mentioning some promising directions for
further work.
Before moving on, we report a numerical experiment that motivates our investigation of structured neural networks. The results highlight that imposing structure need not considerably degrade the quality of the approximation. Furthermore, the experiment suggests that not all constraining strategies perform similarly, as we also highlight in section 3. A systematic process to impose structure is therefore essential, since it allows changing the architecture in a guided manner while preserving the property of interest.
1.1. Classification of points in the plane. We present a numerical experiment for the classification problem of the dataset in Figure 1b. We consider neural networks that are 1-Lipschitz, as in section 3. We define the network layers alternatingly as contractive flow maps, whose vector fields belong to $\mathcal{F}_c = \{-A^T\Sigma(Ax + b) : A^TA = I\}$, and as flows of other Lipschitz continuous vector fields in $\mathcal{F} = \{\Sigma(Ax + b) : A^TA = I\}$, with $\Sigma(z) = [\sigma(z_1), \dots, \sigma(z_n)]$ and $\sigma(s) = \max\{s, s/2\}$.¹ In section 3 we expand on the choice of this activation function $\sigma$, which is called LeakyReLU and was introduced in [45]. The time steps for each vector field are network parameters, together with the matrices $A$ and the vectors $b$. We constrain the time steps to get a 1-Lipschitz network; see section 3. We report the results in Figure 1a and Table 1.
The combination of average classification test accuracy and final integration time improves when alternating $\mathcal{F}_c$ with $\mathcal{F}$ instead of considering $\mathcal{F}_c$ alone. In particular, the final integration time $T$ obtained with $\mathcal{F}_c \cup \mathcal{F}$ is the smallest without significantly compromising the accuracy. The parameter $T$ quantifies how much the network layers transform the points: the larger the time step, the further a layer is from the identity map. Hence, alternating the vector fields yields a more natural and efficient solution. In section 2 we reinforce this empirical observation with theoretical approximation guarantees, showing that one can obtain neural networks with prescribed properties without compromising their approximation capabilities. A schematic sketch of the alternating layers used here is given after Table 1.
Adopted family of vector fields    Median accuracy    Median of $T$
$\mathcal{F} \cup \mathcal{F}_c$   98.0%              1.84
$\mathcal{F}$                      99.0%              7.53
$\mathcal{F}_c$                    97.3%              19.82

Table 1: We perform 100 experiments alternating vector fields in $\mathcal{F}_c$ with those in $\mathcal{F}$, 100 using just vector fields in $\mathcal{F}_c$, and 100 with only those in $\mathcal{F}$. Throughout the experiments we work with networks with ten residual layers. In the table we report the median final time $T$ and the median test accuracy for the three sets of experiments analysed.
¹To impose the weight orthogonality, we set $A = \operatorname{expm}(W - W^T)$, with expm being the matrix exponential and $W$ a trainable matrix.
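The following NumPy sketch illustrates the two families of layers used above, with the orthogonal weights built as in the footnote. It is our own schematic reconstruction: the explicit Euler steps, the fixed step size, and the random parameters are assumptions, and the step-size constraints that guarantee the 1-Lipschitz property (discussed in section 3) are omitted.

```python
import numpy as np
from scipy.linalg import expm

def orthogonal(W):
    """A = expm(W - W^T) is orthogonal since W - W^T is skew-symmetric."""
    return expm(W - W.T)

def leaky_relu(z):
    """sigma(s) = max{s, s/2}, applied entrywise (Sigma in the text)."""
    return np.maximum(z, 0.5 * z)

def layer_F(x, W, b, h):
    """Explicit Euler step of a vector field Sigma(Ax + b) from the family F."""
    A = orthogonal(W)
    return x + h * leaky_relu(A @ x + b)

def layer_Fc(x, W, b, h):
    """Explicit Euler step of the contractive vector field -A^T Sigma(Ax + b) from F_c."""
    A = orthogonal(W)
    return x - h * A.T @ leaky_relu(A @ x + b)

# Alternate the two dynamical regimes over ten residual layers (random parameters).
rng = np.random.default_rng(0)
x = rng.standard_normal(2)
for i in range(10):
    W, b, h = rng.standard_normal((2, 2)), rng.standard_normal(2), 0.2
    x = layer_F(x, W, b, h) if i % 2 == 0 else layer_Fc(x, W, b, h)
print(x)
```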
Fig. 1: Results from the experiments alternating the vector fields of $\mathcal{F}$ and those of $\mathcal{F}_c$, together with the dataset of interest. (a) Mean length of the time intervals along which the two dynamical regimes are active (median test accuracy: 98.0%). (b) The dataset studied for the classification problem.
2. Universal approximation properties. As introduced before, neural net-
work layers can be modelled by discretising ordinary differential equations. In par-
ticular, this ODE-based approach can also be beneficial for imposing some structure
on neural networks and providing a better theoretical understanding of their proper-
ties. In this section, we follow this principle and prove the universal approximation
capabilities of two neural network architectures. Starting with the continuous-in-
time interpretation of neural networks, many approaches are possible to prove such
approximation properties, often based on merging the broad range of results from dy-
namical systems theory and numerical analysis. One can proceed, for example, in a
constructive way as done in [57], where the authors investigate the dynamics of some
neural networks and explicitly construct solutions to the problem of approximating
a target function. Another possibility is to study the approximation capabilities of
compositions of flow maps, as done in [40]. In this section, we focus on two solutions
that, to the best of our knowledge, are new. The first result is based on a vector
field decomposition, while the second is based on embedding vector fields into larger
dimensional spaces.
The approximation results that we cover rely on approximating vector fields arbitrarily well. Consequently, this allows us to approximate their flow maps accurately. This is based on the fact that, for a sufficiently regular vector field $X \in \operatorname{Lip}(\mathbb{R}^n, \mathbb{R}^n)$, if $\tilde{X} : \mathbb{R}^n \to \mathbb{R}^n$ is such that for every $x \in \mathbb{R}^n$

(2.1)    $\|X(x) - \tilde{X}(x)\| < \varepsilon,$

then also their flow maps are close to one another on finite time intervals. We formalise this reasoning in Proposition 2.1. In the proposition and throughout the paper, we denote by $\Phi^t_X(z)$ the time-$t$ flow of the vector field $X$, applied to $z$.
Proposition 2.1. Let $X \in \operatorname{Lip}(\mathbb{R}^n, \mathbb{R}^n)$ and $\tilde{X} \in \operatorname{Lip}(\mathbb{R}^n, \mathbb{R}^n)$ be as in (2.1). Then
$\|\Phi^t_X(x) - \Phi^t_{\tilde{X}}(x)\| \le \varepsilon t \exp(\operatorname{Lip}(X)\,t),$
where $\operatorname{Lip}(X)$ is the Lipschitz constant of $X$.
Proof. We consider the integral equations associated with the ODEs $\dot{x}(t) = X(x(t))$ and $\dot{\tilde{x}}(t) = \tilde{X}(\tilde{x}(t))$ and study the difference of their solutions, both with the same initial condition $x \in \mathbb{R}^n$:
$$
\begin{aligned}
\|\Phi^t_X(x) - \Phi^t_{\tilde{X}}(x)\|
&= \Big\| x + \int_0^t X(\Phi^s_X(x))\,ds - x - \int_0^t \tilde{X}(\Phi^s_{\tilde{X}}(x))\,ds \Big\| \\
&\le \int_0^t \big\| X(\Phi^s_X(x)) - \tilde{X}(\Phi^s_{\tilde{X}}(x)) \big\|\,ds \\
&= \int_0^t \big\| X(\Phi^s_X(x)) - X(\Phi^s_{\tilde{X}}(x)) + X(\Phi^s_{\tilde{X}}(x)) - \tilde{X}(\Phi^s_{\tilde{X}}(x)) \big\|\,ds \\
&\le \operatorname{Lip}(X) \int_0^t \|\Phi^s_X(x) - \Phi^s_{\tilde{X}}(x)\|\,ds + \varepsilon t.
\end{aligned}
$$
Applying Gronwall's inequality, we conclude that
$\|\Phi^t_X(x) - \Phi^t_{\tilde{X}}(x)\| \le \varepsilon t \exp(\operatorname{Lip}(X)\,t).$
A particular consequence of this proposition is that if for every $\varepsilon > 0$ there is an $\tilde{X} \in \mathcal{F}$ making (2.1) true, then we can approximate the time-$T$ flow map of $X$ arbitrarily well using elements of $\mathcal{F}$:
$\|\Phi^T_X(x) - \Phi^T_{\tilde{X}}(x)\| \le \varepsilon T \exp(\operatorname{Lip}(X)\,T) = c\varepsilon.$
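This bound can also be observed numerically. The sketch below is our illustration: the vector field $X(x) = \tanh(Ax)$, the constant perturbation of norm below $\varepsilon$, and the use of a high-accuracy Runge-Kutta integrator as a stand-in for the exact flows are all assumptions made for the example. It compares the gap between the two flows at time $T$ with $\varepsilon T \exp(\operatorname{Lip}(X)\,T)$.

```python
import numpy as np
from scipy.integrate import solve_ivp

# X(x) = tanh(Ax) is Lipschitz with Lip(X) <= ||A||_2; X_tilde is a perturbation
# satisfying ||X(x) - X_tilde(x)|| < eps for every x (constant shift of norm < eps).
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
eps = 1e-3
delta = rng.standard_normal(3)
delta *= 0.9 * eps / np.linalg.norm(delta)

X = lambda t, x: np.tanh(A @ x)
X_tilde = lambda t, x: np.tanh(A @ x) + delta

x0 = rng.standard_normal(3)
T = 2.0
sol = solve_ivp(X, (0.0, T), x0, rtol=1e-10, atol=1e-12)
sol_tilde = solve_ivp(X_tilde, (0.0, T), x0, rtol=1e-10, atol=1e-12)

gap = np.linalg.norm(sol.y[:, -1] - sol_tilde.y[:, -1])
lip = np.linalg.norm(A, 2)                 # an upper bound on Lip(X)
bound = eps * T * np.exp(lip * T)
print(f"observed gap {gap:.2e} <= bound {bound:.2e}")
```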
Building on this result, we now derive two approximation results for neural networks by working at the level of the modelled vector fields.
2.1. Approximation based on a vector field decomposition. We now aim to show that, for a particularly designed neural network, we can approximate arbitrarily well any continuous function in the $L^p$ norm and any differentiable invertible function in the supremum norm on compact sets. We also mention how to extend this last result to generic continuous functions.
Theorem 2.2. Let $F : \Omega \subset \mathbb{R}^n \to \mathbb{R}^n$ be a continuous function, with $\Omega \subset \mathbb{R}^n$ a compact set. Suppose that it can be approximated, with respect to some norm $\|\cdot\|$, by a composition of flow maps of $C^1(\Omega, \mathbb{R}^n)$ vector fields, i.e. for any $\varepsilon > 0$ there exist $f_1, \dots, f_k \in C^1(\Omega, \mathbb{R}^n)$ such that

(2.2)    $\|F - \Phi^{h_k}_{f_k} \circ \dots \circ \Phi^{h_1}_{f_1}\| < \varepsilon.$

Then $F$ can be approximated arbitrarily well by composing flow maps of gradient and sphere-preserving vector fields, i.e.
$\|F - \Phi^{h_k}_{\nabla U_k} \circ \Phi^{h_k}_{X^S_k} \circ \dots \circ \Phi^{h_1}_{\nabla U_1} \circ \Phi^{h_1}_{X^S_1}\| < \varepsilon.$
By sphere-preserving vector field, we mean a vector field $X^S$ having $z^Tz$ as a first integral, i.e. such that $z^TX^S(z) = 0$ for any $z \in \mathbb{R}^n$.
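As a concrete illustration (our own example, not part of the theorem), the free rigid body field $X^S(z) = z \times Dz$ satisfies $z^TX^S(z) = 0$ and is therefore sphere-preserving. Integrating it with the implicit midpoint rule, a method that preserves quadratic first integrals such as $z^Tz$, keeps $\|z\|$ constant up to the tolerance of the fixed-point solver.

```python
import numpy as np

D = np.diag([1.0, 2.0, 3.0])   # arbitrary positive parameters (an assumed example)

def X_S(z):
    """Sphere-preserving field: z^T X_S(z) = z . (z x Dz) = 0, so z^T z is a first integral."""
    return np.cross(z, D @ z)

def implicit_midpoint_step(z, h, f, iters=50):
    """One step z1 = z0 + h f((z0 + z1)/2), solved by fixed-point iteration.
    The implicit midpoint rule preserves quadratic invariants."""
    z_new = z + h * f(z)                    # explicit Euler predictor
    for _ in range(iters):
        z_new = z + h * f(0.5 * (z + z_new))
    return z_new

z = np.array([1.0, 0.3, -0.5])
norms = [np.linalg.norm(z)]
for _ in range(200):
    z = implicit_midpoint_step(z, 0.05, X_S)
    norms.append(np.linalg.norm(z))
print(f"max drift of ||z||: {max(norms) - min(norms):.2e}")   # close to machine precision
```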
The norm $\|\cdot\|$ in (2.2) can be any norm that is well defined for functions in $C^1(\Omega, \mathbb{R}^n)$. Two typical choices in the literature are $L^p$ norms and the supremum norm

(2.3)    $\|F - \Phi^{h_k}_{f_k} \circ \dots \circ \Phi^{h_1}_{f_1}\| := \sup_{x \in \Omega} \|F(x) - \Phi^{h_k}_{f_k} \circ \dots \circ \Phi^{h_1}_{f_1}(x)\|.$