Designing Universal Causal Deep Learning Models: The Case of Infinite-Dimensional
Dynamical Systems from Stochastic Analysis
Luca Galimberti · Anastasis Kratsios · Giulia Livieri
Abstract Several non-linear operators in stochastic analysis, such as solution maps to stochastic differential equations, depend on a temporal structure which is not leveraged by contemporary neural operators designed to approximate general maps between Banach spaces. This paper therefore proposes an operator learning solution to this open problem by introducing a deep learning model-design framework that takes suitable infinite-dimensional linear metric spaces, e.g. Banach spaces, as inputs and returns a universal sequential deep learning model adapted to these linear geometries and specialized for the approximation of operators encoding a temporal structure. We call these models Causal Neural Operators. Our main result states that the models produced by our framework can uniformly approximate, on compact sets and across arbitrary finite-time horizons, Hölder or smooth trace-class operators which causally map sequences between given linear metric spaces. Our analysis uncovers new quantitative relationships on the latent state-space dimension of Causal Neural Operators, which even have new implications for (classical) finite-dimensional Recurrent Neural Networks. In addition, our guarantees for recurrent neural networks are tighter than the available results inherited from feedforward neural networks when approximating dynamical systems between finite-dimensional spaces.
Keywords Universal Approximation, Causality, Operator Learning, Linear Widths.
Mathematics Subject Classification (2020) 68T07 · 91-08 · 37A50 · 65C30 · 60G35 · 41A65
1 Introduction
Infinite-dimensional (non-linear) dynamical systems play a central role in several sciences, especially for disciplines
driven by stochastic analytic modeling. However, despite this fact, the causal neural network approximation theory
for most relevant dynamical systems in stochastic analysis is lacking. Indeed, we currently only understand neural network approximations of stochastic differential equations (SDEs) with deterministic coefficients (e.g., [43]) and of time-invariant random dynamical systems with the fading memory and echo state property/unique solution property (e.g., [79,44]). A significant open problem is the causal neural network approximation of solution operators to non-Markovian SDEs.
Moreover, the understanding of how sequential DL models work is still not fully developed, even in the classical
finite-dimensional setting. For instance, the seemingly elementary empirical fact that a sequential DL model’s
expressiveness increases when one utilizes a high-dimensional latent state space is understood qualitatively for
general dynamical systems on Euclidean spaces (as in the reservoir computing literature (e.g., [41])).
L. Galimberti
King’s College London
Department of Mathematics
Strand Building, Strand, London, WC2R 2LS
E-mail: luca.galimberti@kcl.ac.uk
A. Kratsios
McMaster University and The Vector Institute
Department of Mathematics
1280 Main Street West, Hamilton, Ontario, L8S 4K1, Canada
E-mail: kratsioa@mcmaster.ca
G. Livieri
London School of Economics (LSE)
Department of Statistics
Columbia House, Houghton Street, London, WC2A 2AE
E-mail: g.livieri@lse.ac.uk
However, the quantitative understanding of the relationship between a sequential learning model’s state and
its expressiveness remains an open problem. One notable exception to this fact is the approximation of linear
state-space dynamical systems by a stylized class of Recurrent Neural Networks (RNNs, henceforth); see [56,77].
Our contribution. Our paper provides a simple quantitative solution to a far-reaching generalization of the above problem: constructing neural network approximations of infinite-dimensional (generalized) dynamical systems on "good" linear metric spaces. More precisely, we construct a neural network approximation of any function $f$ that "causally" and "regularly" maps sequences $(x_{t_n})_{n=-\infty}^{\infty}$ to sequences $(y_{t_n})_{n=-\infty}^{\infty}$, where each $x_{t_n}$ and every $y_{t_n}$ lives in a suitable linear metric space. In particular, we construct our causal neural network approximation framework on the following desiderata:
(D1) Predictions are causal, i.e., each $y_{t_n}$ is predicted independently of $(x_{t_m})_{m>n}$ (see the sketch following this list).
(D2) Each $y_{t_n}$ is predicted with a small neural network specialized at time $t_n$.
(D3) Only one of these specialized networks is stored in working memory at a time.
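To make (D1) concrete, the following minimal sketch (an illustration only; the window length and weights are hypothetical and not part of the constructions in this paper) implements a simple causal map on real-valued sequences and checks numerically that perturbing future inputs leaves past outputs unchanged.

```python
import numpy as np

def causal_map(x, window=3, weights=(0.5, 0.3, 0.2)):
    """y_n depends only on (x_{n-window+1}, ..., x_n): a causal map of sequences."""
    w = np.asarray(weights)
    y = np.empty_like(x, dtype=float)
    for n in range(len(x)):
        past = x[max(0, n - window + 1): n + 1]          # never looks beyond index n
        y[n] = past @ w[-len(past):]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
x_future_perturbed = x.copy()
x_future_perturbed[6:] += 10.0                           # tamper only with the "future"

y, y_pert = causal_map(x), causal_map(x_future_perturbed)
assert np.allclose(y[:6], y_pert[:6])                    # (D1): y_0, ..., y_5 are unchanged
print(y[:6])
```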
We begin by describing our causal neural network model's design. Subsequently, we discuss our approximation theory's implications for computational stochastic analysis.
Fig. 1: The Causal Neural Operator Model.
Summary: A universal approximator of regular causal sequences of operators between well-behaved Fréchet spaces.
Overview: The model successively applies a "universal" neural filter (see Figure 2) on consecutive time-windows; the internal parameters of this neural filter evolve according to a latent dynamical system on the neural filter's parameter space, implemented by a deep ReLU network called a hypernetwork.
Our neural network model, which we call the Causal Neural Operator (CNO, henceforth), is illustrated in Figure 1 and works in the following way. At any given time $t_n$, it predicts an instance of the output time-series at that time $t_n$ using an immediate time-window from the input time-series (e.g., it predicts each $y_{t_n}$ using only $(x_{t_i})_{i=n-10}^{n}$). At each time $t_n$, this prediction is generated by a non-linear operator defined by a finitely parameterized neural network model, called a neural filter (the vertical black arrows in Figure 1). Our neural network model stores only one neural filter's parameters in working memory at the current time by using an auxiliary deep ReLU neural network, called a hypernetwork in the machine learning literature (e.g., [47,103]), to generate the next neural filter specialized at $t_{n+1}$ using only the parameters of the current "active" neural filter specialized at time $t_n$ (the blue box in Figure 1). Thus, a dynamical system (i.e., the hypernetwork) on the neural filter's parameter space, interpolating between each neural filter's parameters, encodes our entire model.
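The forward recursion just described can be summarized by the following minimal sketch. It is an illustration only, not the exact architecture analyzed in this paper: the filter and hypernetwork dimensions, the helper `unpack`, and the random initialization are hypothetical, and the filter is assumed to act on already-encoded (finite-dimensional) window coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID, D_OUT = 8, 16, 4                    # illustrative basis-truncation dimensions

def relu(z):
    return np.maximum(z, 0.0)

def unpack(theta):
    """Split a flat parameter vector into the filter's two affine layers."""
    i = 0
    W1 = theta[i:i + D_HID * D_IN].reshape(D_HID, D_IN); i += D_HID * D_IN
    b1 = theta[i:i + D_HID]; i += D_HID
    W2 = theta[i:i + D_OUT * D_HID].reshape(D_OUT, D_HID); i += D_OUT * D_HID
    b2 = theta[i:i + D_OUT]
    return W1, b1, W2, b2

N_PARAMS = D_HID * D_IN + D_HID + D_OUT * D_HID + D_OUT

def neural_filter(theta, x_window):
    """Apply the finitely parameterized filter encoded by theta to one time-window."""
    W1, b1, W2, b2 = unpack(theta)
    return W2 @ relu(W1 @ x_window + b1) + b2

def hypernetwork(theta, layers):
    """Deep ReLU map on the filter's parameter space: theta_n -> theta_{n+1}."""
    z = theta
    for W, b in layers[:-1]:
        z = relu(W @ z + b)
    W, b = layers[-1]
    return W @ z + b

def cno_forward(x_windows, theta0, hyper_layers):
    """Only one filter's parameters live in working memory at any time (D3)."""
    theta, ys = theta0, []
    for x_win in x_windows:                      # x_win ~ coefficients of (x_{t_{n-J}}, ..., x_{t_n})
        ys.append(neural_filter(theta, x_win))   # causal prediction of y_{t_n}: (D1), (D2)
        theta = hypernetwork(theta, hyper_layers)
    return ys

theta0 = 0.1 * rng.standard_normal(N_PARAMS)
hyper = [(0.01 * rng.standard_normal((N_PARAMS, N_PARAMS)), np.zeros(N_PARAMS)) for _ in range(2)]
windows = [rng.standard_normal(D_IN) for _ in range(5)]
print([y.shape for y in cno_forward(windows, theta0, hyper)])
```

Note that the loop above holds only the current filter's parameter vector, discarding the previous one once the hypernetwork has produced its successor.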
The principal approximation-theoretic advantage of this approach lies in the fact that the hypernetwork is not designed to approximate anything; rather, it only needs to memorize/interpolate a finite number of finite-dimensional (parameter) vectors. Since memorization (e.g., [102,68,52]) requires only a polynomial number of parameters to achieve zero approximation error on a finite set, while approximation (e.g., [108,69,109,70]) requires an exponential number of parameters to achieve a possibly non-zero error over a large set containing the finite set of interest, leveraging memorization yields both lighter (fewer parameters) and more accurate deep learning models; that is, the constructed neural network model is exponentially more efficient. In particular, using a neural network for memorization allows the trained DL model to generalize beyond the data it is interpolating, a capability
that a simple list does not possess. When both the input and output spaces are finite-dimensional, our models effectively reduce to RNNs, which are known for their ability to generalize beyond their training data [104]. This generalization is attributed to factors such as having a finite VC (Vapnik-Chervonenkis) dimension [65,94] or finite Rademacher complexity [58]. Thus, this neural network design allows us to successfully encode all the parameters required to approximate long stretches of time $\{t_0, \dots, t_N\}$ (for large $N$) with far fewer parameters (i.e., at the cost of $\mathcal{O}(\log(N))$ additional layers in the hypernetwork). In this way, we achieve desiderata (D1)-(D3), provided that each neural filter relies on only a small number of parameters. We show that this is the case whenever $f$ is "sufficiently smooth"; the rigorous formulations of all these outlined ideas are given in Lemma 5 and Theorem 2.
Fig. 2: The Neural Filter.
Summary: A universal approximator of regular maps between any well-behaved Fréchet spaces.
Overview: The neural filter first encodes inputs from a (possibly infinite-dimensional) linear space by approximately representing the input as coefficients of a sparse (Schauder) basis. These basis coefficients are then transformed by a deep ReLU network, and the network's outputs are decoded as the coefficients of a sparse basis representation of an element of the output linear space. Assembling the basis using the outputted coefficients produces the neural filter's output.
Though we are focused on the approximation-theoretic properties of our modeling framework, we have designed our CNO by accounting for practical considerations. Namely, we intentionally designed the CNO model so that, like transformer networks [101], it can be trained non-recursively (via our federated training algorithm, see Algorithm 1 below). This design choice is motivated by the main reason why the transformer network model (e.g., [101]) has replaced its residual (e.g., [49]) and RNN (especially Long Short-Term Memory (LSTM, henceforth) [51]) counterparts in practice (e.g., [53,106]); namely, not back-propagating through time during training. Omitting any recurrence relation between a model's predictions in sequential prediction tasks, at least during the model's construction, has been empirically confirmed to yield more reliable and accurate models, trained faster and without vanishing or exploding gradient problems; see, e.g., [50,88]. Nevertheless, our model does ultimately reap the benefits of recursive models even if we construct it non-recursively, using our parallelizable training procedure.
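The following sketch illustrates one plausible reading of such a non-recursive, parallelizable training scheme; it is an illustration only and not Algorithm 1. For brevity, the per-time filters are linear and fit by gradient descent, each time step is fit independently (no back-propagation through time), and a single affine map plays the role of the hypernetwork, fit in closed form to memorize the transitions between consecutive filters' parameters.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)

def fit_filter(window_data, n_params=32, steps=200, lr=0.05):
    """Fit one (here: linear) filter for a single time step by gradient descent."""
    X, Y = window_data                          # X: (samples, n_params), Y: (samples,)
    theta = np.zeros(n_params)
    for _ in range(steps):
        grad = X.T @ (X @ theta - Y) / len(Y)
        theta -= lr * grad
    return theta

# one synthetic (X, Y) training set per time step t_0, ..., t_7
data = [(rng.standard_normal((64, 32)), rng.standard_normal(64)) for _ in range(8)]

with ThreadPoolExecutor() as pool:              # each filter is fit independently, in parallel
    thetas = list(pool.map(fit_filter, data))

# fit the "hypernetwork" (a single affine map here, for brevity) so that it
# memorizes the finite set of transitions theta_n -> theta_{n+1}
A, B = np.stack(thetas[:-1]), np.stack(thetas[1:])
H, *_ = np.linalg.lstsq(A, B, rcond=None)       # theta_{n+1} ≈ theta_n @ H
print(np.abs(A @ H - B).max())                  # memorization error (≈ 0 on this finite set)
```

The closed-form final step reflects the point made earlier: the hypernetwork only needs to interpolate a finite collection of parameter vectors, not approximate a map over a large set.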
The neural filter, illustrated in Figure 2, is a neural operator with quantitative universal approximation guarantees far beyond the Hilbert space setting. It works by first encoding infinite-dimensional problems into finite-dimensional problems. It then predicts outputs by passing the truncated basis coefficients through a feed-forward neural network with trainable (P)ReLU activation functions. Finally, it reassembles them in the output space by interpreting the network's outputs as the coefficients of a pre-specified Schauder basis; alternatively, if both spaces are reproducing kernel Hilbert spaces, then the first few basis functions can be learned from data using principal component analysis^1, e.g., as with PCA-Net [74]. A similar encoding-MLP-decoding scheme was also used in [21] for approximately solving nonlinear Kolmogorov equations on Hilbert spaces. We also note that some infinite-dimensional deep learning models between function spaces on Euclidean domains, such as the DeepONet architecture of [78], replace the basis vectors with trainable deep neural networks; however, this technique does not readily apply to general Fréchet spaces.
^1 Or a robust version thereof, e.g. [39], and then normalizing and orthogonalizing via Gram-Schmidt.
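To make the encode-MLP-decode pipeline of Figure 2 concrete, here is a minimal sketch under purely illustrative assumptions: both the input and output spaces are taken to be $L^2[0,1]$ with the orthonormal sine basis, the truncation level and MLP weights are arbitrary, and the basis coefficients are computed by a simple Riemann sum.

```python
import numpy as np

GRID = np.linspace(0.0, 1.0, 513)
DT = GRID[1] - GRID[0]
N_BASIS = 6                                        # truncation level (illustrative)

def basis(k, t):
    """Orthonormal sine basis of L^2[0,1]: sqrt(2) sin(k pi t), k = 1, 2, ..."""
    return np.sqrt(2.0) * np.sin(k * np.pi * t)

def encode(f_vals):
    """Truncated (Schauder) basis coefficients of a function sampled on GRID."""
    return np.array([np.sum(f_vals * basis(k, GRID)) * DT for k in range(1, N_BASIS + 1)])

def decode(coeffs):
    """Reassemble an output-space element from the network's output coefficients."""
    return sum(c * basis(k, GRID) for k, c in enumerate(coeffs, start=1))

def mlp(z, layers):
    """Feed-forward ReLU network acting on the truncated coefficients."""
    for W, b in layers[:-1]:
        z = np.maximum(W @ z + b, 0.0)
    W, b = layers[-1]
    return W @ z + b

rng = np.random.default_rng(1)
layers = [(0.3 * rng.standard_normal((16, N_BASIS)), np.zeros(16)),
          (0.3 * rng.standard_normal((N_BASIS, 16)), np.zeros(N_BASIS))]

f = np.sin(2 * np.pi * GRID) + 0.5 * GRID          # an input "function" on [0, 1]
g = decode(mlp(encode(f), layers))                 # the neural filter's output function
print(np.round(encode(f), 3), g.shape)
```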
Our "static" approximation theorems provide quantitative approximation guarantees for several "neural operators" used in practice, especially in the numerical Partial Differential Equations (PDEs) literature, e.g., [61], and in the inverse-problem literature, e.g., [2,18,3,19,28]. In the static case, the same argument also applies to the general qualitative (rate-free) approximation theorems of [97,12,72].
We now describe in more detail the different areas to which the present paper contributes.
Our contribution in the Approximation Theory of Neural Operators. Our results provide the first set of quantitative approximation guarantees for generalized dynamical systems evolving on general infinite-dimensional spaces. By refining the memorizing-hypernetwork argument of [1], together with our general solution to the static universal approximation problem in the class of Hölder functions^2, we are able to confirm a well-known piece of folklore from the approximation of dynamical systems literature. Namely, that increasing a sequential neural operator's latent space dimension by a positive integer $Q$, and our neural network's depth^3 by $\tilde{\mathcal{O}}(TQ\log(TQ))$ and width by $\tilde{\mathcal{O}}(QT^{Q})$, implies that we may approximate $\mathcal{O}(T)$ more time-steps into the future with the same prescribed approximation error.
To the best of our knowledge, our dynamic result is the only quantitative universal approximation theorem guaranteeing that a recurrent neural network model can approximate any suitably regular infinite-dimensional non-linear dynamical system. Likewise, our static result is, to the best of our knowledge, the only general infinite-dimensional guarantee showing that a neural operator enjoys favourable approximation rates when the target map is smooth enough.
Our contribution in the Approximation Theory of RNNs. In the finite-dimensional context, CNOs become strict sub-structures of full RNNs, in which the internal parameters are updated/generated via an auxiliary hypernetwork. Noticing this structural inclusion, our results rigorously support the folklore that RNNs may be more suitable than feedforward neural networks (FFNNs, henceforth) when approximating causal maps; see Section 5. This is because our theory yields expression rates for RNN approximations of causal maps between finite-dimensional spaces which are more efficient than the currently available comparable rates for FFNNs.
Technical contributions: Our results apply to sequences of non-linear operators between any "good linear" metric spaces. By a "good linear" metric space we mean any Fréchet space admitting a Schauder basis. This includes many natural examples (e.g., the sequence space $\mathbb{R}^{\mathbb{N}}$ with its usual metric) outside the scope of the Banach-space (carrying a Schauder basis), Hilbert-space^4, and Euclidean settings, all of which are completely subsumed by our assumptions. In other words, we treat the most general tractable linear setting in which one can hope to obtain quantitative universal approximation theorems.
Organization of our paper. This research project answers theoretical deep learning questions by combining tools from approximation theory, functional analysis, and stochastic analysis. Therefore, we provide a concise exposition of each of the relevant tools from these areas in our "preliminaries" Section 2.
Section 3 contains our quantitative universal approximation theorems. In the static case, we derive expression rates for the static component of our model, namely the neural filters, which depend on the regularity of the target operator being approximated, from Hölder trace-class to smooth trace-class, and on the usual quantities^5. Our main approximation theorem in the dynamic case additionally encodes the target causal map's memory decay rate.
Section 4.2 applies our main results to derive approximation guarantees for the solution operators of a broad range of SDEs with stochastic coefficients, possibly having jumps ("stochastic discontinuities") at times on a pre-specified time-grid and with initial random noise. Section 5 examines the implications of our approximation rates for RNNs in the finite-dimensional setting, where we find that RNNs are strictly more efficient than FFNNs when approximating causal maps. Section 6 concludes. Finally, Appendix A contains any background material required in the derivations of our main results, which are relegated to Appendix B, and Appendix D contains auxiliary background material on Fréchet spaces and generalized inverses.
^2 By universality here, we mean that every $\alpha$-Hölder function can be approximated by our "static model", for any $0 < \alpha \le 1$. N.B., when all spaces are finite-dimensional, this implies the classical notion of universal approximation, formulated in [54], since compactly supported smooth functions are 1-Hölder (i.e., Lipschitz) and these are dense in the space of continuous functions between two Euclidean spaces equipped with the topology of uniform convergence on compact sets.
^3 We use $\tilde{\mathcal{O}}$ to omit terms depending logarithmically on $Q$ and $T$.
^4 Note that every separable Hilbert space carries an orthonormal Schauder basis; so, for the reader interested in Hilbert input and output spaces, these conditions are automatically satisfied in that setting.
^5 Such as the compact set's diameter.

1.1 Notation

For the sake of the reader, we collect and define here the notation we will use in the rest of the paper, or we indicate the exact point where a symbol first appears:
1. $\mathbb{N}_+$: the set of natural numbers strictly greater than zero, i.e., $1, 2, 3, \dots$. On the other hand, we use $\mathbb{N}$ to denote the positive integers, and $\mathbb{Z}$ to denote the integers.
2. $[[N]]$: the set of natural numbers between $1$ and $N$, $N \in \mathbb{N}_+$, i.e., $[[N]] = \{1, \dots, N\}$.
3. Given a topological vector space $(F, \tau)$, $F'$ will denote its topological dual, namely the space of continuous linear forms on $F$.
4. Given two topological vector spaces $(E, \sigma)$ and $(F, \tau)$, $L(E, F)$ denotes the space of continuous linear operators from $E$ into $F$; if $E = F$, then we will write $L(E) = L(E, E)$.
5. Given a Fréchet space $F$, we use $\langle \cdot, \cdot \rangle$ to denote the canonical pairing of $F$ with its topological dual $F'$.
6. We denote the open ball of radius $r > 0$ about a point $x$ in a metric space $(X, d)$ by $\mathrm{Ball}_{(X,d)}(x, r) \stackrel{\mathrm{def.}}{=} \{u \in X : d(x, u) < r\}$.
7. We denote the closure of a set $A$ in a metric space $(X, d)$ by $\overline{A}$.
8. $\mathcal{P}, p_k$: Section 2.1.
9. $\Phi$: Eq. (2).
10. $\beta^F_k$, with $F$ a Fréchet space: Eq. (7).
11. $d_{F:n}$, with $F$ a Fréchet space: Eq. (95).
12. $[d], P([d])$: Section 2.2.
13. $P_{F:n}, I_{F:n}$, where $F$ is a Fréchet space: Eqs. (11) and (12); furthermore, $A_{F:n} \stackrel{\mathrm{def.}}{=} I_{F:n} \circ P_{F:n}$.
14. $C^{k,\lambda}_{\mathrm{tr}}(K, B)$ and $C^{\lambda}_{\alpha,\mathrm{tr}}(K, B)$: Definitions 4 and 5.
15. $\psi_n$ and $\varphi_n$: Eqs. (14)-(15).
16. The canonical projection onto the $n$-th coordinate of an $x \in \prod_{n \in \mathbb{Z}} X_n$ is denoted by $x_n$, where each $X_n$ is an arbitrary non-empty set. In particular, if $f: A \to \prod_{n \in \mathbb{Z}} X_n$, with $A$ an arbitrary non-empty set, then $f(x)_n$ denotes the projection of $f(x) \in \prod_{n \in \mathbb{Z}} X_n$ onto the $n$-th coordinate.
17. $\mathcal{NF}^{(P)\mathrm{ReLU}}_{[n]}$: the set of neural filters from $B$ to $E$.
18. $V$: the "special function", defined as the inverse of the map^6 $u \mapsto u^4 \log^3(u + 2)$ on $[0, \infty)$ (see the numerical sketch following this list).
19. $f^{-}$: the generalized inverse of a real-valued increasing function $f$ on $\mathbb{R}$; see Appendix D.2.
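As a purely numerical illustration of items 18-19 (not part of the paper's constructions), the special function $V$ can be evaluated by inverting the strictly increasing map $u \mapsto u^4 \log^3(u + 2)$ via bisection; the natural logarithm is assumed below.

```python
import math

def g(u: float) -> float:
    """The strictly increasing surjection u -> u^4 log^3(u + 2) on [0, infinity)."""
    return u**4 * math.log(u + 2.0)**3

def V(y: float, tol: float = 1e-12) -> float:
    """Return the unique u >= 0 with g(u) = y (assumes y >= 0), by bisection."""
    lo, hi = 0.0, 1.0
    while g(hi) < y:          # bracket the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(V(g(3.7)))              # ~3.7, up to the bisection tolerance
```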
2 Preliminaries
In this section, we recall some preparatory material needed for the derivations of the main results of this paper. Finally, we remark that the notation in each of the subsequent subsections is self-contained and is the one used in the cited references; it will be up to the reader to contextualize it in the subsequent sections.
2.1 Fréchet spaces
The main references for this subsection are the following: [48], Part I; [25], Chapter IV; [93], Chapter III; and the working paper [14]. All the vector spaces we deal with will be vector spaces over $\mathbb{R}$. Before defining a Fréchet space, we recall that a locally convex topological vector space, say $(F, \tau)$, is a topological vector space whose topology $\tau$ arises from a collection of seminorms $\mathcal{P}$. When clear from the context, we will write $F$ instead of $(F, \tau)$. The topology is Hausdorff if and only if for every $x \in F$ with $x \neq 0$ there exists a $p \in \mathcal{P}$ such that $p(x) > 0$. On the other hand, the topology is metrizable if and only if it may be induced by a countable collection $\mathcal{P} = \{p_k\}_{k \in \mathbb{N}_+}$ of seminorms, which we may assume to be increasing, namely $p_k(\cdot) \le p_{k+1}(\cdot)$ for all $k \in \mathbb{N}_+$.
Definition 1 (Fréchet space) A Fréchet space $F$ is a complete metrizable locally convex topological vector space.
Evidently, every Banach space $(F, \|\cdot\|_F)$ is a Fréchet space; in this case, simply $\mathcal{P} = \{\|\cdot\|_F\}$. A canonical choice for the metric $d_F$ on a Fréchet space $F$ (that generates the pre-existing topology) is given by:
$$ d_F(x, y) \stackrel{\mathrm{def.}}{=} \sum_{k=1}^{\infty} 2^{-k}\, \Phi\bigl(p_k(x - y)\bigr), \qquad x, y \in F, \tag{1} $$
where
$$ \Phi(t) \stackrel{\mathrm{def.}}{=} \frac{t}{1 + t}, \qquad t \ge 0. \tag{2} $$
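As a concrete instance (our own illustrative example), take the sequence space $\mathbb{R}^{\mathbb{N}}$ mentioned earlier with the increasing seminorms $p_k(x) \stackrel{\mathrm{def.}}{=} \max_{1 \le i \le k} |x_i|$; then (1)-(2) specialize to
$$ d_{\mathbb{R}^{\mathbb{N}}}(x, y) = \sum_{k=1}^{\infty} 2^{-k}\, \frac{\max_{1 \le i \le k} |x_i - y_i|}{1 + \max_{1 \le i \le k} |x_i - y_i|}, $$
which metrizes the topology of coordinatewise convergence.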
We now recall the concept of the directional derivative of a function between two Fréchet spaces. This notion of differentiation is significantly weaker than the concept of the derivative of a function between two Banach spaces. Nevertheless, it is the weakest notion of differentiation for which many of the familiar theorems from calculus hold. In particular, the chain rule is true (cf. [48]). Let $F$ and $G$ be Fréchet spaces, $U$ an open subset of $F$, and $P: U \subseteq F \to G$ a continuous map.
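For the reader's convenience, and with the caveat that the paper's exact notation may differ, the directional (Gateaux) derivative used in [48] is
$$ DP(x)\{v\} \stackrel{\mathrm{def.}}{=} \lim_{t \to 0} \frac{P(x + t v) - P(x)}{t}, \qquad x \in U,\ v \in F, $$
whenever the limit exists in $G$; $P$ is then called continuously differentiable on $U$ when this limit exists for all $x \in U$ and $v \in F$, and the map $(x, v) \mapsto DP(x)\{v\}$ is jointly continuous from $U \times F$ into $G$.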
^6 The map $u \mapsto u^4 \log^3(u + 2)$ is a continuous and strictly increasing surjection of $[0, \infty)$ onto itself; whence, $V$ is well-defined.