Designing Universal Causal Deep Learning Models: The Case of Infinite-Dimensional
Dynamical Systems from Stochastic Analysis
Luca Galimberti · Anastasis Kratsios · Giulia Livieri
Abstract Several non-linear operators in stochastic analysis, such as solution maps to stochastic differential equations, depend on a temporal structure which is not leveraged by contemporary neural operators designed to approximate general maps between Banach spaces. This paper therefore proposes an operator learning solution to this open problem by introducing a deep learning model-design framework that takes suitable infinite-dimensional linear metric spaces, e.g. Banach spaces, as inputs and returns a universal sequential deep learning model adapted to these linear geometries and specialized for the approximation of operators encoding a temporal structure. We call these models Causal Neural Operators. Our main result states that the models produced by our framework can uniformly approximate, on compact sets and across arbitrary finite-time horizons, Hölder or smooth trace-class operators which causally map sequences between given linear metric spaces. Our analysis uncovers new quantitative relationships on the latent state-space dimension of Causal Neural Operators, which even have new implications for (classical) finite-dimensional Recurrent Neural Networks. In addition, our guarantees for recurrent neural networks are tighter than the available results inherited from feedforward neural networks when approximating dynamical systems between finite-dimensional spaces.
Keywords Universal Approximation, Causality, Operator Learning, Linear Widths.
Mathematics Subject Classification (2020) 68T07 · 91-08 · 37A50 · 65C30 · 60G35 · 41A65
1 Introduction
Infinite-dimensional (non-linear) dynamical systems play a central role in several sciences, especially for disciplines
driven by stochastic analytic modeling. However, despite this fact, the causal neural network approximation theory
for most relevant dynamical systems in stochastic analysis is lacking. Indeed, we currently only understand neural network approximations of stochastic differential equations (SDEs) with deterministic coefficients (e.g., [43]) and of time-invariant random dynamical systems with the fading memory and echo state property/unique solution property (e.g., [79,44]). A significant open problem is the causal neural network approximation of solution operators to non-Markovian SDEs.
Moreover, the understanding of how sequential DL models work is still not fully developed, even in the classical
finite-dimensional setting. For instance, the seemingly elementary empirical fact that a sequential DL model’s
expressiveness increases when one utilizes a high-dimensional latent state space is understood qualitatively for
general dynamical systems on Euclidean spaces (as in the reservoir computing literature (e.g., [41])).
L. Galimberti
King’s College London
Department of Mathematics
Strand Building, Strand, London, WC2R 2LS
E-mail: luca.galimberti@kcl.ac.uk
A. Kratsios
McMaster University and The Vector Institute
Department of Mathematics
1280 Main Street West, Hamilton, Ontario, L8S 4K1, Canada
E-mail: kratsioa@mcmaster.ca
G. Livieri
London School of Economics (LSE)
Department of Statistics
Columbia House, Houghton Street, London, WC2A 2AE
E-mail: g.livieri@lse.ac.uk
However, the quantitative understanding of the relationship between a sequential learning model’s state and
its expressiveness remains an open problem. One notable exception to this fact is the approximation of linear
state-space dynamical systems by a stylized class of Recurrent Neural Networks (RNNs, henceforth); see [56,77].
Our contribution. Our paper provides a simple quantitative solution to a far-reaching generalization of the above problem: constructing neural network approximations of infinite-dimensional (generalized) dynamical systems on "good" linear metric spaces. More precisely, we construct a neural network approximation of any function $f$ that "causally" and "regularly" maps sequences $(x_{t_n})_{n=-\infty}^{\infty}$ to sequences $(y_{t_n})_{n=-\infty}^{\infty}$, where each $x_{t_n}$ and every $y_{t_n}$ lives in a suitable linear metric space. In particular, we construct our causal neural network approximation framework on the following desiderata:
(D1) Predictions are causal, i.e., each $y_{t_n}$ is predicted independently of $(x_{t_m})_{m>n}$ (see the sketch following this list).
(D2) Each $y_{t_n}$ is predicted with a small neural network specialized at time $t_n$.
(D3) Only one of these specialized networks is stored in working memory at a time.
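To make (D1) concrete, the following minimal sketch (an illustration only; the window length and weights are hypothetical and not part of the constructions in this paper) implements a simple causal map on real-valued sequences and checks numerically that perturbing future inputs leaves past outputs unchanged.

```python
import numpy as np

def causal_map(x, window=3, weights=(0.5, 0.3, 0.2)):
    """y_n depends only on (x_{n-window+1}, ..., x_n): a causal map of sequences."""
    w = np.asarray(weights)
    y = np.empty_like(x, dtype=float)
    for n in range(len(x)):
        past = x[max(0, n - window + 1): n + 1]          # never looks beyond index n
        y[n] = past @ w[-len(past):]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
x_future_perturbed = x.copy()
x_future_perturbed[6:] += 10.0                           # tamper only with the "future"

y, y_pert = causal_map(x), causal_map(x_future_perturbed)
assert np.allclose(y[:6], y_pert[:6])                    # (D1): y_0, ..., y_5 are unchanged
print(y[:6])
```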
We begin by describing our causal neural network model's design. Subsequently, we discuss our approximation theory's implications for computational stochastic analysis.
Fig. 1: The Causal Neural Operator Model.
Summary: A universal approximator of regular causal sequences of operators between well-behaved Fréchet spaces.
Overview: The model successively applies a "universal" neural filter (see Figure 2) on consecutive time-windows; the internal parameters of this neural filter evolve according to a latent dynamical system on the neural filter's parameter space, implemented by a deep ReLU network called a hypernetwork.
Our neural network model, which we call the Causal Neural Operator (CNO, henceforth), is illustrated in Figure 1 and works in the following way. At any given time $t_n$, it predicts an instance of the output time-series at that time $t_n$ using an immediate time-window from the input time-series (e.g., it predicts each $y_{t_n}$ using only $(x_{t_i})_{i=n-10}^{n}$). At each time $t_n$, this prediction is generated by a non-linear operator defined by a finitely parameterized neural network model, called a neural filter (the vertical black arrows in Figure 1). Our neural network model stores only one neural filter's parameters in working memory at the current time by using an auxiliary deep ReLU neural network, called a hypernetwork in the machine learning literature (e.g., [47,103]), to generate the next neural filter specialized at $t_{n+1}$ using only the parameters of the current "active" neural filter specialized at time $t_n$ (the blue box in Figure 1). Thus, a dynamical system (i.e., the hypernetwork) on the neural filter's parameter space, interpolating between each neural filter's parameters, encodes our entire model.
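The forward recursion just described can be summarized by the following minimal sketch. It is an illustration only, not the exact architecture analyzed in this paper: the filter and hypernetwork dimensions, the helper `unpack`, and the random initialization are hypothetical, and the filter is assumed to act on already-encoded (finite-dimensional) window coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID, D_OUT = 8, 16, 4                    # illustrative basis-truncation dimensions

def relu(z):
    return np.maximum(z, 0.0)

def unpack(theta):
    """Split a flat parameter vector into the filter's two affine layers."""
    i = 0
    W1 = theta[i:i + D_HID * D_IN].reshape(D_HID, D_IN); i += D_HID * D_IN
    b1 = theta[i:i + D_HID]; i += D_HID
    W2 = theta[i:i + D_OUT * D_HID].reshape(D_OUT, D_HID); i += D_OUT * D_HID
    b2 = theta[i:i + D_OUT]
    return W1, b1, W2, b2

N_PARAMS = D_HID * D_IN + D_HID + D_OUT * D_HID + D_OUT

def neural_filter(theta, x_window):
    """Apply the finitely parameterized filter encoded by theta to one time-window."""
    W1, b1, W2, b2 = unpack(theta)
    return W2 @ relu(W1 @ x_window + b1) + b2

def hypernetwork(theta, layers):
    """Deep ReLU map on the filter's parameter space: theta_n -> theta_{n+1}."""
    z = theta
    for W, b in layers[:-1]:
        z = relu(W @ z + b)
    W, b = layers[-1]
    return W @ z + b

def cno_forward(x_windows, theta0, hyper_layers):
    """Only one filter's parameters live in working memory at any time (D3)."""
    theta, ys = theta0, []
    for x_win in x_windows:                      # x_win ~ coefficients of (x_{t_{n-J}}, ..., x_{t_n})
        ys.append(neural_filter(theta, x_win))   # causal prediction of y_{t_n}: (D1), (D2)
        theta = hypernetwork(theta, hyper_layers)
    return ys

theta0 = 0.1 * rng.standard_normal(N_PARAMS)
hyper = [(0.01 * rng.standard_normal((N_PARAMS, N_PARAMS)), np.zeros(N_PARAMS)) for _ in range(2)]
windows = [rng.standard_normal(D_IN) for _ in range(5)]
print([y.shape for y in cno_forward(windows, theta0, hyper)])
```

Note that the loop above holds only the current filter's parameter vector, discarding the previous one once the hypernetwork has produced its successor.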
The principal approximation-theoretic advantage of this approach lies in the fact that the hypernetwork is not designed to approximate anything; rather, it only needs to memorize/interpolate a finite number of finite-dimensional (parameter) vectors. Since memorization (e.g., [102,68,52]) requires only a polynomial number of parameters to achieve zero approximation error on a finite set, while approximation (e.g., [108,69,109,70]) requires an exponential number of parameters to achieve a possibly non-zero error over a large set containing the finite set of interest, leveraging memorization yields both lighter (fewer parameters) and more accurate deep learning models; that is, the constructed neural network model is exponentially more efficient. In particular, using a neural network for memorization allows the trained DL model to generalize beyond the data it is interpolating, a capability
that a simple list does not possess. When both the input and output spaces are finite-dimensional, our models effectively reduce to RNNs, which are known for their ability to generalize beyond their training data [104]. This generalization is attributed to factors such as having a finite VC (Vapnik-Chervonenkis) dimension [65,94] or finite Rademacher complexity [58]. Thus, this neural network design allows us to successfully encode all the parameters required to approximate long stretches of time $\{t_0, \dots, t_N\}$ (for large $N$) with far fewer parameters (i.e., at the cost of $\mathcal{O}(\log(N))$ additional layers in the hypernetwork). In this way, we achieve desiderata (D1)-(D3), provided that each neural filter relies on only a small number of parameters. We show that this is the case whenever $f$ is "sufficiently smooth"; the rigorous formulations of all these outlined ideas are given in Lemma 5 and Theorem 2.
Fig. 2: The Neural Filter.
Summary: A universal approximator of regular maps between any well-behaved Fréchet spaces.
Overview: The neural filter first encodes inputs from a (possibly infinite-dimensional) linear space by approximately representing the input as coefficients of a sparse (Schauder) basis. These basis coefficients are then transformed by a deep ReLU network, and the network's outputs are decoded as the coefficients of a sparse basis representation of an element of the output linear space. Assembling the basis using the outputted coefficients produces the neural filter's output.
Though we are focused on the approximation-theoretic properties of our modeling framework, we have designed our CNO by accounting for practical considerations. Namely, we intentionally designed the CNO model so that, like transformer networks [101], it can be trained non-recursively (via our federated training algorithm, see Algorithm 1 below). This design choice is motivated by the main reason why the transformer network model (e.g., [101]) has replaced its residual (e.g., [49]) and RNN (especially Long Short-Term Memory (LSTM, henceforth) [51]) counterparts in practice (e.g., [53,106]); namely, not back-propagating through time during training. Omitting any recurrence relation between a model's predictions in sequential prediction tasks, at least during the model's construction, has been empirically confirmed to yield more reliable and accurate models, trained faster and without vanishing or exploding gradient problems; see, e.g., [50,88]. Nevertheless, our model does ultimately reap the benefits of recursive models even if we construct it non-recursively, using our parallelizable training procedure.
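The following sketch illustrates one plausible reading of such a non-recursive, parallelizable training scheme; it is an illustration only and not Algorithm 1. For brevity, the per-time filters are linear and fit by gradient descent, each time step is fit independently (no back-propagation through time), and a single affine map plays the role of the hypernetwork, fit in closed form to memorize the transitions between consecutive filters' parameters.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)

def fit_filter(window_data, n_params=32, steps=200, lr=0.05):
    """Fit one (here: linear) filter for a single time step by gradient descent."""
    X, Y = window_data                          # X: (samples, n_params), Y: (samples,)
    theta = np.zeros(n_params)
    for _ in range(steps):
        grad = X.T @ (X @ theta - Y) / len(Y)
        theta -= lr * grad
    return theta

# one synthetic (X, Y) training set per time step t_0, ..., t_7
data = [(rng.standard_normal((64, 32)), rng.standard_normal(64)) for _ in range(8)]

with ThreadPoolExecutor() as pool:              # each filter is fit independently, in parallel
    thetas = list(pool.map(fit_filter, data))

# fit the "hypernetwork" (a single affine map here, for brevity) so that it
# memorizes the finite set of transitions theta_n -> theta_{n+1}
A, B = np.stack(thetas[:-1]), np.stack(thetas[1:])
H, *_ = np.linalg.lstsq(A, B, rcond=None)       # theta_{n+1} ≈ theta_n @ H
print(np.abs(A @ H - B).max())                  # memorization error (≈ 0 on this finite set)
```

The closed-form final step reflects the point made earlier: the hypernetwork only needs to interpolate a finite collection of parameter vectors, not approximate a map over a large set.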
The neural filter, illustrated in Figure 2, is a neural operator with quantitative universal approximation guarantees far beyond the Hilbert space setting. It works by first encoding infinite-dimensional problems into finite-dimensional problems. It then predicts outputs by passing the truncated basis coefficients through a feed-forward neural network with trainable (P)ReLU activation functions. Finally, it reassembles them in the output space by interpreting the network's outputs as the coefficients of a pre-specified Schauder basis; alternatively, if both spaces are reproducing kernel Hilbert spaces, then the first few basis functions can be learned from data using principal component analysis^1, e.g., as with PCA-Net [74]. A similar encoding-MLP-decoding scheme was also used in [21] for approximately solving nonlinear Kolmogorov equations on Hilbert spaces. We also note that some infinite-dimensional deep learning models between function spaces on Euclidean domains, such as the DeepONet architecture of [78], replace the basis vectors with trainable deep neural networks; however, this technique does not readily apply to general Fréchet spaces.
^1 Or a robust version thereof, e.g. [39], and then normalizing and orthogonalizing via Gram-Schmidt.
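To make the encode-MLP-decode pipeline of Figure 2 concrete, here is a minimal sketch under purely illustrative assumptions: both the input and output spaces are taken to be $L^2[0,1]$ with the orthonormal sine basis, the truncation level and MLP weights are arbitrary, and the basis coefficients are computed by a simple Riemann sum.

```python
import numpy as np

GRID = np.linspace(0.0, 1.0, 513)
DT = GRID[1] - GRID[0]
N_BASIS = 6                                        # truncation level (illustrative)

def basis(k, t):
    """Orthonormal sine basis of L^2[0,1]: sqrt(2) sin(k pi t), k = 1, 2, ..."""
    return np.sqrt(2.0) * np.sin(k * np.pi * t)

def encode(f_vals):
    """Truncated (Schauder) basis coefficients of a function sampled on GRID."""
    return np.array([np.sum(f_vals * basis(k, GRID)) * DT for k in range(1, N_BASIS + 1)])

def decode(coeffs):
    """Reassemble an output-space element from the network's output coefficients."""
    return sum(c * basis(k, GRID) for k, c in enumerate(coeffs, start=1))

def mlp(z, layers):
    """Feed-forward ReLU network acting on the truncated coefficients."""
    for W, b in layers[:-1]:
        z = np.maximum(W @ z + b, 0.0)
    W, b = layers[-1]
    return W @ z + b

rng = np.random.default_rng(1)
layers = [(0.3 * rng.standard_normal((16, N_BASIS)), np.zeros(16)),
          (0.3 * rng.standard_normal((N_BASIS, 16)), np.zeros(N_BASIS))]

f = np.sin(2 * np.pi * GRID) + 0.5 * GRID          # an input "function" on [0, 1]
g = decode(mlp(encode(f), layers))                 # the neural filter's output function
print(np.round(encode(f), 3), g.shape)
```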
Our "static" approximation theorems provide quantitative approximation guarantees for several "neural operators" used in practice, especially in the numerical Partial Differential Equations (PDEs) literature, e.g., [61], and in the inverse-problem literature, e.g., [2,18,3,19,28]. In the static case, the same argument also applies to the general qualitative (rate-free) approximation theorems of [97,12,72].
We now describe in more detail the different areas to which the present paper contributes.
Our contribution in the Approximation Theory of Neural Operators. Our results provide the first set of quantitative approximation guarantees for generalized dynamical systems evolving on general infinite-dimensional spaces. By refining the memorizing-hypernetwork argument of [1], together with our general solution to the static universal approximation problem in the class of Hölder functions^2, we are able to confirm a well-known piece of folklore from the approximation of dynamical systems literature. Namely, that increasing a sequential neural operator's latent space dimension by a positive integer $Q$, and our neural network's depth^3 by $\tilde{\mathcal{O}}(TQ\log(TQ))$ and width by $\tilde{\mathcal{O}}(QT^{Q})$, implies that we may approximate $\mathcal{O}(T)$ more time-steps into the future with the same prescribed approximation error.
To the best of our knowledge, our dynamic result is the only quantitative universal approximation theorem guaranteeing that a recurrent neural network model can approximate any suitably regular infinite-dimensional non-linear dynamical system. Likewise, our static result is, to the best of our knowledge, the only general infinite-dimensional guarantee showing that a neural operator enjoys favourable approximation rates when the target map is smooth enough.
Our contribution in the Approximation Theory of RNNs. In the finite-dimensional context, CNOs become strict sub-structures of full RNNs, in which the internal parameters are updated/generated via an auxiliary hypernetwork. Noticing this structural inclusion, our results rigorously support the folklore that RNNs may be more suitable than feedforward neural networks (FFNNs, henceforth) when approximating causal maps; see Section 5. This is because our theory yields expression rates for RNN approximations of causal maps between finite-dimensional spaces which are more efficient than the currently available comparable rates for FFNNs.
Technical contributions: Our results apply to sequences of non-linear operators between any "good linear" metric spaces. By a "good linear" metric space we mean any Fréchet space admitting a Schauder basis. This includes many natural examples (e.g., the sequence space $\mathbb{R}^{\mathbb{N}}$ with its usual metric) outside the scope of the Banach-space (carrying a Schauder basis), Hilbert-space^4, and Euclidean settings, all of which are completely subsumed by our assumptions. In other words, we treat the most general tractable linear setting in which one can hope to obtain quantitative universal approximation theorems.
Organization of our paper. This research project answers theoretical deep learning questions by combining tools from approximation theory, functional analysis, and stochastic analysis. Therefore, we provide a concise exposition of each of the relevant tools from these areas in our "preliminaries" Section 2.
Section 3 contains our quantitative universal approximation theorems. In the static case, we derive expression rates for the static component of our model, namely the neural filters, which depend on the regularity of the target operator being approximated, from Hölder trace-class to smooth trace-class, and on the usual quantities^5. Our main approximation theorem in the dynamic case additionally encodes the target causal map's memory decay rate.
Section 4.2 applies our main results to derive approximation guarantees for the solution operators of a broad range of SDEs with stochastic coefficients, possibly having jumps ("stochastic discontinuities") at times on a pre-specified time-grid and with initial random noise. Section 5 examines the implications of our approximation rates for RNNs in the finite-dimensional setting, where we find that RNNs are strictly more efficient than FFNNs when approximating causal maps. Section 6 concludes. Finally, Appendix A contains any background material required in the derivations of our main results, which are relegated to Appendix B, and Appendix D contains auxiliary background material on Fréchet spaces and generalized inverses.
^2 By universality here, we mean that every $\alpha$-Hölder function can be approximated by our "static model", for any $0 < \alpha \le 1$. N.B., when all spaces are finite-dimensional, this implies the classical notion of universal approximation, formulated in [54], since compactly supported smooth functions are 1-Hölder (i.e., Lipschitz) and these are dense in the space of continuous functions between two Euclidean spaces equipped with the topology of uniform convergence on compact sets.
^3 We use $\tilde{\mathcal{O}}$ to omit terms depending logarithmically on $Q$ and $T$.
^4 Note that every separable Hilbert space carries an orthonormal Schauder basis; so, for the reader interested in Hilbert input and output spaces, these conditions are automatically satisfied in that setting.
^5 Such as the compact set's diameter.

1.1 Notation

For the sake of the reader, we collect and define here the notation we will use in the rest of the paper, or we indicate the exact point where a symbol first appears:
1. $\mathbb{N}_+$: the set of natural numbers strictly greater than zero, i.e., $1, 2, 3, \dots$. On the other hand, we use $\mathbb{N}$ to denote the positive integers, and $\mathbb{Z}$ to denote the integers.
2. $[[N]]$: the set of natural numbers between $1$ and $N$, $N \in \mathbb{N}_+$, i.e., $[[N]] = \{1, \dots, N\}$.
3. Given a topological vector space $(F, \tau)$, $F'$ will denote its topological dual, namely the space of continuous linear forms on $F$.
4. Given two topological vector spaces $(E, \sigma)$ and $(F, \tau)$, $L(E, F)$ denotes the space of continuous linear operators from $E$ into $F$; if $E = F$, then we will write $L(E) = L(E, E)$.
5. Given a Fréchet space $F$, we use $\langle \cdot, \cdot \rangle$ to denote the canonical pairing of $F$ with its topological dual $F'$.
6. We denote the open ball of radius $r > 0$ about a point $x$ in a metric space $(X, d)$ by $\mathrm{Ball}_{(X,d)}(x, r) \stackrel{\mathrm{def.}}{=} \{u \in X : d(x, u) < r\}$.
7. We denote the closure of a set $A$ in a metric space $(X, d)$ by $\overline{A}$.
8. $\mathcal{P}, p_k$: Section 2.1.
9. $\Phi$: Eq. (2).
10. $\beta^F_k$, with $F$ a Fréchet space: Eq. (7).
11. $d_{F:n}$, with $F$ a Fréchet space: Eq. (95).
12. $[d], P([d])$: Section 2.2.
13. $P_{F:n}, I_{F:n}$, where $F$ is a Fréchet space: Eqs. (11) and (12); furthermore, $A_{F:n} \stackrel{\mathrm{def.}}{=} I_{F:n} \circ P_{F:n}$.
14. $C^{k,\lambda}_{\mathrm{tr}}(K, B)$ and $C^{\lambda}_{\alpha,\mathrm{tr}}(K, B)$: Definitions 4 and 5.
15. $\psi_n$ and $\varphi_n$: Eqs. (14)-(15).
16. The canonical projection onto the $n$-th coordinate of an $x \in \prod_{n \in \mathbb{Z}} X_n$ is denoted by $x_n$, where each $X_n$ is an arbitrary non-empty set. In particular, if $f: A \to \prod_{n \in \mathbb{Z}} X_n$, with $A$ an arbitrary non-empty set, then $f(x)_n$ denotes the projection of $f(x) \in \prod_{n \in \mathbb{Z}} X_n$ onto the $n$-th coordinate.
17. $\mathcal{NF}^{(P)\mathrm{ReLU}}_{[n]}$: the set of neural filters from $B$ to $E$.
18. $V$: the "special function", defined as the inverse of the map^6 $u \mapsto u^4 \log^3(u + 2)$ on $[0, \infty)$ (see the numerical sketch following this list).
19. $f^{-}$: the generalized inverse of a real-valued increasing function $f$ on $\mathbb{R}$; see Appendix D.2.
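As a purely numerical illustration of items 18-19 (not part of the paper's constructions), the special function $V$ can be evaluated by inverting the strictly increasing map $u \mapsto u^4 \log^3(u + 2)$ via bisection; the natural logarithm is assumed below.

```python
import math

def g(u: float) -> float:
    """The strictly increasing surjection u -> u^4 log^3(u + 2) on [0, infinity)."""
    return u**4 * math.log(u + 2.0)**3

def V(y: float, tol: float = 1e-12) -> float:
    """Return the unique u >= 0 with g(u) = y (assumes y >= 0), by bisection."""
    lo, hi = 0.0, 1.0
    while g(hi) < y:          # bracket the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(V(g(3.7)))              # ~3.7, up to the bisection tolerance
```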
2 Preliminaries
In this section, we recall some preparatory material needed for the derivations of the main results of this paper. Finally, we remark that the notation in each of the subsequent subsections is self-contained and is the one used in the cited references; it will be up to the reader to contextualize it in the subsequent sections.
2.1 Fréchet spaces
The main references for this subsection are the following: [48], Part I; [25], Chapter IV; [93], Chapter III; and the working paper [14]. All the vector spaces we deal with will be vector spaces over $\mathbb{R}$. Before defining a Fréchet space, we recall that a locally convex topological vector space, say $(F, \tau)$, is a topological vector space whose topology $\tau$ arises from a collection of seminorms $\mathcal{P}$. When clear from the context, we will write $F$ instead of $(F, \tau)$. The topology is Hausdorff if and only if for every $x \in F$ with $x \neq 0$ there exists a $p \in \mathcal{P}$ such that $p(x) > 0$. On the other hand, the topology is metrizable if and only if it may be induced by a countable collection $\mathcal{P} = \{p_k\}_{k \in \mathbb{N}_+}$ of seminorms, which we may assume to be increasing, namely $p_k(\cdot) \le p_{k+1}(\cdot)$ for all $k \in \mathbb{N}_+$.
Definition 1 (Fréchet space) A Fréchet space $F$ is a complete metrizable locally convex topological vector space.
Evidently, every Banach space $(F, \|\cdot\|_F)$ is a Fréchet space; in this case, simply $\mathcal{P} = \{\|\cdot\|_F\}$. A canonical choice for the metric $d_F$ on a Fréchet space $F$ (that generates the pre-existing topology) is given by:
$$ d_F(x, y) \stackrel{\mathrm{def.}}{=} \sum_{k=1}^{\infty} 2^{-k}\, \Phi\bigl(p_k(x - y)\bigr), \qquad x, y \in F, \tag{1} $$
where
$$ \Phi(t) \stackrel{\mathrm{def.}}{=} \frac{t}{1 + t}, \qquad t \ge 0. \tag{2} $$
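As a concrete instance (our own illustrative example), take the sequence space $\mathbb{R}^{\mathbb{N}}$ mentioned earlier with the increasing seminorms $p_k(x) \stackrel{\mathrm{def.}}{=} \max_{1 \le i \le k} |x_i|$; then (1)-(2) specialize to
$$ d_{\mathbb{R}^{\mathbb{N}}}(x, y) = \sum_{k=1}^{\infty} 2^{-k}\, \frac{\max_{1 \le i \le k} |x_i - y_i|}{1 + \max_{1 \le i \le k} |x_i - y_i|}, $$
which metrizes the topology of coordinatewise convergence.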
We now recall the concept of the directional derivative of a function between two Fréchet spaces. This notion of differentiation is significantly weaker than the concept of the derivative of a function between two Banach spaces. Nevertheless, it is the weakest notion of differentiation for which many of the familiar theorems from calculus hold. In particular, the chain rule is true (cf. [48]). Let $F$ and $G$ be Fréchet spaces, $U$ an open subset of $F$, and $P: U \subseteq F \to G$ a continuous map.
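For the reader's convenience, and with the caveat that the paper's exact notation may differ, the directional (Gateaux) derivative used in [48] is
$$ DP(x)\{v\} \stackrel{\mathrm{def.}}{=} \lim_{t \to 0} \frac{P(x + t v) - P(x)}{t}, \qquad x \in U,\ v \in F, $$
whenever the limit exists in $G$; $P$ is then called continuously differentiable on $U$ when this limit exists for all $x \in U$ and $v \in F$, and the map $(x, v) \mapsto DP(x)\{v\}$ is jointly continuous from $U \times F$ into $G$.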
^6 The map $u \mapsto u^4 \log^3(u + 2)$ is a continuous and strictly increasing surjection of $[0, \infty)$ onto itself; whence, $V$ is well-defined.