Mitigating spectral bias for the multiscale operator learning

Xinliang Liu^a, Bo Xu^b, Shuhao Cao^c, Lei Zhang^d
^a Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
^b School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
^c School of Science and Engineering, University of Missouri-Kansas City, Kansas City, MO 64110, United States
^d School of Mathematical Sciences, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, Shanghai, 200240, China
Abstract
Neural operators have emerged as a powerful tool for learning the mapping between infinite-
dimensional parameter and solution spaces of partial differential equations (PDEs). In this work,
we focus on multiscale PDEs that have important applications such as reservoir modeling and
turbulence prediction. We demonstrate that for such PDEs, the spectral bias towards low-frequency
components presents a significant challenge for existing neural operators. To address this challenge,
we propose a hierarchical attention neural operator (HANO) inspired by the hierarchical matrix
approach. HANO features a scale-adaptive interaction range and self-attentions over a hierarchy of
levels, enabling nested feature computation with controllable linear cost and encoding/decoding of
multiscale solution space. We also incorporate an empirical $H^1$ loss function to enhance the learning
of high-frequency components. Our numerical experiments demonstrate that HANO outperforms
state-of-the-art (SOTA) methods for representative multiscale problems.
Keywords: partial differential equations, operator learning, transformer, multiscale PDE
1. Introduction
In recent years, operator learning methods have emerged as powerful tools for computing
parameter-to-solution maps of partial differential equations (PDEs). In this paper, we focus on the
operator learning for multiscale PDEs (MsPDEs) that encompass multiple temporal/spatial scales.
MsPDE models arise in applications involving heterogeneous and random media, and are crucial
for predicting complex phenomena such as reservoir modeling, atmospheric and ocean circulation,
and high-frequency scattering. Important prototypical examples include multiscale elliptic partial
differential equations, where the diffusion coefficients vary rapidly. The coefficients may be rapidly oscillatory, have a high contrast ratio, or even exhibit a continuum of non-separable scales.
MsPDEs, even with fixed parameters, present great challenges for classical numerical methods [1], as their computational cost typically scales in inverse proportion to the finest scale ε of the problem. To overcome this issue, multiscale solvers have been developed that incorporate microscopic information to achieve a computational cost independent of ε. One such technique is numerical homogenization [2, 3, 4, 5, 6, 7], which identifies low-dimensional approximation spaces adapted to the corresponding multiscale operator. Similarly, fast solvers such as multilevel/multigrid methods [8, 9] and wavelet-based multiresolution methods [10, 11] may face limitations when applied to multiscale PDEs [1], while multilevel methods based on numerical homogenization techniques, such as Gamblets
[12], have emerged as a way to discover scalable multilevel algorithms and operator-adapted wavelets for multiscale PDEs. Low-rank decomposition-based methods are another popular approach to exploit the low-dimensional nature of MsPDEs. Notable examples include the fast multipole method [13], hierarchical matrices ($\mathcal{H}$ and $\mathcal{H}^2$ matrices) [14], and the hierarchical interpolative factorization [15]. These methods can achieve (near-)linear scaling and high computational efficiency by exploiting low-rank approximations of the (elliptic) Green's function [16].
Neural operators, unlike traditional solvers that operate with fixed parameters, are capable of handling a range of input parameters, making them promising for data-driven forward and inverse solving of PDE problems. Pioneering works in operator learning include [17, 18, 19, 20]; nevertheless, they are limited to problems with fixed discretization sizes. Recently, infinite-dimensional operator learning has been studied, which learns the solution operator (mapping) between infinite-dimensional Banach spaces for PDEs. Most notably, the Deep Operator Network (DeepONet) [21] was proposed as a pioneering model to leverage deep neural networks' universal approximation for operators [22]. Taking advantage of the Fast Fourier Transform (FFT), the Fourier Neural Operator (FNO) [23] constructs a learnable parametrized kernel in the frequency domain to render the convolutions in the solution operator more efficient. Other developments include the multiwavelet extension of FNO [24], message-passing neural operators [25], dimension reduction in the latent space [26], Gaussian processes [27], Clifford algebra-inspired neural layers [28], and dilated convolutional residual networks [29].
Attention neural architectures, popularized by the Transformer deep neural network [30], have emerged as universal backbones in deep learning. These architectures serve as the foundation for numerous state-of-the-art models, including GPT [31], Vision Transformer (ViT) [32], and diffusion models [33, 34]. More recently, Transformers have been studied and have become increasingly popular in PDE operator learning problems, e.g., in [35, 36, 37, 38, 39, 40, 41] and many others. Attention architectures offer several advantages. Attention can be viewed as a parametrized, instance-dependent kernel integral that learns a “basis” [35], similar to those in numerical homogenization; see also the exposition featured in neural operators [42]. This layerwise latent updating resembles the learned “basis” in DeepONet [39], or a frame [43]. It is also flexible enough to encode non-uniform geometries in the latent space [44]. In [45, 46], advanced Transformer architectures (ViT) and diffusion models are combined with the neural operator framework. In [47], Transformers are combined with reduced-order modeling to accelerate fluid simulations of turbulent flows. In [48], tensor decomposition techniques are employed to enhance the efficiency of attention mechanisms for solving high-dimensional PDE problems.
Under certain circumstances, these data-driven operator learning models can surpass classical numerical methods in efficiency or even in accuracy. For instance, full waveform inversion is considered in [49] with a fusion model of FNO and DeepONet (Fourier-DeepONet); direct-method-inspired DNNs applied to boundary value Calderón problems achieve much more accurate reconstructions with the help of data [50, 51, 52]; in [53], the capacity of FNO to take significantly larger time steps for spatiotemporal PDEs is exploited to infer wave packet scattering in quantum physics, achieving results that are orders of magnitude more efficient than a traditional implicit Euler marching scheme. [54] exploits the capacity of graph neural networks to accelerate particle-based simulations. [55] investigates the integration of the neural operator DeepONet with classical relaxation techniques, resulting in a hybrid iterative approach. Meanwhile, Wu et al. [56] introduce an asymptotic-preserving convolutional DeepONet designed to capture the diffusive characteristics of multiscale linear transport equations.
For multiscale PDEs, operator learning methods can be viewed as an advancement beyond
multiscale solvers such as numerical homogenization. Operator learning methods have two key
advantages: (1) They can be applied to an ensemble of coefficients/parameters, rather than a single
set of coefficients, which allows the methods to capture the stochastic behaviors of the coefficients; (2)
The decoder in the operator learning framework can be interpreted as a data-driven basis reduction procedure from the (high-dimensional) latent space that approximates the (often lower-dimensional) solution data manifold of the underlying PDEs. This procedure offers automated data adaptation
to the coefficients, enabling accurate representations of the solutions’ distributions. In contrast,
numerical homogenization typically relies on a priori bases that are not adapted to the ensemble of
coefficients. In this regard, the operator learning approach has the potential to yield more accurate
reduced-order models for multiscale PDEs with parametric/random coefficients.
However, for multiscale problems, current operator learning methods have primarily focused on representing the smooth parts of the solution space. This results in the so-called “spectral bias”, leaving the resolution of intrinsic multiscale features as a significant challenge. The spectral bias, also known as the frequency principle [57, 58, 59], states that deep neural networks (DNNs) often struggle to learn high-frequency components of functions that vary at multiple scales. In this regard, Fourier or wavelet-based methods are not always effective for MsPDEs, even for fixed parameters. Neural operators tend to fit low-frequency components faster than high-frequency ones, limiting their ability to accurately capture fine details. When the elliptic coefficients are smooth, the coefficient-to-solution map can be well resolved by the FNO parameterization [23]. Nevertheless, existing neural operators have difficulty learning high-frequency components of multiscale PDEs, as shown in Figure 1 and detailed in Section 3. While universal approximation theorems can be proven for FNO-type models (see, e.g., [60]), achieving a meaningful decay rate requires “extra smoothness”, which may be absent or lead to large constants for MsPDEs. For FNO, this issue was partially addressed in [61], yet the approach there needs an ad-hoc manual tweak on the weights for the chosen modes.
We note that for MsPDEs with fixed parameters, there has been increasing exploration of neural network methods in recent years, despite the spectral bias or frequency principle [57, 58, 59] indicating that deep neural networks (DNNs) often struggle to effectively capture high-frequency components of functions. Specifically designed neural solvers [62, 63, 64] have been developed to mitigate the spectral bias and accurately solve multiscale PDEs with fixed parameters.
Motivated by the aforementioned challenges, we investigate the spectral bias present in existing
neural operators. Inspired by conventional multilevel methods and numerical homogenization, we
propose a new Hierarchical Attention Neural Operator (HANO) architecture to mitigate it for
multiscale operator learning. We also test our model on standard operator learning benchmarks
including the Navier-Stokes equation in the turbulent regime and the Helmholtz equation in the high-wavenumber regime. Our main contributions can be summarized as follows:
• We introduce HANO, which decomposes the input-output mapping into hierarchical levels in an automated fashion and enables nested feature updates through hierarchical local aggregation of self-attentions with a controllable linear computational cost.
• We use an empirical $H^1$ loss function to further reduce the spectral bias and improve the ability to capture the oscillatory features of the multiscale solution space (a minimal sketch of such a loss is given after this list).
• We investigate the spectral bias in existing neural operators and empirically verify that HANO is able to mitigate it. HANO substantially improves accuracy, particularly for approximating derivatives, and generalization for multiscale tasks, compared with state-of-the-art neural operators and efficient attention/Transformer models.
Figure 1: (a) multiscale trigonometric coefficient; (b) slices of the derivative $u_y$ at $x = 0$; (c) absolute error spectrum of HANO in log10 scale; (d) absolute error spectrum of FNO in log10 scale. We illustrate the effectiveness of the HANO scheme on the challenging multiscale trigonometric benchmark, with the coefficients and corresponding solution derivative shown in (a) and (b); see Appendix 3.1.2 for the problem description. We notice that HANO can capture the solution derivatives more accurately, whereas FNO only captures their averaged or homogenized behavior. In (c) and (d), we analyze the error by decomposing it into the frequency domain $[-256\pi, 256\pi]^2$ and plotting the absolute error spectrum. This shows the spectral bias in the existing state-of-the-art model, and also that our method achieves superior performance in predicting fine-scale features, especially in accurately capturing derivatives. We refer readers to Figure 7 in Section 3.1 and Figures 7, 8, 9.
2. Methods
In this section, to address the spectral bias for multiscale operator learning, and motivated by the remarkable performance of attention-based models [30, 65] in computer vision and natural language processing tasks, as well as the effectiveness of the hierarchical matrix approach [14] for multiscale problems, we propose the Hierarchical Attention Neural Operator (HANO) model.
2.1. Operator Learning Problem
We follow the setup in [23, 21] to approximate the operator $\mathcal{S}: a \mapsto u := \mathcal{S}(a)$, with the input/parameter $a \in \mathcal{A}$ drawn from a distribution $\mu$ and the corresponding output/solution $u \in \mathcal{U}$, where $\mathcal{A}$ and $\mathcal{U}$ are infinite-dimensional Banach spaces, respectively. Our aim is to learn the operator $\mathcal{S}$ from a collection of finitely observed input-output pairs through a parametric map $\mathcal{N}: \mathcal{A} \times \Theta \to \mathcal{U}$ and a loss functional $\mathcal{L}: \mathcal{U} \times \mathcal{U} \to \mathbb{R}$, such that the optimal parameter
$$\theta^{\ast} = \arg\min_{\theta \in \Theta} \, \mathbb{E}_{a \sim \mu} \big[ \mathcal{L}(\mathcal{N}(a, \theta), \mathcal{S}(a)) \big].$$
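To make this setup concrete, the following is a minimal PyTorch-style sketch of the corresponding empirical risk minimization over finitely many observed pairs $(a_i, \mathcal{S}(a_i))$; `neural_op`, `loss_fn`, and `data_loader` are generic placeholders, not the actual HANO components.

```python
import torch

def train_operator(neural_op, loss_fn, data_loader, num_epochs=100, lr=1e-3):
    """Minimize the empirical risk E_{a~mu}[ L(N(a, theta), S(a)) ]
    over a finite sample of input/output pairs (a_i, u_i)."""
    optimizer = torch.optim.Adam(neural_op.parameters(), lr=lr)
    for epoch in range(num_epochs):
        for a, u in data_loader:        # a: discretized parameter, u: reference solution S(a)
            u_pred = neural_op(a)       # N(a, theta)
            loss = loss_fn(u_pred, u)   # loss functional L, e.g. a relative L2 or H1 loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return neural_op
```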
2.1.1. Hierarchical Discretization
To develop a hierarchical attention, we first assume that there is a hierarchical discretization of the spatial domain $D$. For an input feature map that is defined on a partition of $D$, for example, of resolution $8 \times 8$ patches, we define $\mathcal{I}^{(3)} := \{\, \boldsymbol{i} = (i_1, i_2, i_3) \mid i_1, i_2, i_3 \in \{0, 1, 2, 3\} \,\}$ as the finest-level index set, in which each index $\boldsymbol{i}$ corresponds to a patch token characterized by a feature vector $\boldsymbol{f}^{(3)}_{\boldsymbol{i}} \in \mathbb{R}^{C^{(3)}}$. For a token $\boldsymbol{i} = (i_1, i_2, i_3)$, its parent token $\boldsymbol{j} = (i_1, i_2)$ aggregates finer-level tokens (e.g., $(1,1)$ is the parent of $(1,1,0)$, $(1,1,1)$, $(1,1,2)$, $(1,1,3)$ in Figure 2) and is characterized by a feature vector $\boldsymbol{f}^{(2)}_{\boldsymbol{j}} \in \mathbb{R}^{C^{(2)}}$. We postpone describing the aggregation scheme to the following paragraph. In general, we write $\mathcal{I}^{(m)} := \{\, \boldsymbol{i} = (i_1, i_2, \ldots, i_m) \mid i_\ell \in \{0, 1, 2, 3\} \text{ for } \ell = 1, \ldots, m \,\}$ as the index set of the $m$-th level tokens, and $\mathcal{I}^{(r)}$ for $r \geq 1$ denotes the index set of the finest-level tokens. Note that the hierarchy is not restricted to the quadtree setting.
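Purely as an illustration of the quadtree bookkeeping (not the paper's learned aggregation, which is described later), the parent-child relation above can be realized by reshaping a grid of patch tokens and pooling each $2 \times 2$ block of children; the mean pooling below is a hypothetical stand-in.

```python
import torch

def coarsen_tokens(fine_tokens: torch.Tensor) -> torch.Tensor:
    """Aggregate level-m tokens into level-(m-1) tokens on a quadtree.

    fine_tokens: (H, W, C) grid of patch features at the finer level,
                 with H and W even so that each 2x2 block shares a parent.
    Returns a (H//2, W//2, C) grid of parent features.  A plain mean over
    the four children is used here as a stand-in for HANO's learned
    aggregation scheme.
    """
    H, W, C = fine_tokens.shape
    blocks = fine_tokens.reshape(H // 2, 2, W // 2, 2, C)
    return blocks.mean(dim=(1, 3))

# Example: the 8x8 finest level I^(3) of the text, coarsened twice
f3 = torch.randn(8, 8, 16)   # 64 tokens, feature dimension C^(3) = 16
f2 = coarsen_tokens(f3)      # 4x4 grid, corresponds to I^(2)
f1 = coarsen_tokens(f2)      # 2x2 grid, corresponds to I^(1)
```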
Figure 2: Hierarchical discretization and index tree. The 2D unit square is discretized hierarchically into three levels with corresponding index sets $\mathcal{I}^{(1)}$, $\mathcal{I}^{(2)}$, and $\mathcal{I}^{(3)}$. To illustrate, $(1)^{(1,2)}$ represents the second-level child nodes of node $(1)$ and is defined as $(1)^{(1,2)} = \{(1,0), (1,1), (1,2), (1,3)\}$.
2.2. Vanilla Attention Mechanism
In this section, we first revisit the vanilla scaled dot-product attention mechanism for a single-level discretization. Without loss of generality, we consider the finest-level tokens $\boldsymbol{f}^{(r)}_{\boldsymbol{i}} \in \mathbb{R}^{C^{(r)}}$, indexed by $\boldsymbol{i} \in \mathcal{I}^{(r)}$. The token aggregation formula on this level can then be expressed as:
$$\text{atten}: \quad \boldsymbol{h}^{(r)}_{\boldsymbol{i}} = \sum_{\boldsymbol{j} \in \mathcal{I}^{(r)}} G\big(\boldsymbol{q}^{(r)}_{\boldsymbol{i}}, \boldsymbol{k}^{(r)}_{\boldsymbol{j}}\big)\, \boldsymbol{v}^{(r)}_{\boldsymbol{j}}, \qquad (1)$$
where $\boldsymbol{q}^{(r)}_{\boldsymbol{i}} = W_Q \boldsymbol{f}^{(r)}_{\boldsymbol{i}}$, $\boldsymbol{k}^{(r)}_{\boldsymbol{i}} = W_K \boldsymbol{f}^{(r)}_{\boldsymbol{i}}$, $\boldsymbol{v}^{(r)}_{\boldsymbol{i}} = W_V \boldsymbol{f}^{(r)}_{\boldsymbol{i}}$, and $W_Q, W_K, W_V \in \mathbb{R}^{C^{(r)} \times C^{(r)}}$ are learnable matrices. Here, for simplicity, we use the function $G$ to represent a pairwise interaction between queries and keys in the self-attention mechanism. Note that in the conventional self-attention mechanism [30], the pairwise interaction potential is defined as follows:
$$G\big(\boldsymbol{q}^{(r)}_{\boldsymbol{i}}, \boldsymbol{k}^{(r)}_{\boldsymbol{j}}\big) := \exp\Big(\boldsymbol{q}^{(r)}_{\boldsymbol{i}} \cdot \boldsymbol{k}^{(r)}_{\boldsymbol{j}} \big/ \sqrt{C^{(r)}}\Big) \qquad (2)$$
and further normalized to have row sum 1, i.e., the softmax function is applied row-wise to the matrix whose $(\boldsymbol{i}, \boldsymbol{j})$-entry is $\boldsymbol{q}^{(r)}_{\boldsymbol{i}} \cdot \boldsymbol{k}^{(r)}_{\boldsymbol{j}}$. Note that the $1/\sqrt{C^{(r)}}$ factor is optional and can be set to 1 instead. To be more specific, the vanilla self-attention is finally defined by
$$\text{vanilla atten}: \quad \boldsymbol{h}^{(r)}_{\boldsymbol{i}} = \sum_{\boldsymbol{j} \in \mathcal{I}^{(r)}} \frac{G\big(\boldsymbol{q}^{(r)}_{\boldsymbol{i}}, \boldsymbol{k}^{(r)}_{\boldsymbol{j}}\big)}{\sum_{\boldsymbol{j}' \in \mathcal{I}^{(r)}} G\big(\boldsymbol{q}^{(r)}_{\boldsymbol{i}}, \boldsymbol{k}^{(r)}_{\boldsymbol{j}'}\big)}\, \boldsymbol{v}^{(r)}_{\boldsymbol{j}}. \qquad (3)$$
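For reference, equations (1)-(3) admit a direct implementation; the sketch below is a minimal single-head version in PyTorch with randomly initialized stand-ins for the learned matrices $W_Q$, $W_K$, $W_V$, and it makes the $O(N^2)$ cost in the number of tokens explicit.

```python
import torch

def vanilla_attention(f, W_Q, W_K, W_V):
    """Vanilla self-attention of Eqs. (1)-(3) on a single level.

    f: (N, C) tokens f^(r)_i;  W_Q, W_K, W_V: (C, C) learnable matrices.
    Returns h: (N, C) with h_i = sum_j softmax_j(q_i . k_j / sqrt(C)) v_j.
    """
    N, C = f.shape
    q, k, v = f @ W_Q.T, f @ W_K.T, f @ W_V.T   # q_i = W_Q f_i, etc.
    scores = q @ k.T / C**0.5                   # (N, N) pairwise q_i . k_j / sqrt(C)
    G = torch.softmax(scores, dim=-1)           # row-normalized interaction, Eq. (3)
    return G @ v                                # O(N^2) cost in the token count

# Example: 64 finest-level tokens with C^(r) = 16 channels
C = 16
f = torch.randn(64, C)
W_Q, W_K, W_V = (torch.randn(C, C) for _ in range(3))
h = vanilla_attention(f, W_Q, W_K, W_V)         # (64, 16)
```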
2.3. Hierarchical attention
In this section, we present HANO in Algorithm 1, a hierarchically nested attention scheme with $O(N)$ cost inspired by $\mathcal{H}^2$ matrices [66], which is much more efficient than the vanilla attention above that scales as $O(N^2)$. The overall HANO scheme (see Figure 3 for a three-level example) resembles the V-cycle operations in multigrid methods, and it comprises four key operations: