Mitigating spectral bias for the multiscale operator learning

Xinliang Liu^a, Bo Xu^b, Shuhao Cao^c, Lei Zhang^d
^a Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
^b School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
^c School of Science and Engineering, University of Missouri-Kansas City, Kansas City, MO 64110, United States
^d School of Mathematical Sciences, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, Shanghai, 200240, China
Abstract
Neural operators have emerged as a powerful tool for learning the mapping between infinite-
dimensional parameter and solution spaces of partial differential equations (PDEs). In this work,
we focus on multiscale PDEs that have important applications such as reservoir modeling and
turbulence prediction. We demonstrate that for such PDEs, the spectral bias towards low-frequency
components presents a significant challenge for existing neural operators. To address this challenge,
we propose a hierarchical attention neural operator (HANO) inspired by the hierarchical matrix
approach. HANO features a scale-adaptive interaction range and self-attentions over a hierarchy of
levels, enabling nested feature computation with controllable linear cost and encoding/decoding of
multiscale solution space. We also incorporate an empirical $H^1$ loss function to enhance the learning
of high-frequency components. Our numerical experiments demonstrate that HANO outperforms
state-of-the-art (SOTA) methods for representative multiscale problems.
Keywords: partial differential equations, operator learning, transformer, multiscale PDE
1. Introduction
In recent years, operator learning methods have emerged as powerful tools for computing
parameter-to-solution maps of partial differential equations (PDEs). In this paper, we focus on the
operator learning for multiscale PDEs (MsPDEs) that encompass multiple temporal/spatial scales.
MsPDE models arise in applications involving heterogeneous and random media, and are crucial
for predicting complex phenomena such as reservoir modeling, atmospheric and ocean circulation,
and high-frequency scattering. Important prototypical examples include multiscale elliptic partial
differential equations, where the diffusion coefficients vary rapidly. The coefficients may be rapidly oscillatory, have a high contrast ratio, or even exhibit a continuum of non-separable scales.
MsPDEs, even with fixed parameters, present great challenges for classical numerical methods [1], as their computational cost typically scales in inverse proportion to the finest scale ε of the problem. To overcome this issue, multiscale solvers have been developed that incorporate microscopic information to achieve a computational cost independent of ε. One such technique is numerical homogenization [2, 3, 4, 5, 6, 7], which identifies low-dimensional approximation spaces adapted to the corresponding multiscale operator. Similarly, fast solvers such as multilevel/multigrid methods [8, 9] and wavelet-based multiresolution methods [10, 11] may face limitations when applied to multiscale PDEs [1], while multilevel methods based on numerical homogenization techniques, such as Gamblets
[12], have emerged as a way to discover scalable multilevel algorithms and operator-adapted wavelets for multiscale PDEs. Low-rank decomposition-based methods are another popular approach to exploit the low-dimensional nature of MsPDEs. Notable examples include the fast multipole method [13], hierarchical matrices ($\mathcal{H}$ and $\mathcal{H}^2$ matrices) [14], and the hierarchical interpolative factorization [15]. These methods can achieve (near-)linear scaling and high computational efficiency by exploiting low-rank approximations of the (elliptic) Green's function [16].
Neural operators, unlike traditional solvers that operate with fixed parameters, are capable of handling a range of input parameters, making them promising for data-driven forward and inverse solving of PDE problems. Pioneering works in operator learning include [17, 18, 19, 20]; nevertheless, they are limited to problems with fixed discretization sizes. Recently, infinite-dimensional operator learning has been studied, which learns the solution operator (mapping) between infinite-dimensional Banach spaces for PDEs. Most notably, the Deep Operator Network (DeepONet) [21] was proposed as a pioneering model to leverage deep neural networks' universal approximation for operators [22]. Taking advantage of the Fast Fourier Transform (FFT), the Fourier Neural Operator (FNO) [23] constructs a learnable parametrized kernel in the frequency domain to render the convolutions in the solution operator more efficient. Other developments include the multiwavelet extension of FNO [24], message-passing neural operators [25], dimension reduction in the latent space [26], Gaussian processes [27], Clifford algebra-inspired neural layers [28], and dilated convolutional residual networks [29].
Attention neural architectures, popularized by the Transformer deep neural network [30], have emerged as universal backbones in deep learning. These architectures serve as the foundation for numerous state-of-the-art models, including GPT [31], Vision Transformer (ViT) [32], and diffusion models [33, 34]. More recently, Transformers have been studied and have become increasingly popular in PDE operator learning problems, e.g., in [35, 36, 37, 38, 39, 40, 41] and many others. Attention architectures offer several advantages. Attention can be viewed as a parametrized, instance-dependent kernel integral that learns a “basis” [35], similar to those in numerical homogenization; see also the exposition featured in neural operators [42]. This layerwise latent updating resembles the learned “basis” in DeepONet [39], or a frame [43]. It is also flexible enough to encode non-uniform geometries in the latent space [44]. In [45, 46], advanced Transformer architectures (ViT) and diffusion models are combined with the neural operator framework. In [47], Transformers are combined with reduced-order modeling to accelerate fluid simulations of turbulent flows. In [48], tensor decomposition techniques are employed to enhance the efficiency of attention mechanisms for solving high-dimensional PDE problems.
Under certain circumstances, these data-driven operator learning models can surpass classical numerical methods in efficiency or even in accuracy. For instance, full waveform inversion is considered in [49] with a fusion model of FNO and DeepONet (Fourier-DeepONet); direct-method-inspired DNNs applied to boundary value Calderón problems achieve much more accurate reconstructions with the help of data [50, 51, 52]; in [53], the capacity of FNO to take significantly larger time steps for spatiotemporal PDEs is exploited to infer wave packet scattering in quantum physics, achieving results that are orders of magnitude more efficient than a traditional implicit Euler marching scheme. [54] exploits the capacity of graph neural networks to accelerate particle-based simulations. [55] investigates the integration of the neural operator DeepONet with classical relaxation techniques, resulting in a hybrid iterative approach. Meanwhile, Wu et al. [56] introduce an asymptotic-preserving convolutional DeepONet designed to capture the diffusive characteristics of multiscale linear transport equations.
For multiscale PDEs, operator learning methods can be viewed as an advancement beyond
multiscale solvers such as numerical homogenization. Operator learning methods have two key
advantages: (1) They can be applied to an ensemble of coefficients/parameters, rather than a single
set of coefficients, which allows the methods to capture the stochastic behaviors of the coefficients; (2)
The decoder in the operator learning framework can be interpreted as a data-driven basis reduction procedure from the (high-dimensional) latent space that approximates the (often lower-dimensional) solution data manifold of the underlying PDEs. This procedure offers automated data adaptation
to the coefficients, enabling accurate representations of the solutions’ distributions. In contrast,
numerical homogenization typically relies on a priori bases that are not adapted to the ensemble of
coefficients. In this regard, the operator learning approach has the potential to yield more accurate
reduced-order models for multiscale PDEs with parametric/random coefficients.
However, for multiscale problems, current operator learning methods have primarily focused on representing the smooth parts of the solution space. This results in the so-called “spectral bias”, leaving the resolution of intrinsic multiscale features as a significant challenge. The spectral bias, also known as the frequency principle [57, 58, 59], states that deep neural networks (DNNs) often struggle to learn high-frequency components of functions that vary at multiple scales. In this regard, Fourier or wavelet-based methods are not always effective for MsPDEs, even for fixed parameters. Neural operators tend to fit low-frequency components faster than high-frequency ones, limiting their ability to accurately capture fine details. When the elliptic coefficients are smooth, the coefficient-to-solution map can be well resolved by the FNO parameterization [23]. Nevertheless, existing neural operators have difficulty learning high-frequency components of multiscale PDEs, as shown in Figure 1 and detailed in Section 3. While universal approximation theorems can be proven for FNO-type models (see, e.g., [60]), achieving a meaningful decay rate requires “extra smoothness”, which may be absent or lead to large constants for MsPDEs. For FNO, this issue was partially addressed in [61], yet the approach there needs an ad-hoc manual tweak on the weights for the chosen modes.
We note that for MsPDEs with fixed parameters, there has been increasing exploration of neural network methods in recent years, despite the spectral bias or frequency principle [57, 58, 59] indicating that deep neural networks (DNNs) often struggle to effectively capture high-frequency components of functions. Specifically designed neural solvers [62, 63, 64] have been developed to mitigate the spectral bias and accurately solve multiscale PDEs with fixed parameters.
Motivated by the aforementioned challenges, we investigate the spectral bias present in existing
neural operators. Inspired by conventional multilevel methods and numerical homogenization, we
propose a new Hierarchical Attention Neural Operator (HANO) architecture to mitigate it for
multiscale operator learning. We also test our model on standard operator learning benchmarks
including the Navier-Stokes equation in the turbulent regime and the Helmholtz equation in the high-wavenumber regime. Our main contributions can be summarized as follows:
• We introduce HANO, which decomposes the input-output mapping into hierarchical levels in an automated fashion and enables nested feature updates through hierarchical local aggregation of self-attentions with a controllable linear computational cost.
• We use an empirical $H^1$ loss function to further reduce the spectral bias and improve the ability to capture the oscillatory features of the multiscale solution space (a minimal sketch of such a loss is given after this list).
• We investigate the spectral bias in existing neural operators and empirically verify that HANO is able to mitigate it. HANO substantially improves accuracy, particularly for approximating derivatives, and generalization for multiscale tasks, compared with state-of-the-art neural operators and efficient attention/Transformer models.
Figure 1: (a) multiscale trigonometric coefficient; (b) slices of the derivative $u_y$ at $x = 0$; (c) absolute error spectrum of HANO in log10 scale; (d) absolute error spectrum of FNO in log10 scale. We illustrate the effectiveness of the HANO scheme on the challenging multiscale trigonometric benchmark, with the coefficients and corresponding solution derivative shown in (a) and (b); see Appendix 3.1.2 for the problem description. We notice that HANO can capture the solution derivatives more accurately, whereas FNO only captures their averaged or homogenized behavior. In (c) and (d), we analyze the error by decomposing it into the frequency domain $[-256\pi, 256\pi]^2$ and plotting the absolute error spectrum. This shows the spectral bias in the existing state-of-the-art model, and also that our method achieves superior performance in predicting fine-scale features, especially in accurately capturing derivatives. We refer readers to Figure 7 in Section 3.1 and Figures 7, 8, 9.
2. Methods
In this section, to address the spectral bias for multiscale operator learning, and motivated by the remarkable performance of attention-based models [30, 65] in computer vision and natural language processing tasks, as well as the effectiveness of the hierarchical matrix approach [14] for multiscale problems, we propose the Hierarchical Attention Neural Operator (HANO) model.
2.1. Operator Learning Problem
We follow the setup in [23, 21] to approximate the operator $\mathcal{S}: a \mapsto u := \mathcal{S}(a)$, with the input/parameter $a \in \mathcal{A}$ drawn from a distribution $\mu$ and the corresponding output/solution $u \in \mathcal{U}$, where $\mathcal{A}$ and $\mathcal{U}$ are infinite-dimensional Banach spaces, respectively. Our aim is to learn the operator $\mathcal{S}$ from a collection of finitely observed input-output pairs through a parametric map $\mathcal{N}: \mathcal{A} \times \Theta \to \mathcal{U}$ and a loss functional $\mathcal{L}: \mathcal{U} \times \mathcal{U} \to \mathbb{R}$, such that the optimal parameter
$$\theta^{\ast} = \arg\min_{\theta \in \Theta} \, \mathbb{E}_{a \sim \mu} \big[ \mathcal{L}(\mathcal{N}(a, \theta), \mathcal{S}(a)) \big].$$
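To make this setup concrete, the following is a minimal PyTorch-style sketch of the corresponding empirical risk minimization over finitely many observed pairs $(a_i, \mathcal{S}(a_i))$; `neural_op`, `loss_fn`, and `data_loader` are generic placeholders, not the actual HANO components.

```python
import torch

def train_operator(neural_op, loss_fn, data_loader, num_epochs=100, lr=1e-3):
    """Minimize the empirical risk E_{a~mu}[ L(N(a, theta), S(a)) ]
    over a finite sample of input/output pairs (a_i, u_i)."""
    optimizer = torch.optim.Adam(neural_op.parameters(), lr=lr)
    for epoch in range(num_epochs):
        for a, u in data_loader:        # a: discretized parameter, u: reference solution S(a)
            u_pred = neural_op(a)       # N(a, theta)
            loss = loss_fn(u_pred, u)   # loss functional L, e.g. a relative L2 or H1 loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return neural_op
```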
2.1.1. Hierarchical Discretization
To develop a hierarchical attention, we first assume that there is a hierarchical discretization of the spatial domain $D$. For an input feature map that is defined on a partition of $D$, for example, of resolution $8 \times 8$ patches, we define $\mathcal{I}^{(3)} := \{\, \boldsymbol{i} = (i_1, i_2, i_3) \mid i_1, i_2, i_3 \in \{0, 1, 2, 3\} \,\}$ as the finest-level index set, in which each index $\boldsymbol{i}$ corresponds to a patch token characterized by a feature vector $\boldsymbol{f}^{(3)}_{\boldsymbol{i}} \in \mathbb{R}^{C^{(3)}}$. For a token $\boldsymbol{i} = (i_1, i_2, i_3)$, its parent token $\boldsymbol{j} = (i_1, i_2)$ aggregates finer-level tokens (e.g., $(1,1)$ is the parent of $(1,1,0)$, $(1,1,1)$, $(1,1,2)$, $(1,1,3)$ in Figure 2) and is characterized by a feature vector $\boldsymbol{f}^{(2)}_{\boldsymbol{j}} \in \mathbb{R}^{C^{(2)}}$. We postpone describing the aggregation scheme to the following paragraph. In general, we write $\mathcal{I}^{(m)} := \{\, \boldsymbol{i} = (i_1, i_2, \ldots, i_m) \mid i_\ell \in \{0, 1, 2, 3\} \text{ for } \ell = 1, \ldots, m \,\}$ as the index set of the $m$-th level tokens, and $\mathcal{I}^{(r)}$ for $r \geq 1$ denotes the index set of the finest-level tokens. Note that the hierarchy is not restricted to the quadtree setting.
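Purely as an illustration of the quadtree bookkeeping (not the paper's learned aggregation, which is described later), the parent-child relation above can be realized by reshaping a grid of patch tokens and pooling each $2 \times 2$ block of children; the mean pooling below is a hypothetical stand-in.

```python
import torch

def coarsen_tokens(fine_tokens: torch.Tensor) -> torch.Tensor:
    """Aggregate level-m tokens into level-(m-1) tokens on a quadtree.

    fine_tokens: (H, W, C) grid of patch features at the finer level,
                 with H and W even so that each 2x2 block shares a parent.
    Returns a (H//2, W//2, C) grid of parent features.  A plain mean over
    the four children is used here as a stand-in for HANO's learned
    aggregation scheme.
    """
    H, W, C = fine_tokens.shape
    blocks = fine_tokens.reshape(H // 2, 2, W // 2, 2, C)
    return blocks.mean(dim=(1, 3))

# Example: the 8x8 finest level I^(3) of the text, coarsened twice
f3 = torch.randn(8, 8, 16)   # 64 tokens, feature dimension C^(3) = 16
f2 = coarsen_tokens(f3)      # 4x4 grid, corresponds to I^(2)
f1 = coarsen_tokens(f2)      # 2x2 grid, corresponds to I^(1)
```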
Figure 2: Hierarchical discretization and index tree. The 2D unit square is discretized hierarchically into three levels with corresponding index sets $\mathcal{I}^{(1)}$, $\mathcal{I}^{(2)}$, and $\mathcal{I}^{(3)}$. To illustrate, $(1)^{(1,2)}$ represents the second-level child nodes of node $(1)$ and is defined as $(1)^{(1,2)} = \{(1,0), (1,1), (1,2), (1,3)\}$.
2.2. Vanilla Attention Mechanism
In this section, we first revisit the vanilla scaled dot-product attention mechanism for a single-level discretization. Without loss of generality, we consider the finest-level tokens $\boldsymbol{f}^{(r)}_{\boldsymbol{i}} \in \mathbb{R}^{C^{(r)}}$, indexed by $\boldsymbol{i} \in \mathcal{I}^{(r)}$. The token aggregation formula on this level can then be expressed as:
$$\text{atten}: \quad \boldsymbol{h}^{(r)}_{\boldsymbol{i}} = \sum_{\boldsymbol{j} \in \mathcal{I}^{(r)}} G\big(\boldsymbol{q}^{(r)}_{\boldsymbol{i}}, \boldsymbol{k}^{(r)}_{\boldsymbol{j}}\big)\, \boldsymbol{v}^{(r)}_{\boldsymbol{j}}, \qquad (1)$$
where $\boldsymbol{q}^{(r)}_{\boldsymbol{i}} = W_Q \boldsymbol{f}^{(r)}_{\boldsymbol{i}}$, $\boldsymbol{k}^{(r)}_{\boldsymbol{i}} = W_K \boldsymbol{f}^{(r)}_{\boldsymbol{i}}$, $\boldsymbol{v}^{(r)}_{\boldsymbol{i}} = W_V \boldsymbol{f}^{(r)}_{\boldsymbol{i}}$, and $W_Q, W_K, W_V \in \mathbb{R}^{C^{(r)} \times C^{(r)}}$ are learnable matrices. Here, for simplicity, we use the function $G$ to represent a pairwise interaction between queries and keys in the self-attention mechanism. Note that in the conventional self-attention mechanism [30], the pairwise interaction potential is defined as follows:
$$G\big(\boldsymbol{q}^{(r)}_{\boldsymbol{i}}, \boldsymbol{k}^{(r)}_{\boldsymbol{j}}\big) := \exp\Big(\boldsymbol{q}^{(r)}_{\boldsymbol{i}} \cdot \boldsymbol{k}^{(r)}_{\boldsymbol{j}} \big/ \sqrt{C^{(r)}}\Big) \qquad (2)$$
and further normalized to have row sum 1, i.e., the softmax function is applied row-wise to the matrix whose $(\boldsymbol{i}, \boldsymbol{j})$-entry is $\boldsymbol{q}^{(r)}_{\boldsymbol{i}} \cdot \boldsymbol{k}^{(r)}_{\boldsymbol{j}}$. Note that the $1/\sqrt{C^{(r)}}$ factor is optional and can be set to 1 instead. To be more specific, the vanilla self-attention is finally defined by
$$\text{vanilla atten}: \quad \boldsymbol{h}^{(r)}_{\boldsymbol{i}} = \sum_{\boldsymbol{j} \in \mathcal{I}^{(r)}} \frac{G\big(\boldsymbol{q}^{(r)}_{\boldsymbol{i}}, \boldsymbol{k}^{(r)}_{\boldsymbol{j}}\big)}{\sum_{\boldsymbol{j}' \in \mathcal{I}^{(r)}} G\big(\boldsymbol{q}^{(r)}_{\boldsymbol{i}}, \boldsymbol{k}^{(r)}_{\boldsymbol{j}'}\big)}\, \boldsymbol{v}^{(r)}_{\boldsymbol{j}}. \qquad (3)$$
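For reference, equations (1)-(3) admit a direct implementation; the sketch below is a minimal single-head version in PyTorch with randomly initialized stand-ins for the learned matrices $W_Q$, $W_K$, $W_V$, and it makes the $O(N^2)$ cost in the number of tokens explicit.

```python
import torch

def vanilla_attention(f, W_Q, W_K, W_V):
    """Vanilla self-attention of Eqs. (1)-(3) on a single level.

    f: (N, C) tokens f^(r)_i;  W_Q, W_K, W_V: (C, C) learnable matrices.
    Returns h: (N, C) with h_i = sum_j softmax_j(q_i . k_j / sqrt(C)) v_j.
    """
    N, C = f.shape
    q, k, v = f @ W_Q.T, f @ W_K.T, f @ W_V.T   # q_i = W_Q f_i, etc.
    scores = q @ k.T / C**0.5                   # (N, N) pairwise q_i . k_j / sqrt(C)
    G = torch.softmax(scores, dim=-1)           # row-normalized interaction, Eq. (3)
    return G @ v                                # O(N^2) cost in the token count

# Example: 64 finest-level tokens with C^(r) = 16 channels
C = 16
f = torch.randn(64, C)
W_Q, W_K, W_V = (torch.randn(C, C) for _ in range(3))
h = vanilla_attention(f, W_Q, W_K, W_V)         # (64, 16)
```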
2.3. Hierarchical attention
In this section, we present HANO in Algorithm 1, a hierarchically nested attention scheme with $O(N)$ cost inspired by $\mathcal{H}^2$ matrices [66], which is much more efficient than the vanilla attention above that scales as $O(N^2)$. The overall HANO scheme (see Figure 3 for a three-level example) resembles the V-cycle operations in multigrid methods, and it comprises four key operations: