END-TO-END GPU ACCELERATION OF LOW-ORDER-REFINED PRECONDITIONING FOR HIGH-ORDER FINITE ELEMENT DISCRETIZATIONS WILL PAZNER12 TZANIO KOLEV2 AND JEANSYLVAIN CAMIER2

2025-05-06 0 0 3.88MB 23 页 10玖币

侵权投诉

END-TO-END GPU ACCELERATION OF LOW-ORDER-REFINED

PRECONDITIONING FOR HIGH-ORDER FINITE ELEMENT DISCRETIZATIONS

WILL PAZNER1,2, TZANIO KOLEV2, AND JEAN–SYLVAIN CAMIER2

Abstract. In this paper, we present algorithms and implementations for the end-to-end GPU accelera-

tion of matrix-free low-order-reﬁned preconditioning of high-order ﬁnite element problems. The methods

described here allow for the construction of eﬀective preconditioners for high-order problems with opti-

mal memory usage and computational complexity. The preconditioners are based on the construction of

a spectrally equivalent low-order discretization on a reﬁned mesh, which is then amenable to, for exam-

ple, algebraic multigrid preconditioning. The constants of equivalence are independent of mesh size and

polynomial degree. For vector ﬁnite element problems in H(curl) and H(div) (e.g. for electromagnetic or

radiation diﬀusion problems) a specially constructed interpolation–histopolation basis is used to ensure fast

convergence. Detailed performance studies are carried out to analyze the eﬃciency of the GPU algorithms.

The kernel throughput of each of the main algorithmic components is measured, and the strong and weak

parallel scalability of the methods is demonstrated. The diﬀerent relative weighting and signiﬁcance of the

algorithmic components on GPUs and CPUs is discussed. Results on problems involving adaptively reﬁned

nonconforming meshes are shown, and the use of the preconditioners on a large-scale magnetic diﬀusion

problem using all spaces of the ﬁnite element de Rham complex is illustrated.

1. Introduction

High-order ﬁnite element methods provide highly accurate solutions for problems with complex geometry

using unstructured meshes, while simultaneously achieving eﬃciency and scalability on modern supercom-

puting platforms, in particular those with accelerator- and GPU-based architectures [6, 23, 25]. While the

eﬃcient evaluation of high-order ﬁnite element operators using matrix-free methods and sum factorizations

(cf. [15, 29, 36]) is well-studied, the solution of the resulting large linear systems remains challenging, in

large part because the cost of assembling the system matrix of the high-order operator is prohibitive, both

in terms of memory requirements and computation time. In this work we describe the GPU acceleration and

high-performance implementation of low-order-reﬁned (LOR) preconditioning for high-order ﬁnite element

problems posed in H1,H(curl), and H(div) spaces. These preconditioners, also known as “FEM–SEM

preconditioners” (cf. [8, 9, 36]), are based on the idea of constructing an auxiliary, spectrally equivalent low-

order discretization, and then applying eﬀective preconditioners (for instance, algebraic multigrid methods)

constructed using the low-order matrix, directly to the high-order system.

There are a number of other approaches for the matrix-free preconditioning of high-order ﬁnite element

problems. One common approach is p-multigrid, which involves the construction of a hierarchy of p-coarsened

spaces [20, 44]. The coarsest space is typically a lowest-order space, which can be treated by means of a

standard preconditioner for low-order ﬁnite element problems, for example algebraic multigrid methods.

On meshes that permit h-coarsening (for example, meshes that result from the successive reﬁnement of an

initial coarse mesh), geometric multigrid methods can also be used for matrix-free preconditioning. These

two approaches can be combined in a relatively ﬂexible manner to obtain hp-coarsened hierarchies. In

all of these methods, a smoother is required at each level of the hierarchy; typically Jacobi or Chebyshev

smoothing is performed (where the diagonal of the matrix is obtained without full matrix assembly using

sum factorization, cf. [43]); more sophisticated smoothers such as overlapping Schwarz may also be used [16].

Combinations of these techniques with low-order-reﬁned preconditioning is also possible, see [38]. Detailed

performance comparisons of matrix-free methods for high-order discretizations are available in [24, 30, 31,

41], among others.

In this work, we develop algorithms and software implementations to eﬀectively leverage GPU-based

computing architectures for the entire low-order-reﬁned preconditioning algorithm. This includes the parallel

1Fariborz Maseeh Department of Mathematics and Statistics, Portland State University

2Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

arXiv:2210.12253v1 [cs.MS] 21 Oct 2022

2 PAZNER, KOLEV, AND CAMIER

assembly of the auxiliary low-order system matrices, the construction and application of algebraic multigrid

preconditioners, and the matrix-free application of the high-order operator in the context of a preconditioned

conjugate gradient iteration. For H(curl) and H(div) problems, we use the AMS [26] and ADS [27]

preconditioners, respectively, whose construction requires additional problem-speciﬁc inputs including mesh

coordinates and discrete diﬀerential operators in matrix form; these inputs are also constructed on-device. We

provide comprehensive performance studies of these algorithms, measuring runtimes, kernel throughput, and

strong and weak parallel scalability of all components of the solution algorithm. Additionally, we compare

the relative weightings of the algorithmic components on GPU and CPU, where our results show that the

computational bottlenecks for GPU-accelerated solvers are signiﬁcantly diﬀerent than for CPU solvers. In

particular, solver setup is relatively more expensive (as a fraction of total time to solution) on GPU compared

with CPU; however, the LOR matrix assembly algorithms proposed presently ensure that the assembly of

the low-order-reﬁned matrix is not a bottleneck.

The ﬁrst work on LOR preconditioning dates to 1980, when Orszag proposed using second-order ﬁnite

diﬀerence methods to precondition certain spectral collocation methods [36]. Since then, there has been sig-

niﬁcant work extending this concept, making use of low-order ﬁnite element discretizations to precondition

spectral and spectral element methods [7, 8, 11, 16]. The low-order discretization is often in turn precondi-

tioned using multigrid, in particular algebraic or semi-structured multigrid methods [5, 38]. Although these

multigrid-based methods typically perform well, an attractive feature of LOR preconditioning is that any

eﬀective preconditioner for the low-order system can be used to precondition the high-order system. LOR

preconditioning has been successfully applied to spectral element and high-order ﬁnite element discretiza-

tions of the incompressible Navier–Stokes equations [17, 18], and has been extended to other discretizations,

including discontinuous Galerkin methods with hp-reﬁnement [39]. Recently, spectrally equivalent LOR

preconditioners for Maxwell and grad-div problems in H(curl) and H(div) were developed [12, 40].

A novel aspect of this work is the use of a macro-element batching strategy for the assembly of the

low-order-reﬁned system matrices, and the associated algorithms and data structures. There is signiﬁcant

existing work in the literature on GPU-accelerated algorithms for the matrix-free evaluation of high-order

operators [1, 18, 32]; these algorithms have been a primary focus of the CEED project targeting exascale

architectures [24]. Similarly, GPU-accelerated algebraic multigrid algorithms have been developed as part of

the hypre [13, 14] and AmgX [35] libraries, among others. However, the eﬃcient assembly of the low-order-

reﬁned matrices on GPUs is a topic that has not been addressed in these previous works. The algorithms

and GPU implementations described presently are shown to signiﬁcantly outperform more generic matrix

assembly approaches that do not take advantage of the macro-element-level regular structure and topology

of the low-order-reﬁned discretizations.

The structure of this paper is as follows. In Section 2, we give an overview of the LOR preconditioning

approach for problems in H1,H(curl), and H(div), including on nonconforming meshes resulting from

adaptive reﬁnement. In Section 3, we describe the GPU algorithms for each of the algorithmic steps in the

problem setup and the solution procedure. In Section 4, we provide several detailed performance studies of

the proposed methods, algorithms, and implementations. The results include kernel-level throughput bench-

marks, parallel scaling studies, and the illustration of these techniques on a more realistic and challenging

large-scale electromagnetics problem. Open-source code availability and a brief discussion of the software

interface is provided in Section 5. We end with conclusions in Section 6.

2. Low-order-refined preconditioning

In this section, we give a brief overview of low-order-reﬁned preconditioning for high-order ﬁnite element

problems. We begin by deﬁning the low-order-reﬁned mesh in Section 2.1. Low-order preconditioning for H1-

conforming discretizations of the Poisson problem is described in Section 2.2. Preconditioning of problems

in H(curl) and H(div) are discussed in Section 2.3.

2.1. The low-order-reﬁned mesh. Let Ω ⊆Rddenote the spatial domain, d∈ {1,2,3}. The domain is

discretized using a mesh Tpof tensor-product elements, such that each element κ∈ Tpis given by the image

of the reference element bκ= [−1,1]dunder a (typically isoparametric) transformation, i.e. κ=Tκ(bκ). Let

bxi∈[−1,1] denote the p+ 1 Gauss–Lobatto points. The d-fold Cartesian product of the points {bxi}is used

as a set of Lagrange interpolation points in the reference element bκ. These points are also used to deﬁne the

low-order-reﬁned mesh Th. Each element κ∈ Tpis subdivided into pdtopologically Cartesian subelements,

GPU-ACCELERATED LOR PRECONDITIONING 3

Figure 1. Top row: examples of high-order (coarse) meshes with p= 9 Gauss–Lobatto

nodes. Bottom row: corresponding low-order-reﬁned meshes.

whose vertices are given by the image of Gauss–Lobatto points under the element transformation Tκ. Some

examples of high-order (coarse) meshes Tpand the corresponding low-order-reﬁned meshes Thare given in

Figure 1

2.2. Poisson problem. Consider the model Poisson problem

(1) −∆u=fin Ω,

u=gDon ∂Ω.

To discretize this problem, deﬁne the H1-conforming degree-pﬁnite element space Vp,

Vp={v∈H1(Ω) : v|κ◦Tκ∈ Qp(bκ)},

where Qpis the space of multivariate polynomials of at most degree pin each variable. A nodal basis for Qp

is given by the Lagrange interpolating polynomials deﬁned at the Cartesian product of the Gauss–Lobatto

quadrature points, such that the degrees of freedom of the space Vpare point values at the Gauss–Lobatto

nodes. The standard bilinear form AVp= (∇u, ∇v) for u, v ∈Vpgives rise to the high-order stiﬀness

matrix AVp. In general, the total number of nonzeros in this matrix will scale like O(nep2d) where neis

the number of elements in the mesh, and the number of operations required to assemble this matrix ranges

from O(nep2d+1) to O(nep3d), depending on the algorithm used [33]. This memory and computational costs

are prohibitive for moderate to large values of p, and so instead a matrix-free approach is adopted, where

the action of AVpis computed without constructing the matrix. The action can be computed using sum

factorization with optimal O(nepd) memory usage and O(nepd+1) operations [36]; this matrix-free operator

evaluation approach is the key component of the high-performance ﬁnite element libraries software library

libCEED [6] and also provides the foundation for MFEM’s partial assembly algorithm [2].

Because of the prohibitive assembly costs, the matrix AVpis unavailable for preconditioner construction.

Instead, we construct an auxiliary low-order discretization matrix that is spectrally equivalent to the high-

order matrix. The low-order-reﬁned space Vhis deﬁned to be the p= 1 H1-conforming space on the low-

order-reﬁned mesh Th, whose construction is described in Section 2.1. The degrees of freedom of Vhcoincide

with the degrees of freedom of the high-order space Vp. In both cases, the degrees of freedom represent

point values at the Gauss–Lobatto points; in the high-order case, the associated basis functions are degree-p

polynomials deﬁned on the elements of the coarse mesh Tp, and in the low-order case, the associated basis

functions are the standard “hat functions” deﬁned on the reﬁned mesh Th. Given the low-order space Vh,

it is possible to assemble the low-order-reﬁned stiﬀness matrix AVh. We brieﬂy summarize some important

properties of the matrices AVhand AVp.

4 PAZNER, KOLEV, AND CAMIER

•Since the high-order and low-order-reﬁned degrees of freedom coincide, AVhand AVpare matrices of

the same size.

•The low-order matrix AVhhas O(1) nonzeros per row. On the other hand, the high-order matrix

AVphas O(p2d) nonzeros per row.

•AVhand AVpare spectrally equivalent, independent of p. This spectral equivalence is often known

as FEM–SEM equivalence, and was proven in [7] and [37].

Since AVhand AVpare spectrally equivalent, any uniform preconditioner Bhof AVh(i.e. such that the

condition number of BhAVhis bounded independently of problem size and other discretization parameters)

will also be an eﬀective preconditioner for AVp. In this work, we take Bhto be one V-cycle of algebraic

multigrid (AMG) constructed using the assembled matrix AVh. Other choices for Bhare possible, including

geometric or semi-structured multigrid, domain decomposition, or even sparse direct solvers, but in this

work we restrict ourselves to AMG methods. Note that the same considerations regarding sparsity also

apply to the high-order and low-order-reﬁned mass matrices (denoted MVpand MVh, respectively). These

mass matrices are spectrally equivalent (with constants of equivalence independent of polynomial degree

p), and so low-order-reﬁned preconditioning can be also applied to problems involving non-negative linear

combinations of the mass and stiﬀness matrices.

2.3. Vector ﬁnite elements. While the approach described above can be used to obtain spectrally equiv-

alent low-order-reﬁned preconditioners using nodal H1discretizations for Poisson problems, the situation is

more delicate for Maxwell and grad-div problems in H(curl) and H(div). In these spaces, nodal bases using

Gauss–Lobatto or Gauss–Legendre points do not give rise to spectrally equivalent low-order discretizations,

and as a result, the condition number of the preconditioned system grows rapidly, leading to large iteration

counts.

Instead, an interpolation–histopolation basis for the high-order H(curl) and H(div) spaces must be used

[12, 40]. The resulting basis functions are piecewise polynomials that take on prescribed mean values over

subcell edges (in the case of H(curl)) or subcell faces (in the case of H(div)). These basis functions were

introduced in [28] in the context of mimetic methods. When this basis is used, spectral equivalence of the

high-order and low-order-reﬁned stiﬀness matrices (and mass matrices) are recovered for vector ﬁnite element

spaces. The construction of low-order-reﬁned preconditioners for this case is discussed at length in [40]. We

note that the lowest-order case of the interpolation–histopolation bases reduces exactly to the standard

lowest-order N´ed´elec and Raviart–Thomas elements. As in the case of the Poisson problem, any eﬀective

preconditioner constructed using the low-order-reﬁned system matrix can be used to eﬀectively precondition

the high-order system. However, even in the low-order case, the construction of preconditioners for these

problems is more challenging. Typically, multigrid methods with specialized smoothers [3] or auxiliary space

methods [22] are required to achieve good convergence. Analogous to the use of algebraic multigrid for the

Poisson problem, in this work we focus on the use of auxiliary space AMG methods for these problems;

in particular, we use the AMG-based algebraic Maxwell solver (AMS) for H(curl) problems [26], and the

auxiliary divergence solver (ADS) for H(div) problems [27]. These solvers are based on combining auxiliary

space methods with algebraic multigrid, and are designed to work as black-box methods, requiring relatively

little discretization information beyond the assembled system matrix. The AMS H(curl) solver requires

two additional inputs: a vector of mesh vertex coordinates (used to construct the interpolation operator

mapping the vector H1to the N´ed´elec space; this interpolation matrix can also be provided directly), and

a discrete gradient matrix. In addition to the discrete gradient and interpolation, the ADS H(div) solver

additionally requires a discrete curl matrix, mapping from H(curl) into H(div). The construction of these

discrete matrices is also discussed in the following sections.

2.4. Adaptive mesh reﬁnement and hanging node constraints. In the case of meshes with hanging

nodes (e.g. those resulting from nonconforming adaptive mesh reﬁnement), the construction of the appro-

priate low-order-reﬁned discretization is less obvious. In this case, the reﬁnement procedure described in

Section 2.1 to obtain the low-order-reﬁned mesh results in non-matching interfaces that do not directly lend

themselves to the construction of spectrally equivalent low-order discretizations, see Figure 2.

In [39], an approach for constructing low-order-reﬁned preconditioners for discretizations posed on meshes

with hanging nodes based on variational restriction was proposed. In this approach, the high-order system

GPU-ACCELERATED LOR PRECONDITIONING 5

(a) (b) (c) (d)

(a) Mesh with three elements and one hanging node.

(b) Low-order-reﬁned mesh corresponding to p= 5.

(d) Low-order-reﬁned mesh corresponding to p= 3.

Figure 2. Illustration of nonconforming meshes with hanging nodes and their low-order-

reﬁned counterparts.

is written as

(2) Ap= ΛTb

ApΛ,

where Λ is the matrix that enforces the constraints at element interfaces, and b

Apis a block-diagonal matrix

acting on vectors of unconstrained (duplicated) degrees of freedom with element matrices b

Ap,i on the diagonal.

The matrix Λ is often known as an assembly matrix; in the case where the mesh contains no hanging nodes

(and without p-reﬁnement), the matrix Λ is a boolean matrix encoding the local (elementwise) to global

DOF mapping. When the mesh contains hanging nodes, constraints must be enforced at the nonconforming

interfaces, resulting in a more complicated Λ matrix; see also [10] for more details.

The low-order-reﬁned counterpart b

Ah,i of each element matrix b

Ap,i can be assembled to obtain a block-

diagonal matrix b

Ah. The diagonal blocks of b

Ahare sparse, whereas the blocks of b

Apare in general dense.

Then, a low-order-reﬁned preconditioner can be formed by

(3) Ah= ΛTb

AhΛ,

where the same matrix Λ is used in (2) and (3). In the case of conforming meshes, the matrix Ahdeﬁned by (3)

is identical to that obtained through the process described in Section 2.2. When the mesh is nonconforming,

a simple variational argument shows that Ahis spectrally equivalent to Ap, and therefore can be used as a

preconditioner. The implementation of such preconditioners requires only the construction of the constraint

matrix Λ, and the computation of the triple product in (3). Then, AMG or other algebraic preconditioners

can be constructed using the assembled matrix Ah, and used to precondition the high-order system deﬁned

by Ap.

3. GPU algorithms

The LOR-based solution procedure can broadly be divided into two phases. The setup phase, performed

before the beginning of the conjugate gradient iteration, consists of problem setup, matrix assembly, and

construction of the algebraic multigrid preconditioner. The solve phase consists of the preconditioned con-

jugate gradient iteration, which at every step requires the matrix-free application of the high-order operator

and the application of the algebraic multigrid V-cycle. In this section we describe the approach for GPU-

accelerated algorithms for each of these components. We summarize the algorithmic steps that make up the

solution algorithm:

S1. Problem setup:

S1.1. High-order operator setup;

S1.2. Low-order-reﬁned matrix assembly;

S1.3. Algebraic multigrid setup.

S2. Preconditioned conjugate gradient, including at every iteration:

S2.1. High-order operator evaluation;

S2.2. Algebraic multigrid V-cycle.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

END-TO-ENDGPUACCELERATIONOFLOW-ORDER-REFINEDPRECONDITIONINGFORHIGH-ORDERFINITEELEMENTDISCRETIZATIONSWILLPAZNER1;2,TZANIOKOLEV2,ANDJEAN{SYLVAINCAMIER2Abstract.Inthispaper,wepresentalgorithmsandimplementationsfortheend-to-endGPUaccelera-tionofmatrix-freelow-order-renedpreconditioningofhigh-ordernite...

展开>> 收起<<

END-TO-END GPU ACCELERATION OF LOW-ORDER-REFINED PRECONDITIONING FOR HIGH-ORDER FINITE ELEMENT DISCRETIZATIONS WILL PAZNER12 TZANIO KOLEV2 AND JEANSYLVAIN CAMIER2.pdf

共23页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

END-TO-END GPU ACCELERATION OF LOW-ORDER-REFINED PRECONDITIONING FOR HIGH-ORDER FINITE ELEMENT DISCRETIZATIONS WILL PAZNER12 TZANIO KOLEV2 AND JEANSYLVAIN CAMIER2

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: