END-TO-END GPU ACCELERATION OF LOW-ORDER-REFINED PRECONDITIONING FOR HIGH-ORDER FINITE ELEMENT DISCRETIZATIONS WILL PAZNER12 TZANIO KOLEV2 AND JEANSYLVAIN CAMIER2

2025-05-06 0 0 3.88MB 23 页 10玖币
侵权投诉
END-TO-END GPU ACCELERATION OF LOW-ORDER-REFINED
PRECONDITIONING FOR HIGH-ORDER FINITE ELEMENT DISCRETIZATIONS
WILL PAZNER1,2, TZANIO KOLEV2, AND JEAN–SYLVAIN CAMIER2
Abstract. In this paper, we present algorithms and implementations for the end-to-end GPU accelera-
tion of matrix-free low-order-refined preconditioning of high-order finite element problems. The methods
described here allow for the construction of effective preconditioners for high-order problems with opti-
mal memory usage and computational complexity. The preconditioners are based on the construction of
a spectrally equivalent low-order discretization on a refined mesh, which is then amenable to, for exam-
ple, algebraic multigrid preconditioning. The constants of equivalence are independent of mesh size and
polynomial degree. For vector finite element problems in H(curl) and H(div) (e.g. for electromagnetic or
radiation diffusion problems) a specially constructed interpolation–histopolation basis is used to ensure fast
convergence. Detailed performance studies are carried out to analyze the efficiency of the GPU algorithms.
The kernel throughput of each of the main algorithmic components is measured, and the strong and weak
parallel scalability of the methods is demonstrated. The different relative weighting and significance of the
algorithmic components on GPUs and CPUs is discussed. Results on problems involving adaptively refined
nonconforming meshes are shown, and the use of the preconditioners on a large-scale magnetic diffusion
problem using all spaces of the finite element de Rham complex is illustrated.
1. Introduction
High-order finite element methods provide highly accurate solutions for problems with complex geometry
using unstructured meshes, while simultaneously achieving efficiency and scalability on modern supercom-
puting platforms, in particular those with accelerator- and GPU-based architectures [6, 23, 25]. While the
efficient evaluation of high-order finite element operators using matrix-free methods and sum factorizations
(cf. [15, 29, 36]) is well-studied, the solution of the resulting large linear systems remains challenging, in
large part because the cost of assembling the system matrix of the high-order operator is prohibitive, both
in terms of memory requirements and computation time. In this work we describe the GPU acceleration and
high-performance implementation of low-order-refined (LOR) preconditioning for high-order finite element
problems posed in H1,H(curl), and H(div) spaces. These preconditioners, also known as “FEM–SEM
preconditioners” (cf. [8, 9, 36]), are based on the idea of constructing an auxiliary, spectrally equivalent low-
order discretization, and then applying effective preconditioners (for instance, algebraic multigrid methods)
constructed using the low-order matrix, directly to the high-order system.
There are a number of other approaches for the matrix-free preconditioning of high-order finite element
problems. One common approach is p-multigrid, which involves the construction of a hierarchy of p-coarsened
spaces [20, 44]. The coarsest space is typically a lowest-order space, which can be treated by means of a
standard preconditioner for low-order finite element problems, for example algebraic multigrid methods.
On meshes that permit h-coarsening (for example, meshes that result from the successive refinement of an
initial coarse mesh), geometric multigrid methods can also be used for matrix-free preconditioning. These
two approaches can be combined in a relatively flexible manner to obtain hp-coarsened hierarchies. In
all of these methods, a smoother is required at each level of the hierarchy; typically Jacobi or Chebyshev
smoothing is performed (where the diagonal of the matrix is obtained without full matrix assembly using
sum factorization, cf. [43]); more sophisticated smoothers such as overlapping Schwarz may also be used [16].
Combinations of these techniques with low-order-refined preconditioning is also possible, see [38]. Detailed
performance comparisons of matrix-free methods for high-order discretizations are available in [24, 30, 31,
41], among others.
In this work, we develop algorithms and software implementations to effectively leverage GPU-based
computing architectures for the entire low-order-refined preconditioning algorithm. This includes the parallel
1Fariborz Maseeh Department of Mathematics and Statistics, Portland State University
2Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
1
arXiv:2210.12253v1 [cs.MS] 21 Oct 2022
2 PAZNER, KOLEV, AND CAMIER
assembly of the auxiliary low-order system matrices, the construction and application of algebraic multigrid
preconditioners, and the matrix-free application of the high-order operator in the context of a preconditioned
conjugate gradient iteration. For H(curl) and H(div) problems, we use the AMS [26] and ADS [27]
preconditioners, respectively, whose construction requires additional problem-specific inputs including mesh
coordinates and discrete differential operators in matrix form; these inputs are also constructed on-device. We
provide comprehensive performance studies of these algorithms, measuring runtimes, kernel throughput, and
strong and weak parallel scalability of all components of the solution algorithm. Additionally, we compare
the relative weightings of the algorithmic components on GPU and CPU, where our results show that the
computational bottlenecks for GPU-accelerated solvers are significantly different than for CPU solvers. In
particular, solver setup is relatively more expensive (as a fraction of total time to solution) on GPU compared
with CPU; however, the LOR matrix assembly algorithms proposed presently ensure that the assembly of
the low-order-refined matrix is not a bottleneck.
The first work on LOR preconditioning dates to 1980, when Orszag proposed using second-order finite
difference methods to precondition certain spectral collocation methods [36]. Since then, there has been sig-
nificant work extending this concept, making use of low-order finite element discretizations to precondition
spectral and spectral element methods [7, 8, 11, 16]. The low-order discretization is often in turn precondi-
tioned using multigrid, in particular algebraic or semi-structured multigrid methods [5, 38]. Although these
multigrid-based methods typically perform well, an attractive feature of LOR preconditioning is that any
effective preconditioner for the low-order system can be used to precondition the high-order system. LOR
preconditioning has been successfully applied to spectral element and high-order finite element discretiza-
tions of the incompressible Navier–Stokes equations [17, 18], and has been extended to other discretizations,
including discontinuous Galerkin methods with hp-refinement [39]. Recently, spectrally equivalent LOR
preconditioners for Maxwell and grad-div problems in H(curl) and H(div) were developed [12, 40].
A novel aspect of this work is the use of a macro-element batching strategy for the assembly of the
low-order-refined system matrices, and the associated algorithms and data structures. There is significant
existing work in the literature on GPU-accelerated algorithms for the matrix-free evaluation of high-order
operators [1, 18, 32]; these algorithms have been a primary focus of the CEED project targeting exascale
architectures [24]. Similarly, GPU-accelerated algebraic multigrid algorithms have been developed as part of
the hypre [13, 14] and AmgX [35] libraries, among others. However, the efficient assembly of the low-order-
refined matrices on GPUs is a topic that has not been addressed in these previous works. The algorithms
and GPU implementations described presently are shown to significantly outperform more generic matrix
assembly approaches that do not take advantage of the macro-element-level regular structure and topology
of the low-order-refined discretizations.
The structure of this paper is as follows. In Section 2, we give an overview of the LOR preconditioning
approach for problems in H1,H(curl), and H(div), including on nonconforming meshes resulting from
adaptive refinement. In Section 3, we describe the GPU algorithms for each of the algorithmic steps in the
problem setup and the solution procedure. In Section 4, we provide several detailed performance studies of
the proposed methods, algorithms, and implementations. The results include kernel-level throughput bench-
marks, parallel scaling studies, and the illustration of these techniques on a more realistic and challenging
large-scale electromagnetics problem. Open-source code availability and a brief discussion of the software
interface is provided in Section 5. We end with conclusions in Section 6.
2. Low-order-refined preconditioning
In this section, we give a brief overview of low-order-refined preconditioning for high-order finite element
problems. We begin by defining the low-order-refined mesh in Section 2.1. Low-order preconditioning for H1-
conforming discretizations of the Poisson problem is described in Section 2.2. Preconditioning of problems
in H(curl) and H(div) are discussed in Section 2.3.
2.1. The low-order-refined mesh. Let Rddenote the spatial domain, d∈ {1,2,3}. The domain is
discretized using a mesh Tpof tensor-product elements, such that each element κ∈ Tpis given by the image
of the reference element bκ= [1,1]dunder a (typically isoparametric) transformation, i.e. κ=Tκ(bκ). Let
bxi[1,1] denote the p+ 1 Gauss–Lobatto points. The d-fold Cartesian product of the points {bxi}is used
as a set of Lagrange interpolation points in the reference element bκ. These points are also used to define the
low-order-refined mesh Th. Each element κ∈ Tpis subdivided into pdtopologically Cartesian subelements,
GPU-ACCELERATED LOR PRECONDITIONING 3
Figure 1. Top row: examples of high-order (coarse) meshes with p= 9 Gauss–Lobatto
nodes. Bottom row: corresponding low-order-refined meshes.
whose vertices are given by the image of Gauss–Lobatto points under the element transformation Tκ. Some
examples of high-order (coarse) meshes Tpand the corresponding low-order-refined meshes Thare given in
Figure 1
2.2. Poisson problem. Consider the model Poisson problem
(1) u=fin Ω,
u=gDon Ω.
To discretize this problem, define the H1-conforming degree-pfinite element space Vp,
Vp={vH1(Ω) : v|κTκ∈ Qp(bκ)},
where Qpis the space of multivariate polynomials of at most degree pin each variable. A nodal basis for Qp
is given by the Lagrange interpolating polynomials defined at the Cartesian product of the Gauss–Lobatto
quadrature points, such that the degrees of freedom of the space Vpare point values at the Gauss–Lobatto
nodes. The standard bilinear form AVp= (u, v) for u, v Vpgives rise to the high-order stiffness
matrix AVp. In general, the total number of nonzeros in this matrix will scale like O(nep2d) where neis
the number of elements in the mesh, and the number of operations required to assemble this matrix ranges
from O(nep2d+1) to O(nep3d), depending on the algorithm used [33]. This memory and computational costs
are prohibitive for moderate to large values of p, and so instead a matrix-free approach is adopted, where
the action of AVpis computed without constructing the matrix. The action can be computed using sum
factorization with optimal O(nepd) memory usage and O(nepd+1) operations [36]; this matrix-free operator
evaluation approach is the key component of the high-performance finite element libraries software library
libCEED [6] and also provides the foundation for MFEM’s partial assembly algorithm [2].
Because of the prohibitive assembly costs, the matrix AVpis unavailable for preconditioner construction.
Instead, we construct an auxiliary low-order discretization matrix that is spectrally equivalent to the high-
order matrix. The low-order-refined space Vhis defined to be the p= 1 H1-conforming space on the low-
order-refined mesh Th, whose construction is described in Section 2.1. The degrees of freedom of Vhcoincide
with the degrees of freedom of the high-order space Vp. In both cases, the degrees of freedom represent
point values at the Gauss–Lobatto points; in the high-order case, the associated basis functions are degree-p
polynomials defined on the elements of the coarse mesh Tp, and in the low-order case, the associated basis
functions are the standard “hat functions” defined on the refined mesh Th. Given the low-order space Vh,
it is possible to assemble the low-order-refined stiffness matrix AVh. We briefly summarize some important
properties of the matrices AVhand AVp.
4 PAZNER, KOLEV, AND CAMIER
Since the high-order and low-order-refined degrees of freedom coincide, AVhand AVpare matrices of
the same size.
The low-order matrix AVhhas O(1) nonzeros per row. On the other hand, the high-order matrix
AVphas O(p2d) nonzeros per row.
AVhand AVpare spectrally equivalent, independent of p. This spectral equivalence is often known
as FEM–SEM equivalence, and was proven in [7] and [37].
Since AVhand AVpare spectrally equivalent, any uniform preconditioner Bhof AVh(i.e. such that the
condition number of BhAVhis bounded independently of problem size and other discretization parameters)
will also be an effective preconditioner for AVp. In this work, we take Bhto be one V-cycle of algebraic
multigrid (AMG) constructed using the assembled matrix AVh. Other choices for Bhare possible, including
geometric or semi-structured multigrid, domain decomposition, or even sparse direct solvers, but in this
work we restrict ourselves to AMG methods. Note that the same considerations regarding sparsity also
apply to the high-order and low-order-refined mass matrices (denoted MVpand MVh, respectively). These
mass matrices are spectrally equivalent (with constants of equivalence independent of polynomial degree
p), and so low-order-refined preconditioning can be also applied to problems involving non-negative linear
combinations of the mass and stiffness matrices.
2.3. Vector finite elements. While the approach described above can be used to obtain spectrally equiv-
alent low-order-refined preconditioners using nodal H1discretizations for Poisson problems, the situation is
more delicate for Maxwell and grad-div problems in H(curl) and H(div). In these spaces, nodal bases using
Gauss–Lobatto or Gauss–Legendre points do not give rise to spectrally equivalent low-order discretizations,
and as a result, the condition number of the preconditioned system grows rapidly, leading to large iteration
counts.
Instead, an interpolation–histopolation basis for the high-order H(curl) and H(div) spaces must be used
[12, 40]. The resulting basis functions are piecewise polynomials that take on prescribed mean values over
subcell edges (in the case of H(curl)) or subcell faces (in the case of H(div)). These basis functions were
introduced in [28] in the context of mimetic methods. When this basis is used, spectral equivalence of the
high-order and low-order-refined stiffness matrices (and mass matrices) are recovered for vector finite element
spaces. The construction of low-order-refined preconditioners for this case is discussed at length in [40]. We
note that the lowest-order case of the interpolation–histopolation bases reduces exactly to the standard
lowest-order N´ed´elec and Raviart–Thomas elements. As in the case of the Poisson problem, any effective
preconditioner constructed using the low-order-refined system matrix can be used to effectively precondition
the high-order system. However, even in the low-order case, the construction of preconditioners for these
problems is more challenging. Typically, multigrid methods with specialized smoothers [3] or auxiliary space
methods [22] are required to achieve good convergence. Analogous to the use of algebraic multigrid for the
Poisson problem, in this work we focus on the use of auxiliary space AMG methods for these problems;
in particular, we use the AMG-based algebraic Maxwell solver (AMS) for H(curl) problems [26], and the
auxiliary divergence solver (ADS) for H(div) problems [27]. These solvers are based on combining auxiliary
space methods with algebraic multigrid, and are designed to work as black-box methods, requiring relatively
little discretization information beyond the assembled system matrix. The AMS H(curl) solver requires
two additional inputs: a vector of mesh vertex coordinates (used to construct the interpolation operator
mapping the vector H1to the N´ed´elec space; this interpolation matrix can also be provided directly), and
a discrete gradient matrix. In addition to the discrete gradient and interpolation, the ADS H(div) solver
additionally requires a discrete curl matrix, mapping from H(curl) into H(div). The construction of these
discrete matrices is also discussed in the following sections.
2.4. Adaptive mesh refinement and hanging node constraints. In the case of meshes with hanging
nodes (e.g. those resulting from nonconforming adaptive mesh refinement), the construction of the appro-
priate low-order-refined discretization is less obvious. In this case, the refinement procedure described in
Section 2.1 to obtain the low-order-refined mesh results in non-matching interfaces that do not directly lend
themselves to the construction of spectrally equivalent low-order discretizations, see Figure 2.
In [39], an approach for constructing low-order-refined preconditioners for discretizations posed on meshes
with hanging nodes based on variational restriction was proposed. In this approach, the high-order system
GPU-ACCELERATED LOR PRECONDITIONING 5
(a) (b) (c) (d)
(a) Mesh with three elements and one hanging node.
(b) Low-order-refined mesh corresponding to p= 5.
(c) Adaptively refined mesh.
(d) Low-order-refined mesh corresponding to p= 3.
Figure 2. Illustration of nonconforming meshes with hanging nodes and their low-order-
refined counterparts.
is written as
(2) Ap= ΛTb
ApΛ,
where Λ is the matrix that enforces the constraints at element interfaces, and b
Apis a block-diagonal matrix
acting on vectors of unconstrained (duplicated) degrees of freedom with element matrices b
Ap,i on the diagonal.
The matrix Λ is often known as an assembly matrix; in the case where the mesh contains no hanging nodes
(and without p-refinement), the matrix Λ is a boolean matrix encoding the local (elementwise) to global
DOF mapping. When the mesh contains hanging nodes, constraints must be enforced at the nonconforming
interfaces, resulting in a more complicated Λ matrix; see also [10] for more details.
The low-order-refined counterpart b
Ah,i of each element matrix b
Ap,i can be assembled to obtain a block-
diagonal matrix b
Ah. The diagonal blocks of b
Ahare sparse, whereas the blocks of b
Apare in general dense.
Then, a low-order-refined preconditioner can be formed by
(3) Ah= ΛTb
AhΛ,
where the same matrix Λ is used in (2) and (3). In the case of conforming meshes, the matrix Ahdefined by (3)
is identical to that obtained through the process described in Section 2.2. When the mesh is nonconforming,
a simple variational argument shows that Ahis spectrally equivalent to Ap, and therefore can be used as a
preconditioner. The implementation of such preconditioners requires only the construction of the constraint
matrix Λ, and the computation of the triple product in (3). Then, AMG or other algebraic preconditioners
can be constructed using the assembled matrix Ah, and used to precondition the high-order system defined
by Ap.
3. GPU algorithms
The LOR-based solution procedure can broadly be divided into two phases. The setup phase, performed
before the beginning of the conjugate gradient iteration, consists of problem setup, matrix assembly, and
construction of the algebraic multigrid preconditioner. The solve phase consists of the preconditioned con-
jugate gradient iteration, which at every step requires the matrix-free application of the high-order operator
and the application of the algebraic multigrid V-cycle. In this section we describe the approach for GPU-
accelerated algorithms for each of these components. We summarize the algorithmic steps that make up the
solution algorithm:
S1. Problem setup:
S1.1. High-order operator setup;
S1.2. Low-order-refined matrix assembly;
S1.3. Algebraic multigrid setup.
S2. Preconditioned conjugate gradient, including at every iteration:
S2.1. High-order operator evaluation;
S2.2. Algebraic multigrid V-cycle.
摘要:

END-TO-ENDGPUACCELERATIONOFLOW-ORDER-REFINEDPRECONDITIONINGFORHIGH-ORDERFINITEELEMENTDISCRETIZATIONSWILLPAZNER1;2,TZANIOKOLEV2,ANDJEAN{SYLVAINCAMIER2Abstract.Inthispaper,wepresentalgorithmsandimplementationsfortheend-to-endGPUaccelera-tionofmatrix-freelow-order-re nedpreconditioningofhigh-order nite...

展开>> 收起<<
END-TO-END GPU ACCELERATION OF LOW-ORDER-REFINED PRECONDITIONING FOR HIGH-ORDER FINITE ELEMENT DISCRETIZATIONS WILL PAZNER12 TZANIO KOLEV2 AND JEANSYLVAIN CAMIER2.pdf

共23页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:23 页 大小:3.88MB 格式:PDF 时间:2025-05-06

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 23
客服
关注