2 PAZNER, KOLEV, AND CAMIER
assembly of the auxiliary low-order system matrices, the construction and application of algebraic multigrid
preconditioners, and the matrix-free application of the high-order operator in the context of a preconditioned
conjugate gradient iteration. For H(curl) and H(div) problems, we use the AMS [26] and ADS [27]
preconditioners, respectively, whose construction requires additional problem-specific inputs including mesh
coordinates and discrete differential operators in matrix form; these inputs are also constructed on-device. We
provide comprehensive performance studies of these algorithms, measuring runtimes, kernel throughput, and
strong and weak parallel scalability of all components of the solution algorithm. Additionally, we compare
the relative weightings of the algorithmic components on GPU and CPU, where our results show that the
computational bottlenecks for GPU-accelerated solvers are significantly different than for CPU solvers. In
particular, solver setup is relatively more expensive (as a fraction of total time to solution) on GPU compared
with CPU; however, the LOR matrix assembly algorithms proposed presently ensure that the assembly of
the low-order-refined matrix is not a bottleneck.
The first work on LOR preconditioning dates to 1980, when Orszag proposed using second-order finite
difference methods to precondition certain spectral collocation methods [36]. Since then, there has been sig-
nificant work extending this concept, making use of low-order finite element discretizations to precondition
spectral and spectral element methods [7, 8, 11, 16]. The low-order discretization is often in turn precondi-
tioned using multigrid, in particular algebraic or semi-structured multigrid methods [5, 38]. Although these
multigrid-based methods typically perform well, an attractive feature of LOR preconditioning is that any
effective preconditioner for the low-order system can be used to precondition the high-order system. LOR
preconditioning has been successfully applied to spectral element and high-order finite element discretiza-
tions of the incompressible Navier–Stokes equations [17, 18], and has been extended to other discretizations,
including discontinuous Galerkin methods with hp-refinement [39]. Recently, spectrally equivalent LOR
preconditioners for Maxwell and grad-div problems in H(curl) and H(div) were developed [12, 40].
A novel aspect of this work is the use of a macro-element batching strategy for the assembly of the
low-order-refined system matrices, and the associated algorithms and data structures. There is significant
existing work in the literature on GPU-accelerated algorithms for the matrix-free evaluation of high-order
operators [1, 18, 32]; these algorithms have been a primary focus of the CEED project targeting exascale
architectures [24]. Similarly, GPU-accelerated algebraic multigrid algorithms have been developed as part of
the hypre [13, 14] and AmgX [35] libraries, among others. However, the efficient assembly of the low-order-
refined matrices on GPUs is a topic that has not been addressed in these previous works. The algorithms
and GPU implementations described presently are shown to significantly outperform more generic matrix
assembly approaches that do not take advantage of the macro-element-level regular structure and topology
of the low-order-refined discretizations.
The structure of this paper is as follows. In Section 2, we give an overview of the LOR preconditioning
approach for problems in H1,H(curl), and H(div), including on nonconforming meshes resulting from
adaptive refinement. In Section 3, we describe the GPU algorithms for each of the algorithmic steps in the
problem setup and the solution procedure. In Section 4, we provide several detailed performance studies of
the proposed methods, algorithms, and implementations. The results include kernel-level throughput bench-
marks, parallel scaling studies, and the illustration of these techniques on a more realistic and challenging
large-scale electromagnetics problem. Open-source code availability and a brief discussion of the software
interface is provided in Section 5. We end with conclusions in Section 6.
2. Low-order-refined preconditioning
In this section, we give a brief overview of low-order-refined preconditioning for high-order finite element
problems. We begin by defining the low-order-refined mesh in Section 2.1. Low-order preconditioning for H1-
conforming discretizations of the Poisson problem is described in Section 2.2. Preconditioning of problems
in H(curl) and H(div) are discussed in Section 2.3.
2.1. The low-order-refined mesh. Let Ω ⊆Rddenote the spatial domain, d∈ {1,2,3}. The domain is
discretized using a mesh Tpof tensor-product elements, such that each element κ∈ Tpis given by the image
of the reference element bκ= [−1,1]dunder a (typically isoparametric) transformation, i.e. κ=Tκ(bκ). Let
bxi∈[−1,1] denote the p+ 1 Gauss–Lobatto points. The d-fold Cartesian product of the points {bxi}is used
as a set of Lagrange interpolation points in the reference element bκ. These points are also used to define the
low-order-refined mesh Th. Each element κ∈ Tpis subdivided into pdtopologically Cartesian subelements,