
From Task-Based GPU Work Aggregation to Stellar
Mergers: Turning Fine-Grained CPU Tasks into
Portable GPU Kernels
Gregor Daiß, Patrick Diehl§, Dominic Marcello, Alireza Kheirkhahan, Hartmut Kaiser, Dirk Pflüger
LSU Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, 70803 U.S.A.
IPVS, University of Stuttgart, 70174 Stuttgart, Germany
Email: Gregor.Daiss@ipvs.uni-stuttgart.de
§Department of Physics and Astronomy, Louisiana State University, Baton Rouge, LA, 70803 U.S.A.
Abstract—Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers,
we approach this with existing solutions: We employ HPX to
obtain fine-grained tasks to easily distribute work and finely
overlap communication and computation. For the computations
themselves, we use Kokkos to turn these tasks into compute
kernels capable of running on hardware ranging from a few
CPU cores to powerful accelerators. There is a missing link,
however: while the fine-grained parallelism exposed by HPX
is useful for scalability, it can hinder GPU performance when
the tasks become too small to saturate the device, causing low
resource utilization. To bridge this gap, we investigate multiple
different GPU work aggregation strategies within Octo-Tiger,
adding one new strategy, and evaluate the node-level performance
impact on recent AMD and NVIDIA GPUs, achieving noticeable
speedups.
Index Terms—HPX, HIP, CUDA, Kokkos, Work Aggregation,
Performance Portability, Task-based Programming.
I. INTRODUCTION
Currently, developers of High-Performance-Computing (HPC) applications are faced with a diverse set of supercomputers: Machines like Fugaku contain a massive number of compute nodes (158,976), though each node is comparatively weak with just 48 CPU cores. In contrast, machines like Perlmutter have far fewer compute nodes (1,536), but each node is much more powerful, containing four NVIDIA® A100 GPUs. This heterogeneity will be further increased by future machines like Frontier (using AMD® GPUs) and Aurora (using Intel® GPUs), highlighting the importance of both scalability and performance-portability for any HPC application.
One such application is Octo-Tiger, a C++ astrophysics code used to study stellar mergers [1]. It is built upon HPX [2], [3], a distributed asynchronous many-task runtime system (AMT). HPX allows us to distribute work as fine-grained tasks, easily overlapping computations and communication by defining their dependencies and letting the runtime system handle the concurrency.
However, the very fine-grained tasks that help to scale
Octo-Tiger to thousands of nodes cause difficulties regarding
performance-portability when running the application on GPU
platforms. On the one hand, the fine-grained tasks (usually
suited for one CPU core) allow us to express the maximum
amount of parallelism with HPX, to best enable the overlap of
computation and communications for widely distributed runs,
and to more easily handle small tasks/workloads occurring due
to the simulation’s adaptive mesh refinement. On the other
hand, to properly utilize GPUs, we need enough work items
per GPU kernel to both scale to all compute units (for instance
120 on an MI100 GPU) and to have sufficient resident work
items per compute unit to properly hide latencies. In short,
we would like to have large enough tasks to turn into GPU
kernels that do not starve the device. Faced with the conflict
between both fine-grained and large tasks, it is not enough to
merely have compute kernels that are capable of running both
on CPUs and GPUs. To achieve optimal performance, it is instead necessary to dynamically adjust the workload per task depending on which device it runs on.
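As a rough illustration (assuming, say, four resident wavefronts per compute unit to hide latency, a generic figure rather than one measured in Octo-Tiger, and the 64-wide wavefronts of AMD's CDNA architecture), an MI100 would want roughly

    120 CUs × 4 wavefronts/CU × 64 work items/wavefront = 30,720 work items

in flight at once, far more than a task sized for a single CPU core typically provides.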
In this work, we investigate and compare three work aggregation strategies to increase the workload per compute
kernel within Octo-Tiger, turning fine-grained tasks designed
for distributed CPU scenarios into a good match for GPUs.
Firstly, one conventional approach is to simply subdivide
the overall discretization of an application into larger sub-
problems when building for GPUs than we would for a CPU
build, hence increasing the workload per kernel. In the context
of this work, we will refer to this approach as "strategy 1".
Secondly, one can use the ability of GPUs to run multiple
independent kernels concurrently, hence relying on the GPU’s
runtime to implicitly aggregate the kernels on the device
side to avoid low resource utilization ("strategy 2"). This
approach heavily depends on the abilities of the GPU runtime
to handle large amounts of small kernels, as well as the
ability of the application itself to launch said kernels with
little overhead. Lastly, instead of relying on the GPU runtime
to run independent kernels in parallel as for strategy 2, we
can use explicit work aggregation. Thus, we aim to combine similar but independent kernels on-the-fly into one larger compute kernel whenever the GPU is starved ("strategy 3").
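To make the idea behind strategy 3 concrete, the following standalone C++ sketch buffers small work items and launches them as one combined "kernel". All names here are hypothetical, and the actual implementation introduced later aggregates based on whether the GPU is currently busy rather than on a fixed buffer size:

    #include <cstdio>
    #include <functional>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in for an aggregation executor: fine-grained tasks enqueue
    // their small workloads here instead of each launching its own tiny GPU kernel.
    class aggregation_buffer {
    public:
        explicit aggregation_buffer(std::size_t max_slots) : max_slots_(max_slots) {}

        void enqueue(std::function<void()> work_item) {
            slots_.push_back(std::move(work_item));
            if (slots_.size() == max_slots_)  // enough work collected: launch combined kernel
                flush();
        }

        // Launches all buffered work items as a single, larger "kernel" (here: a plain loop).
        void flush() {
            if (slots_.empty()) return;
            std::printf("aggregated kernel with %zu work items\n", slots_.size());
            for (auto& item : slots_) item();
            slots_.clear();
        }

    private:
        std::size_t max_slots_;
        std::vector<std::function<void()>> slots_;
    };

    int main() {
        aggregation_buffer buf(4);  // combine up to 4 task workloads per launch
        for (int task = 0; task < 10; ++task)
            buf.enqueue([task] { std::printf("  work item of task %d\n", task); });
        buf.flush();  // drain the remainder once no further tasks arrive
    }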
Past work porting HPX-based codes such as Octo-Tiger
to GPU platforms relied on the second strategy: The GPU implementation of Octo-Tiger was first introduced in [4] for Octo-Tiger's gravity solver. It combines multiple NVIDIA® CUDA® streams with an HPX-CUDA integration treating CUDA kernels as HPX tasks, achieving
good performance.
In turn, in [5] a similar CUDA implementation of Octo-
Tiger’s second major module, the hydrodynamics solver, was
added. It used the same work aggregation strategy as the gravity solver; however, we showed that increasing the work size per kernel launch (by subdividing the grid into larger-than-usual sub-grids, reminiscent of strategy 1) yielded a further node-level speedup for this solver. This indicates that the workload per compute kernel is still problematic in the new hydro solver implementation when just using strategy 2. Increasing the sub-grid size using strategy 1 furthermore came at the expense of scalability and adaptive refinement, hence motivating us to look for alternative work aggregation strategies beyond these first two.
In this work, we improve upon the state-of-the-art and
introduce an implementation for the explicit work aggregation
strategy (strategy 3), building on HPX and its accelerator
support. This allows us to launch GPU kernels through a
special executor that enables the aggregation of kernels into
larger kernels on-the-fly.
Additionally, we are moving from CUDA to Kokkos, providing performance portability across different systems [6]. The existing HPX-Kokkos integration layer allows us to treat Kokkos kernels as HPX tasks and lets HPX worker threads execute Kokkos kernels, effectively moving away from the fork-join model [7]. This has already been used for an implementation of the gravity solver. To achieve strategy 3 with the help of both HPX and Kokkos, we extend the hydrodynamics solver with a similar Kokkos implementation in this work. For a fair comparison on AMD GPUs, we also added a HIP version by simply reusing the CUDA kernels with the appropriate HIP API calls.
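As an illustration of the pattern only (this is not Octo-Tiger's actual hydro kernel, merely a minimal sketch), a Kokkos kernel serving such a task could look as follows; the same source compiles against a Serial or OpenMP backend for CPUs or a CUDA or HIP backend for GPUs:

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::ScopeGuard guard(argc, argv);  // initializes and finalizes the chosen backend

        const int n_cells = 512;  // e.g. the cells of one small sub-grid
        Kokkos::View<double*> rho("density", n_cells);
        Kokkos::View<double*> flux("flux", n_cells);

        // One portable kernel: the lambda body runs on whatever device the backend targets.
        Kokkos::parallel_for("update_cells", n_cells, KOKKOS_LAMBDA(const int i) {
            rho(i) += 0.5 * flux(i);  // placeholder cell update
        });
        Kokkos::fence();  // wait for the (possibly asynchronous) kernel to finish
        return 0;
    }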
This means that in this work we compare three different
kernel implementations of the hydrodynamics solver: CUDA,
HIP, and Kokkos. For each implementation, we look at results
from all three aforementioned GPU work aggregation strategies. Testing Octo-Tiger's node-level performance on both an A100 and an MI100 GPU, we show that a combination of strategies is vastly superior to the current status quo of using just a single one: We obtain clear speedups on both devices for Octo-Tiger. We further show that the strategies exhibit different performance behavior depending on the GPU vendor, highlighting the need to have alternative strategies at hand for performance-portability if work aggregation is needed.
Overall, this work has three main contributions: 1) The
novel on-the-fly work aggregation executor (implementing
strategy 3), 2) the implementation of the hydrodynamics
module in Kokkos, and 3) a thorough comparison of the
new work aggregation strategy with the two existing ones
using both the new Kokkos hydro kernels and their previous
CUDA counterparts. For Octo-Tiger itself, our contributions
lead to a significant speedup. Beyond Octo-Tiger, both the new
aggregation executor and the insights gained by comparing
different GPU work aggregation strategies can be used to find
optimal strategies in other HPX applications.
The remainder of this paper is structured as follows:
Section II reviews related work on task-based programming frameworks with GPU support. Section III introduces HPX, Section IV the scientific application Octo-Tiger. The three work
aggregation strategies are introduced in Section V, followed
by results and their extensive comparison on different systems
in Section VI.
II. RELATED WORK
In this section, we restrict the overview to AMTs with
accelerator support, namely CUDA, HIP, and Kokkos [8].
For a more general overview of AMTs, we refer to [9].
Table I summarizes the accelerator support. All of the AMTs with accelerator support target NVIDIA GPUs using CUDA [4], [10]–[14]. Support for AMD GPUs via HIP is provided solely by HPX. Uintah supports AMD GPUs via the Kokkos backend [15]. In addition to CUDA and HIP, HPX provides Kokkos support [7]. Most AMTs support accelerators nowadays. Let us now look at support for work aggregation. For Legion, aggregation of the memory bandwidth of multiple GPUs was shown for graph processing [16]. In addition, a novel dynamic load-balancing strategy that is cheap and achieves good load balance across GPUs is presented there. For Chapel, a GPUIterator [17], which supports
hybrid execution of parallel loops across CPUs and GPUs, is
available. However, these solutions are unlike our work, as
we use a bottom-up approach, aggregating small HPX tasks
on-the-fly into larger GPU kernels.
III. C++ STANDARD LIBRARY FOR PARALLELISM AND
CONCURRENCY
One asynchronous many-task runtime system with distributed capabilities is the C++ standard library for parallelism and concurrency, HPX [2]. One major difference of HPX from other AMTs is that HPX's API fully conforms to the recent and upcoming C++ standards [27]–[30]. Note that other AMTs are also written in the C++ programming language, but HPX's API follows the C++ standard's definitions for asynchronous programming and the parallel algorithms. We refer to [2], [31]–[33] for more details about HPX. In this paper, we use HPX for the following two purposes: 1) the coordination of the asynchronous execution of a multitude of heterogeneous tasks (both on CPUs and GPUs), thus managing local and distributed parallelism while observing all necessary data dependencies; and 2) as the parallelization infrastructure for launching HIP/CUDA kernels on the GPUs via the asynchronous HPX backend.
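As a minimal sketch of this futurized style (the tasks shown are hypothetical; in Octo-Tiger, completions of GPU kernels are exposed as HPX futures in the same way), assuming a recent HPX installation:

    #include <hpx/hpx_main.hpp>  // lets HPX bootstrap around main()
    #include <hpx/future.hpp>

    #include <iostream>

    int main() {
        // A fine-grained task, e.g. preparing boundary data for one sub-grid.
        hpx::future<double> boundary = hpx::async([] { return 1.0; });

        // A dependent task: scheduled by the runtime only once 'boundary' is ready,
        // so other ready tasks can run (and overlap communication) in the meantime.
        hpx::future<double> result = boundary.then(
            [](hpx::future<double> f) { return 2.0 * f.get(); });

        std::cout << "result: " << result.get() << "\n";  // prints 2
        return 0;
    }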
IV. OCTO-TIGER
As a prototypical adaptive mesh refinement (AMR) code
with non-trivial physics, we consider Octo-Tiger, an astrophysics code modelling stellar mergers [1].

TABLE I
ACCELERATOR SUPPORT FOR VARIOUS AMTS. WE RESTRICTED OURSELVES TO AMTS WITH SUPPORT FOR NVIDIA AND AMD GPUS.
HPX [2] Chapel [18] Charm++ Legion [19] Uintah [20] ParSec [21] StarPU [22] X10 [23] UPC++ [24]
CUDA X [4] X [11] X [25] X X [12] X [26] X [10] X [13] X
HIP X [7] X X
Kokkos X [7] X [15]

Octo-Tiger uses
a fast-multipole method (FMM) to solve for gravity [34]. The
implemented FMM globally conserves both linear and angular
momenta up to machine precision. To model and discretize
the hydrodynamics components, a finite volume method using
AMR is employed. Coupling the FMM with the hydro solver
allows global conservation of energy and linear momentum up
to machine precision, a major strength of Octo-Tiger.
A. Scientific Application and Previous Results
Octo-Tiger is designed to model interacting binary star
systems. A binary star system consists of two stars, bound
to one another by gravity. When they are close enough
together, they interact by exchanging mass. Sometimes this
mass transfer is stable and long-lived over millions of years.
Sometimes it is unstable, leading to a catastrophic disruption
of one of the binary’s components. When this happens and
if the system is massive enough, a Type Ia supernova results.
Less massive systems result in the merger of the disrupted
star with its companion, leading to the formation of another
star. The helium-rich R Coronae Borealis stars are thought to
originate from a merger of two white dwarfs.
Octo-Tiger models such systems as self-gravitating fluids, governed by the laws of hydrodynamics and Newtonian
gravity. The code has been used to investigate the origins
of R Coronae Borealis stars (e.g. [35], [36]), the merger
of bipolytropic stars [37], and the possibility that the star
Betelgeuse is the outcome of a merger [38]. Presently, Octo-Tiger is used for the investigation of merging double white dwarfs as well as the merger of a contact binary, V1309 Sco.
B. Hydro Solver
Octo-Tiger solves the inviscid Euler equations. This set
of hyperbolic differential equations governs the conservation
of mass, momentum, and energy as a fluid evolves. Octo-
Tiger is a grid-based code, using Cartesian adaptive mesh refinement to discretize the fluid variables. Octo-Tiger uses the piecewise-parabolic method [39] to compute the values of the evolved variables at 26 quadrature points on the surface of each computational cell: one at the center of each of the six cell faces and twelve cell edges, and one at each of the eight cell vertices. With
the reconstructed variables, the fluxes are computed at these
points using the central upwind method as described by [40].
They are integrated using Newton-Cotes quadrature to obtain
the total flux through a cell face. The maximum allowed time-
step size is related to the “Courant condition”: The time-step
size has to be at most the minimum time it takes a signal to
cross a computational cell’s width. Exceeding this time-step
size results in errors in the solution that grow rapidly with time.
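In a standard form (the exact signal speed and safety factor used by Octo-Tiger may differ), this condition can be written as

    \Delta t \le C_{\mathrm{CFL}} \, \min_{\text{cells}} \frac{\Delta x}{|u| + c_s},

where \Delta x is the cell width, |u| the local fluid speed, c_s the sound speed, and C_{\mathrm{CFL}} \le 1 a safety factor.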
(Figure 1 panels: (a) a single GPU stream; (b) multiple GPU streams; (c) the aggregation executor, built on CPPuddle, feeding an underlying GPU stream.)
Fig. 1. Aggregation strategies: (a) Larger sub-problems: Increasing sub-grid
size, (b) Implicit work aggregation: Interleaving independent GPU kernels by
using multiple GPU executors (streams), and (c) Explicit work aggregation:
Marking compatible tasks (blue) which might be aggregated into a larger kernel if the hardware is currently busy.
If we double the resolution of the model without altering the
model’s size, this signal crossing time will be roughly cut in
half, reducing the allowed time-step size by the same factor.
The AMR feature of Octo-Tiger is designed to refine around
interesting areas of the binary. One level of refinement is
assigned to each component as a whole, allowing a smaller
component to be modelled with higher resolution than its
companion. The cores of stars with core/envelope structures
can be given additional levels of refinement. Recently, the use of gradient-based refinement has also been investigated to additionally refine the stars' atmospheres.
C. Previous Scalability/Performance
Performance on up to 5400 GPUs and 64,800 cores on
CSCS’s Piz Daint was shown in [4]. Performance on up to
658,784 Intel Knight’s Landing cores with a parallel efficiency
of 96.8% using billions of asynchronous tasks was demon-
strated in [41] on NERSC’s Cori. Performance on ORNLs
Summit was shown in [5].
V. AGGREGATION STRATEGIES AND GPU
IMPLEMENTATION DETAILS
In this section, we first give some details about the implementation of the hydro solver, especially regarding the workload per GPU kernel. As these numbers motivate our need for larger numbers of GPU work items, we continue with the introduction of three strategies for increasing the size of the GPU kernels. For each of the strategies, we first introduce the high-level idea, mention the strategy's requirements, then briefly talk about the implementation details and their respective benefits and challenges. While the first two strategies have already been used with Octo-Tiger, the last strategy (strategy 3) is a novel addition of this work.