Past work porting HPX-based codes such as Octo-Tiger to GPU platforms relied on the second strategy: The GPU implementation of Octo-Tiger was first introduced in [4] for Octo-Tiger's gravity solver. It combines multiple NVIDIA® CUDA® streams with an HPX-CUDA integration that treats CUDA kernels as HPX tasks, achieving good performance.
In turn, a similar CUDA implementation of Octo-Tiger's second major module, the hydrodynamics solver, was added in [5]. It used the same work aggregation strategy as the gravity solver; however, we showed that increasing the work size per kernel launch (by subdividing the grid into larger-than-usual sub-grids, reminiscent of strategy 1) yielded a further node-level speedup for this solver. This indicates that the workload per compute kernel remains problematic in the new hydro solver implementation when using strategy 2 alone. Further increasing the sub-grid size via strategy 1, however, came at the expense of scalability and adaptive refinement, motivating us to look for alternative work aggregation strategies beyond these first two.
In this work, we improve upon the state of the art and introduce an implementation of the explicit work aggregation strategy (strategy 3), building on HPX and its accelerator support. This allows us to launch GPU kernels through a special executor that aggregates kernels into larger kernels on-the-fly.
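To illustrate the core idea behind such an executor, the following minimal, CPU-only C++ sketch buffers small work slices submitted by independent tasks and, once enough slices have accumulated or a flush is forced, processes them as one batch; on a GPU, that batch would correspond to a single combined kernel launch. The class name work_aggregation_executor and all of its details are hypothetical and do not represent the actual HPX or Octo-Tiger API.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <mutex>
#include <vector>

// Hypothetical sketch: buffer small work items and flush them as one batch.
// On a GPU, flush_locked() would launch one combined kernel instead of a loop.
class work_aggregation_executor {
 public:
  explicit work_aggregation_executor(std::size_t max_slices)
      : max_slices_(max_slices) {}

  // Called by individual tasks; each task contributes one "slice" of work.
  void add_slice(std::function<void()> slice) {
    std::lock_guard<std::mutex> lock(mutex_);
    slices_.push_back(std::move(slice));
    if (slices_.size() == max_slices_) {
      flush_locked();  // enough work accumulated: launch one large "kernel"
    }
  }

  // Force launch of whatever has been aggregated so far.
  void flush() {
    std::lock_guard<std::mutex> lock(mutex_);
    flush_locked();
  }

 private:
  void flush_locked() {
    if (slices_.empty()) return;
    std::cout << "launching aggregated kernel with " << slices_.size()
              << " slices\n";
    for (auto& s : slices_) s();  // stand-in for one combined GPU kernel
    slices_.clear();
  }

  std::size_t max_slices_;
  std::mutex mutex_;
  std::vector<std::function<void()>> slices_;
};

int main() {
  work_aggregation_executor exec(4);
  for (int i = 0; i < 10; ++i) {
    exec.add_slice([i] { std::cout << "  slice " << i << "\n"; });
  }
  exec.flush();  // launch the remaining, partially filled batch
}
```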
Additionally, we move from CUDA to Kokkos, providing performance portability across different systems [6]. The existing HPX-Kokkos integration layer allows us to treat Kokkos kernels as HPX tasks and lets HPX worker threads execute Kokkos kernels, effectively moving away from the fork-join model [7]. It has already been used for an implementation of the gravity solver. To achieve strategy 3 with the help of both HPX and Kokkos, we extend the hydrodynamics solver with a similar Kokkos implementation in this work. For a fair comparison on AMD GPUs, we also added a HIP version by simply reusing the CUDA kernels with the appropriate HIP API calls.
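As a rough, hedged illustration of such a port (using a toy kernel, not one of Octo-Tiger's actual hydro kernels), the kernel body is left untouched while the surrounding runtime calls and the launch syntax are switched from the CUDA API to the corresponding HIP API:

```cpp
#include <hip/hip_runtime.h>

// Kernel body unchanged from the CUDA version; __global__ works in both.
__global__ void scale(double* x, double factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= factor;
}

int main() {
  constexpr int n = 1024;
  double* d_x = nullptr;
  hipMalloc(&d_x, n * sizeof(double));    // was: cudaMalloc
  hipMemset(d_x, 0, n * sizeof(double));  // was: cudaMemset

  // was: scale<<<n / 128, 128>>>(d_x, 2.0, n);
  hipLaunchKernelGGL(scale, dim3(n / 128), dim3(128), 0, nullptr,
                     d_x, 2.0, n);

  hipDeviceSynchronize();                 // was: cudaDeviceSynchronize
  hipFree(d_x);                           // was: cudaFree
  return 0;
}
```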
This means that in this work we compare three different kernel implementations of the hydrodynamics solver: CUDA, HIP, and Kokkos. For each implementation, we look at results from all three aforementioned GPU work aggregation strategies. Testing Octo-Tiger's node-level performance on both an NVIDIA A100 and an AMD MI100 GPU, we show that a combination of strategies is vastly superior to the current status quo of using just a single one: we obtain clear speedups on both devices for Octo-Tiger. We further show that the strategies exhibit different performance behavior depending on the GPU vendor, highlighting the need to have alternative strategies at hand for performance portability when work aggregation is needed.
Overall, this work has three main contributions: 1) the novel on-the-fly work aggregation executor (implementing strategy 3), 2) the implementation of the hydrodynamics module in Kokkos (a kernel sketch follows below), and 3) a thorough comparison of the new work aggregation strategy with the two existing ones, using both the new Kokkos hydro kernels and their previous CUDA counterparts. For Octo-Tiger itself, our contributions lead to a significant speedup. Beyond Octo-Tiger, both the new aggregation executor and the insights gained by comparing different GPU work aggregation strategies can be used to find optimal strategies in other HPX applications.
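To give a flavor of the kernel style behind contribution 2 (a minimal sketch with made-up names such as compute_flux, not an actual Octo-Tiger hydro kernel), a Kokkos kernel is written once against Kokkos::View and Kokkos::parallel_for and then compiled for the CUDA, HIP, or host backend selected at build time:

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1024;
    // Device-resident array; its memory space follows the enabled backend.
    Kokkos::View<double*> flux("flux", n);

    // Portable kernel: runs on CUDA, HIP, or host threads, depending on
    // how Kokkos was configured at build time.
    Kokkos::parallel_for(
        "compute_flux", Kokkos::RangePolicy<>(0, n),
        KOKKOS_LAMBDA(const int i) { flux(i) = 0.5 * i; });

    Kokkos::fence();  // wait for the kernel to finish
  }
  Kokkos::finalize();
}
```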
The remainder of this paper is structured as follows: Section II reviews related work on task-based programming frameworks with GPU support. Section III introduces HPX, and Section IV the scientific application Octo-Tiger. The three work aggregation strategies are introduced in Section V, followed by results and their extensive comparison on different systems in Section VI.
II. RELATED WORK
In this section, we restrict the overview to AMTs with accelerator support, namely via CUDA, HIP, and Kokkos [8]. For a more general overview of AMTs, we refer to [9]. Table I summarizes the accelerator support. All AMTs with accelerator support target NVIDIA GPUs using CUDA [4], [10]–[14]. Support for AMD GPUs via HIP is provided solely by HPX, while Uintah supports AMD GPUs via its Kokkos backend [15]. In addition to CUDA and HIP, HPX provides Kokkos support [7]. Most AMTs support accelerator cards nowadays. Regarding work aggregation, Legion demonstrated aggregating the memory bandwidth of multiple GPUs for graph processing [16]; in addition, a novel, cheap dynamic load balancing strategy achieving good load balance across GPUs was presented. For Chapel, a GPUIterator [17], which supports hybrid execution of parallel loops across CPUs and GPUs, is available. These solutions differ from our work, however, as we use a bottom-up approach, aggregating small HPX tasks on-the-fly into larger GPU kernels.
III. C++ STANDARD LIBRARY FOR PARALLELISM AND
CONCURRENCY
One asynchronous many-task runtime system with distributed capabilities is HPX, the C++ standard library for parallelism and concurrency [2]. One major difference between HPX and other AMTs is that HPX's API fully conforms to the recent and upcoming C++ standards [27]–[30]. Note that other AMTs are also written in the C++ programming language, but HPX's API follows the C++ standard's definitions for asynchronous programming and parallel algorithms. We refer to [2], [31]–[33] for more details about HPX. In this paper, we use HPX for the following two purposes: 1) the coordination of the asynchronous execution of a multitude of heterogeneous tasks (both on CPUs and GPUs), thus managing local and distributed parallelism while observing all necessary data dependencies; and 2) as the parallelization infrastructure for launching HIP/CUDA kernels on the GPUs via the asynchronous HPX backend.
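As a small, hedged example of purpose 1 (standard-conforming futures and continuations; the tasks here are trivial placeholders), HPX lets us express data dependencies between asynchronous tasks without blocking worker threads:

```cpp
#include <hpx/hpx_main.hpp>  // lets HPX bootstrap around main()
#include <hpx/hpx.hpp>

#include <iostream>
#include <utility>

int main() {
  // Two independent tasks, scheduled asynchronously on HPX worker threads.
  hpx::future<int> a = hpx::async([] { return 21; });
  hpx::future<int> b = hpx::async([] { return 21; });

  // A continuation encodes the data dependency: it only runs once both
  // inputs are ready, without blocking a worker thread in the meantime.
  hpx::future<int> sum =
      hpx::when_all(std::move(a), std::move(b)).then([](auto&& both) {
        auto futures = both.get();
        return hpx::get<0>(futures).get() + hpx::get<1>(futures).get();
      });

  std::cout << "result: " << sum.get() << "\n";
  return 0;
}
```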
IV. OCTO-TIGER
As a prototypical adaptive mesh refinement (AMR) code
with non-trivial physics we consider Octo-Tiger, an astro-
physics code modelling stellar mergers [1]. Octo-Tiger uses