
From Merging Frameworks to Merging Stars:
Experiences using HPX, Kokkos and SIMD Types
Gregor Daiß†, Srinivas Yadav Singanaboina∗, Patrick Diehl∗§, Hartmut Kaiser∗, Dirk Pflüger†
∗LSU Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, 70803, U.S.A.
†IPVS, University of Stuttgart, Stuttgart, 70174 Stuttgart, Germany
Email: Gregor.Daiss@ipvs.uni-stuttgart.de
§Department of Physics and Astronomy, Louisiana State University, Baton Rouge, LA, 70803 U.S.A.
Abstract—Octo-Tiger, a large-scale 3D AMR code for the merger of stars, uses a combination of HPX, Kokkos and explicit SIMD types, aiming to achieve performance portability for a broad range of heterogeneous hardware. However, on A64FX CPUs, we encountered several missing pieces that hindered SIMD vectorization and thus performance. Therefore, we add std::experimental::simd as an option to use in Octo-Tiger’s Kokkos kernels alongside Kokkos SIMD, and further add a new SVE (Scalable Vector Extensions) SIMD backend. Additionally, we supply the missing SIMD implementations in the Kokkos kernels within Octo-Tiger’s hydro solver. We test our changes by running Octo-Tiger on three different CPUs, an A64FX, an Intel Icelake and an AMD EPYC CPU, evaluating SIMD speedup and node-level performance. We obtain a good SIMD speedup on the A64FX CPU, as well as noticeable speedups on the other two CPU platforms. However, we also encounter a scaling issue on the EPYC CPU.
Index Terms—HPX, Kokkos, SIMD, astrophysical simulation
I. INTRODUCTION
Astrophysical applications drive the need for high-performance computing. They require vast computational power for their simulations, as well as numerous developer hours to ensure they run efficiently on the ever-changing set of current hardware platforms.
Octo-Tiger is an astrophysical code used to simulate binary star systems and their eventual outcomes [1]. Octo-Tiger is built on HPX [2] for task-based programming to scale to all cores within one compute node and, beyond that, to thousands of other compute nodes in distributed scenarios. Using this approach, Octo-Tiger achieved scalability on Cori [3] and, more recently, on Piz Daint [4] and Summit [5].
Consequently, given these last two machines, the most
recent work on Octo-Tiger was focused on porting the original
CPU-only implementation of Octo-Tiger to GPUs. To have
portable GPU kernels, much of this was done with Kokkos
and an HPX-Kokkos integration to allow us to use Kokkos
kernels as HPX tasks. To improve the CPU performance of said Kokkos kernels, we used the SIMD (Single Instruction Multiple Data) types provided by Kokkos [6]. While these types worked well in Octo-Tiger, they were only used in the Kokkos kernels of one solver and only tested on older CPU platforms [7].
In turn, this work focuses not on running Octo-Tiger on
GPUs, but instead on our development process modifying
Octo-Tiger to run efficiently on Fujitsu A64FX™ CPUs. This is
done to prepare Octo-Tiger for experimental runs on Fugaku.
Hence, we will use the current development snapshot of
Octo-Tiger to investigate how well our utilized execution
model (a mixture of HPX, Kokkos, and explicit vectorization)
works on modern CPUs and share some additions we had
to make when porting to A64FX. Although said mixture of frameworks already afforded us a certain degree of portability, we had to supplement some missing ingredients in this work to make Octo-Tiger run more efficiently on A64FX CPUs and make use of their SVE SIMD instructions:
1) To make a wider range of SIMD backends accessible, we integrated std::experimental::simd within our Kokkos kernels (while maintaining compatibility with the Kokkos SIMD types).
2) We add std::experimental::simd-compatible SVE types, allowing for explicit vectorization on A64FX CPUs.
3) Some of our Kokkos kernels simply did not yet use explicit vectorization with SIMD types, as they had only been used on GPUs so far. Therefore, we supplement those kernels (two kernels in our new hydro solver) with these types and apply SIMD masking where required.
To test these new additions, we run node-level tests on
an A64FX CPU, determining node-level scaling and SIMD
speedup using various backends (both for the entire application
and the most important CPU kernels). We investigate the
speedup of both the new hydro SIMD implementation and
the speedup of the surrounding hydro solver in a hydro-only
compute scenario.
We further use this opportunity to run the same tests (using
different SIMD types) on recent Intel and AMD CPUs, to
evaluate the performance on current CPUs, as previous CPU
results were gathered on now-outdated hardware with fewer
cores [7]. Moreover, those previous results did not yet include
the SIMD additions to the hydro solver from this work.
This paper is structured as follows: Section II briefly introduces the real-world astrophysics application: Octo-Tiger.
Section III emphasizes the technology used in Octo-Tiger
and how its components fit together to create a powerful yet
portable execution model. Section IV describes the changes
that we had to make for this to work for A64FX. Section V
lists the performance measurements conducted with the real-
world application. Section VI focuses on the related work.
Finally, Section VII concludes the work and outlines the next
steps.
arXiv:2210.06439v2 [cs.DC] 9 May 2023