From Merging Frameworks to Merging Stars:
Experiences using HPX, Kokkos and SIMD Types
Gregor Daiß†, Srinivas Yadav Singanaboina, Patrick Diehl§, Hartmut Kaiser, Dirk Pflüger†
LSU Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, 70803, U.S.A.
†IPVS, University of Stuttgart, 70174 Stuttgart, Germany
Email: Gregor.Daiss@ipvs.uni-stuttgart.de
§Department of Physics and Astronomy, Louisiana State University, Baton Rouge, LA, 70803, U.S.A.
Abstract—Octo-Tiger, a large-scale 3D AMR code for the
merger of stars, uses a combination of HPX, Kokkos and
explicit SIMD types, aiming to achieve performance-portability
for a broad range of heterogeneous hardware. However, on
A64FX CPUs, we encountered several missing pieces, hindering
performance by causing problems with the SIMD vectorization.
Therefore, we add std::experimental::simd as an option to use
in Octo-Tiger’s Kokkos kernels alongside Kokkos SIMD, and
further add a new SVE (Scalable Vector Extensions) SIMD
backend. Additionally, we amend missing SIMD implementations
in the Kokkos kernels within Octo-Tiger’s hydro solver. We test
our changes by running Octo-Tiger on three different CPUs: an
A64FX, an Intel Icelake and an AMD EPYC CPU, evaluating
SIMD speedup and node-level performance. We get a good SIMD
speedup on the A64FX CPU, as well as noticeable speedups on
the other two CPU platforms. However, we also experience a
scaling issue on the EPYC CPU.
Index Terms—HPX, Kokkos, SIMD, astrophysical simulation
I. INTRODUCTION
Astrophysical applications drive the need for high-
performance computing. They require a lot of computational power for their simulations, as well as numerous developer hours to ensure they run efficiently on the ever-changing set of current hardware platforms.
Octo-Tiger is an astrophysical simulation code used to model binary star systems and their eventual outcomes [1]. Octo-
Tiger is built on HPX [2] for task-based programming to scale
to all cores within one compute node and, beyond that, to
thousands of other compute nodes in distributed scenarios.
Using this, Octo-Tiger achieved scalability on Cori [3], and
more recently Piz Daint [4] and Summit [5].
Consequently, given these last two machines, the most
recent work on Octo-Tiger was focused on porting the original
CPU-only implementation of Octo-Tiger to GPUs. To have
portable GPU kernels, much of this was done with Kokkos
and an HPX-Kokkos integration to allow us to use Kokkos
kernels as HPX tasks. To improve CPU performance of said
Kokkos kernels, we used SIMD (Single Instruction Multiple
Data) types provided by Kokkos [6]. While working well in
Octo-Tiger, this was only used in the Kokkos kernels of one
solver and only tested on older CPU platforms [7].
In turn, this work focuses not on running Octo-Tiger on GPUs, but on our development process of modifying Octo-Tiger to run efficiently on Fujitsu A64FX CPUs. This is done to prepare Octo-Tiger for experimental runs on Fugaku.
Hence, we will use the current development snapshot of
Octo-Tiger to investigate how well our utilized execution
model (a mixture of HPX, Kokkos, and explicit vectorization)
works on modern CPUs and share some additions we had
to make when porting to A64FX. Although said mixture of
frameworks already gained us a certain degree of portability,
we had to supplement some missing ingredients in this work
to make Octo-Tiger run more efficiently on A64FX CPUs and
make use of their SVE SIMD instructions:
1) To make a wider range of SIMD backends accessible, we integrate std::experimental::simd within our Kokkos kernels (while maintaining compatibility with the Kokkos SIMD types).
2) We add std::experimental::simd-compatible SVE types, allowing for explicit vectorization on A64FX CPUs.
3) Some of our Kokkos kernels simply did not yet use the
explicit vectorization with SIMD types, as they have only
been used on GPUs so far. Therefore, we supplement
those kernels (two kernels in our new hydro solver) with
these types and apply SIMD masking where required.
To test these new additions, we run node-level tests on
an A64FX CPU, determining node-level scaling and SIMD
speedup using various backends (both for the entire application
and the most important CPU kernels). We investigate the
speedup of both the new hydro SIMD implementation and
the speedup of the surrounding hydro solver in a hydro-only
compute scenario.
We further use this opportunity to run the same tests (using
different SIMD types) on recent Intel and AMD CPUs, to
evaluate the performance on current CPUs, as previous CPU
results were gathered on now-outdated hardware with fewer
cores [7]. Moreover, those previous results did not yet include
the SIMD additions to the hydro solver from this work.
This paper is structured as follows: Section II briefly in-
troduces the real-world astrophysics application: Octo-Tiger.
Section III emphasizes the technology used in Octo-Tiger
and how its components fit together to create a powerful yet
portable execution model. Section IV describes the changes
that we had to make for this to work for A64FX. Section V
lists the performance measurements conducted with the real-
world application. Section VI focuses on the related work.
Finally, Section VII concludes the work and outlines the next
steps.
Fig. 1. Flow on the surface of the two stars. In the center we see the
aggregation belt with the mass transfer from the smaller star to the larger
star.
II. SCIENTIFIC APPLICATION: OCTO-TIGER
From the astrophysics perspective, our interest is in binary star systems and their eventual outcomes, especially white dwarf mergers and the contact binary V1309 Sco. For the latter, the emission of red light during the merger was observed. To better understand this process, self-consistent simulations are necessary, with a resolution high enough to resolve the stellar atmosphere and to extract a simulated light curve. This would allow for a direct comparison with the observational data. Furthermore, with a better understanding of the light curve of V1309 Sco, it will be possible to reliably simulate the light curves of other stellar mergers.
In the following, we give a brief overview of the solvers and the data structure Octo-Tiger uses to simulate these binary star systems.
a) Solvers: The star system is modeled as a self-
gravitating, astrophysical fluid. To solve the system, Octo-
Tiger uses a coupled hydro solver and gravity solver. The in-
viscid Navier-Stokes equations are solved using finite volumes
in the hydro solver [8]. In turn, Newtonian gravity is solved
using a modified Fast Multipole Method (FMM) in the gravity
solver [9].
The most important Octo-Tiger compute kernels are part of those two solvers. In the hydro solver, there is the reconstruct kernel, which uses the piecewise-parabolic method to compute the values of the evolved variables at 26 quadrature points. Furthermore, there is the flux kernel, which uses those values to compute the final flux using Newton-Cotes quadrature. In turn, the gravity solver contains multiple compute kernels calculating the same-level interactions between close-by cells (the most compute-intensive FMM step): the Monopole kernel for non-refined cells and the Multipole kernel (with its rho variant for angular momentum correction and its root variant, a specialization for the root sub-grid). For brevity, we refer to [1] for more details
on the solvers. For more details on gravity solver kernels in
particular, we refer to [10], [11]. Lastly, for a convergence
study of a production run and more details about the hydro
kernels, we refer to [12].
b) Data-Structure: Octo-Tiger uses adaptive mesh refine-
ment (AMR) to focus on the area of interest, the atmosphere
between the two stars with the aggregation belt, see Figure 1.
For efficient computations, an adaptive octree with a Cartesian
sub-grid in each leaf is used. By default, each sub-grid contains 512 cells within an 8×8×8 cube. Hence, each compute kernel invocation operates on only one sub-grid, using HPX to launch many such kernels concurrently for the available sub-grids. The cube size of the sub-grids can be configured at compile time. The performance benefits of
various sub-grid sizes were studied in [5].
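To make this layout concrete, the following minimal C++ sketch shows how such a leaf could be represented; the class and member names (and the hard-coded default size) are illustrative assumptions, not Octo-Tiger's actual data structures.

#include <array>
#include <memory>
#include <vector>

// Illustrative only: an octree node whose leaves hold one Cartesian sub-grid.
// Octo-Tiger's real classes differ; the 8x8x8 default is taken from the text
// above and is configurable at compile time there.
constexpr int subgrid_dim = 8;
constexpr int subgrid_cells = subgrid_dim * subgrid_dim * subgrid_dim; // 512

struct subgrid_leaf {
    // One value per cell for each evolved variable (density, momenta, ...).
    std::vector<std::array<double, subgrid_cells>> variables;
};

struct octree_node {
    // Refined nodes own eight children; unrefined (leaf) nodes own a sub-grid.
    std::array<std::unique_ptr<octree_node>, 8> children;
    std::unique_ptr<subgrid_leaf> leaf; // set only if this node is a leaf
};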
III. HPX, KOKKOS AND SIMD TYPES
Octo-Tiger is built using a combination of frameworks that
make up our execution model: We use HPX for task-based
parallelization and distributed computing. We use Kokkos for
writing compute kernels that are portable between various
CPU and GPU platforms. Lastly, we use SIMD types (pro-
vided by Kokkos) for explicit vectorization on CPUs, whilst
still supporting GPU execution (via instantiation of the SIMD
template type with scalar types).
In this section, we briefly cover these utilized frameworks
and how they can fit together to complement each other.
A. HPX
HPX is an Asynchronous Many-Task Runtime system that
is implemented in C++ [2], [13]. The library implements all
APIs related to concurrency and parallelism as mandated by
recent ISO standards C++20 and C++23 in a conforming
way. In this context, it implements all the (more than 100)
parallel algorithms as described in the C++ standard. It has
been described in detail in other publications, such as [14],
[15].
In the context of this paper, HPX has been used as an
underlying runtime platform for the Octo-Tiger astrophysics
application used as the domain science driver (see [4] for
more details), thus managing local and distributed parallelism
while observing all necessary data dependencies. Data and
task dependencies can be expressed with HPX futures, and
chained together in an execution graph. This graph can be
built asynchronously, with the HPX worker threads processing
the tasks when their dependencies are fulfilled. This task-
based programming model is particularly useful for parallel
implementations of adaptive, tree-based codes like Octo-Tiger,
as we can build the task graph quickly when traversing the tree
to make concurrent work available to the system.
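As a minimal sketch of this style, the following hypothetical example chains two stand-in tasks with hpx::async and a continuation; the task functions are made up, and only HPX's standard-conforming future API is assumed (header names can vary between HPX versions).

#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>

// Hypothetical per-sub-grid work items standing in for Octo-Tiger's solver tasks.
double compute_gravity(int subgrid_id) { return 1.0 * subgrid_id; }
double compute_hydro(double gravity_result) { return 2.0 * gravity_result; }

int main() {
    // Launch the gravity step as an asynchronous HPX task ...
    hpx::future<double> gravity = hpx::async(compute_gravity, 42);

    // ... and chain the hydro step onto it; an HPX worker thread runs it
    // as soon as its dependency (the gravity future) becomes ready.
    hpx::future<double> hydro = gravity.then(
        [](hpx::future<double>&& g) { return compute_hydro(g.get()); });

    // A tree traversal can keep building such chains; here we simply wait.
    return hydro.get() > 0.0 ? 0 : 1;
}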
We also use the performance monitoring library APEX [16]
that is well integrated with HPX. APEX can be applied to
capture a combination of task-based events with hardware
counter information for optimizing HPX on different hardware
platforms. APEX can further measure the runtime of annotated HPX tasks, yielding mean execution times that can be used to determine speedups for specific parts of the code (such as compute kernels with and without SIMD).
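For illustration, a task can be wrapped so that APEX attributes its runtime to a readable name; the sketch below assumes HPX's annotated_function utility (its exact namespace and header have moved between HPX versions) and a made-up kernel stand-in.

#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>

// Made-up stand-in for a CPU compute kernel we want to time via APEX.
double flux_kernel_stand_in() { return 3.14; }

int main() {
    // Wrapping the callable attaches the name "flux_kernel" to the task, so
    // APEX can report mean execution times under that name (assumption:
    // hpx::annotated_function, which has moved between headers/namespaces
    // across HPX versions).
    auto annotated = hpx::annotated_function(flux_kernel_stand_in, "flux_kernel");
    hpx::future<double> f = hpx::async(annotated);
    return f.get() > 0.0 ? 0 : 1;
}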
B. HPX with Kokkos
For portability, we use Kokkos’ [17] abstractions for mem-
ory and execution within Octo-Tiger. Kokkos allows us to
easily write compute kernels that work on both CPU and
GPU, allowing us to run the same kernel implementation on
NVIDIA® and AMD® GPUs, and even on the CPU if desired.
Usually, these kernels would be launched in a fork-join man-
ner using Kokkos, for example using an OpenMP execution
space on the CPU to achieve concurrency. However, there exist
two HPX-Kokkos integrations, which allow us to avoid this
and use Kokkos in a more task-based fashion.
The first one is the Kokkos HPX execution space. Kernels
launched within this execution space are split into HPX tasks,
which will be processed by the existing HPX worker threads
(hence there is no need for multiple competing thread pools).
However, this alone does not provide the functionality to
launch Kokkos functions asynchronously and integrate them
within HPX’s execution graph. For that, we use the second
integration: HPX-Kokkos [7]. This integration (and its execu-
tors) allow us to obtain hpx:: futures for asynchronous Kokkos
kernel (and function) launches, facilitating their integration
with HPX’s asynchronous execution graph.
Together, these integrations allow us to write portable
Kokkos kernels that we can launch from arbitrary HPX tasks
and treat as HPX tasks themselves. This way, we can express
dependencies between kernels (and other tasks) using HPX
futures, automatically triggering new tasks when an asyn-
chronously launched kernel finishes. Furthermore, this way the
Kokkos kernels make use of HPX resources for CPU execution
(namely the existing worker threads).
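The following sketch shows the intended usage pattern; the HPX-Kokkos entry point (parallel_for_async) and header name are reproduced from memory and may differ between versions, and the kernel itself is a made-up stand-in rather than one of Octo-Tiger's.

#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <hpx/kokkos.hpp>   // HPX-Kokkos integration; header name is an assumption
#include <Kokkos_Core.hpp>

int main() {
    Kokkos::initialize();
    {
        Kokkos::View<double*> data("data", 512);

        // Launch a Kokkos kernel and receive an HPX future for its completion
        // (parallel_for_async as provided by HPX-Kokkos [7]; the exact
        // signature may differ between versions).
        auto done = hpx::kokkos::parallel_for_async(
            Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(0, 512),
            KOKKOS_LAMBDA(int i) { data(i) = 2.0 * i; });

        // Chain follow-up work onto the kernel via the returned future.
        auto next = done.then([](auto&&) { /* e.g. launch the next kernel */ });
        next.get();
    }
    Kokkos::finalize();
    return 0;
}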
C. Explicit Vectorization with the Kokkos SIMD types
To gain truly portable kernels that run well on both the
CPU and GPU, we need to take SIMD vectorization into
account. Modern CPUs derive a large share of their potential floating-point performance from their ability to run instructions on multiple data items at once (SIMD). While the compiler
can use this automatically in certain circumstances, it is often
more reliable to use the appropriate instructions directly to
ensure SIMD usage. Since different hardware platforms often
use different instruction sets (AVX, AVX512, NEON and SVE
to name some examples), abstractions using C++ types have
been developed to increase portability and ease-of-use. Kokkos
itself includes such SIMD types1, notably capable of using
vectorization on the CPU by compiling to the correct SIMD
instructions, but also able to use scalar types in case the same
compute kernel is compiled for GPU usage [6].
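The general pattern looks roughly as follows; this sketch uses std::experimental::simd (which we integrate in Section IV) and a made-up kernel, not one of Octo-Tiger's, and assumes the array length is a multiple of the SIMD width.

#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Kernel body templated on the vector type: instantiating it with
// stdx::native_simd<double> vectorizes on the CPU (AVX512, SVE, ...), while
// a scalar instantiation (stdx::simd<double, stdx::simd_abi::scalar>) keeps
// the identical code usable when the kernel is compiled for the GPU.
template <typename simd_t>
void scale_kernel(const double* in, double* out, std::size_t n, double factor) {
    constexpr std::size_t width = simd_t::size();
    for (std::size_t i = 0; i < n; i += width) {
        simd_t v;
        v.copy_from(in + i, stdx::element_aligned); // load 'width' elements
        v = v * simd_t(factor);                     // one operation per lane group
        v.copy_to(out + i, stdx::element_aligned);  // store them back
    }
}

// CPU instantiation with the platform's native SIMD width.
template void scale_kernel<stdx::native_simd<double>>(const double*, double*,
                                                      std::size_t, double);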
D. Utilizing the Frameworks within Octo-Tiger
Octo-Tiger was built to rely on HPX for parallelization
and distributed computing, with each sub-grid in the oc-
tree representing an HPX component which can be placed
on arbitrary compute nodes. However, its compute kernels
changed significantly over the last two years. In the past,
Octo-Tiger used separate compute kernels for CPU execution
(using Vc for SIMD types) and for the GPU execution (using
NVIDIA® CUDA®) [4], [11]. In an effort to unify these
implementations into a single portable one for each kernel,
we started to slowly switch to Kokkos when we developed and introduced the general idea and framework of the HPX-Kokkos integrations [7].

1https://github.com/kokkos/simd-math

Fig. 2. The execution model used to launch compute kernels in Octo-Tiger: We launch Kokkos kernels concurrently from arbitrary HPX tasks (and thus threads), keeping track of the results via the HPX futures HPX-Kokkos returns. Kokkos in turn splits the kernel into HPX tasks (using the HPX execution space). We use SIMD types for explicit SIMD vectorization whilst keeping the kernel compatible with GPU execution.
As mentioned in Section II, we have multiple kernels for each solver: The gravity solver contains multipole and monopole kernels (and various specializations that include angular momentum corrections (rho variant) and a specialization for the tree root sub-grid to process remaining gravity interactions not handled by sub-grids further down the tree). The hydro solver includes the reconstruct and the flux kernels.
Each kernel only operates on one sub-grid (and its ghost-
layers) at a time. This means that during the solver execution,
we traverse the tree, launching a multitude of different Kokkos
compute kernels in the process. Using the HPX-Kokkos inte-
gration, these kernels can be launched (for different sub-grids)
from multiple tasks simultaneously, with their execution status
and results being integrated into the HPX execution graph
using the futures returned by these asynchronous launches.
Depending on which device the Kokkos kernels use for execution, they will internally use either scalar types (on GPU) or types compiling down to the appropriate SIMD instructions
(for example AVX512 for Intel) – provided the kernel has been
implemented with the SIMD types. Of course, we can control
which SIMD types are being used at compile time, meaning
we can easily use scalar types on the CPU as well if we want
to gauge the speedup we gain by using SIMD.
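One common way to realize such a compile-time switch is a type alias chosen by the build configuration; the macro and alias names in the following sketch are illustrative assumptions, not Octo-Tiger's actual ones.

#include <experimental/simd>

namespace stdx = std::experimental;

// Chosen once at compile time; kernels only ever refer to device_simd_t.
// The macro name is illustrative, not the one Octo-Tiger actually uses.
#if defined(USE_SCALAR_SIMD_BACKEND) // e.g. GPU builds or CPU baseline runs
using device_simd_t = stdx::simd<double, stdx::simd_abi::scalar>;
#else                                 // CPU builds: native width (AVX512, SVE, ...)
using device_simd_t = stdx::native_simd<double>;
#endif

static_assert(device_simd_t::size() >= 1, "a SIMD width is always available");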
This model of an HPX application launching and synchro-
nizing compute kernels using HPX / HPX-Kokkos, executing
them via Kokkos on the correct device in the correct block
configuration and, finally, having them adapted to the target
device with SIMD types is exemplified in Figure 2.
IV. SOFTWARE CHANGES AND ADDITIONS FOR THIS WORK
In preparation for running Octo-Tiger on Fugaku, we recently contributed multiple changes to improve performance on CPU platforms in general, and on A64FX CPUs in particular.
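As a point of reference for the A64FX-specific work described below, the following sketch shows the kind of predicated SVE code that an SVE SIMD backend ultimately maps to; it is purely illustrative and not Octo-Tiger's actual implementation.

#include <arm_sve.h>
#include <cstddef>

// Illustrative only, not Octo-Tiger's actual SVE backend: a predicated SVE
// loop of the kind the new SIMD wrapper types compile down to. It computes
// out[i] = (a[i] > 0.0) ? a[i] * b[i] : b[i] for all i < n.
void masked_multiply(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += svcntd()) {   // svcntd(): doubles per SVE vector
        svbool_t active = svwhilelt_b64(i, n);        // mask off lanes past the array end
        svfloat64_t va = svld1_f64(active, a + i);
        svfloat64_t vb = svld1_f64(active, b + i);
        svbool_t positive = svcmpgt_f64(active, va, svdup_f64(0.0));
        // Merging multiply: lanes where 'positive' is false keep the value of vb.
        svfloat64_t result = svmul_f64_m(positive, vb, va);
        svst1_f64(active, out + i, result);
    }
}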