From Merging Frameworks to Merging Stars:
Experiences using HPX, Kokkos and SIMD Types
Gregor Daiß†, Srinivas Yadav Singanaboina, Patrick Diehl§, Hartmut Kaiser, Dirk Pflüger†
LSU Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, 70803, U.S.A.
†IPVS, University of Stuttgart, 70174 Stuttgart, Germany
Email: Gregor.Daiss@ipvs.uni-stuttgart.de
§Department of Physics and Astronomy, Louisiana State University, Baton Rouge, LA, 70803, U.S.A.
Abstract—Octo-Tiger, a large-scale 3D AMR code for the
merger of stars, uses a combination of HPX, Kokkos and
explicit SIMD types, aiming to achieve performance-portability
for a broad range of heterogeneous hardware. However, on
A64FX CPUs, we encountered several missing pieces, hindering
performance by causing problems with the SIMD vectorization.
Therefore, we add std::experimental::simd as an option to use
in Octo-Tiger’s Kokkos kernels alongside Kokkos SIMD, and
further add a new SVE (Scalable Vector Extensions) SIMD
backend. Additionally, we amend missing SIMD implementations
in the Kokkos kernels within Octo-Tiger’s hydro solver. We test
our changes by running Octo-Tiger on three different CPUs: an
A64FX, an Intel Icelake and an AMD EPYC CPU, evaluating
SIMD speedup and node-level performance. We get a good SIMD
speedup on the A64FX CPU, as well as noticeable speedups on
the other two CPU platforms. However, we also experience a
scaling issue on the EPYC CPU.
Index Terms—HPX, Kokkos, SIMD, astrophysical simulation
I. INTRODUCTION
Astrophysical applications drive the need for high-
performance computing. They require a lot of computational power for their simulations, as well as numerous developer hours to ensure they run efficiently on the ever-changing set of current hardware platforms.
Octo-Tiger is an astrophysical simulation code used to model binary star systems and their eventual outcomes [1]. Octo-
Tiger is built on HPX [2] for task-based programming to scale
to all cores within one compute node and, beyond that, to
thousands of other compute nodes in distributed scenarios.
Using this, Octo-Tiger achieved scalability on Cori [3], and
more recently Piz Daint [4] and Summit [5].
Consequently, given these last two machines, the most
recent work on Octo-Tiger was focused on porting the original
CPU-only implementation of Octo-Tiger to GPUs. To have
portable GPU kernels, much of this was done with Kokkos
and an HPX-Kokkos integration to allow us to use Kokkos
kernels as HPX tasks. To improve CPU performance of said
Kokkos kernels, we used SIMD (Single Instruction Multiple
Data) types provided by Kokkos [6]. While working well in
Octo-Tiger, this was only used in the Kokkos kernels of one
solver and only tested on older CPU platforms [7].
In turn, this work focuses not on running Octo-Tiger on GPUs, but on our development process of modifying Octo-Tiger to run efficiently on Fujitsu A64FX CPUs. This is done to prepare Octo-Tiger for experimental runs on Fugaku.
Hence, we will use the current development snapshot of
Octo-Tiger to investigate how well our utilized execution
model (a mixture of HPX, Kokkos, and explicit vectorization)
works on modern CPUs and share some additions we had
to make when porting to A64FX. Although said mixture of
frameworks already gained us a certain degree of portability,
we had to supplement some missing ingredients in this work
to make Octo-Tiger run more efficiently on A64FX CPUs and
make use of their SVE SIMD instructions:
1) To make a wider range of SIMD backends accessible, we integrate std::experimental::simd within our Kokkos kernels (while maintaining compatibility with the Kokkos SIMD types).
2) We add std::experimental::simd-compatible SVE types, allowing for explicit vectorization on A64FX CPUs.
3) Some of our Kokkos kernels simply did not yet use the
explicit vectorization with SIMD types, as they have only
been used on GPUs so far. Therefore, we supplement
those kernels (two kernels in our new hydro solver) with
these types and apply SIMD masking where required.
To test these new additions, we run node-level tests on
an A64FX CPU, determining node-level scaling and SIMD
speedup using various backends (both for the entire application
and the most important CPU kernels). We investigate the
speedup of both the new hydro SIMD implementation and
the speedup of the surrounding hydro solver in a hydro-only
compute scenario.
We further use this opportunity to run the same tests (using
different SIMD types) on recent Intel and AMD CPUs, to
evaluate the performance on current CPUs, as previous CPU
results were gathered on now-outdated hardware with fewer
cores [7]. Moreover, those previous results did not yet include
the SIMD additions to the hydro solver from this work.
This paper is structured as follows: Section II briefly in-
troduces the real-world astrophysics application: Octo-Tiger.
Section III emphasizes the technology used in Octo-Tiger
and how its components fit together to create a powerful yet
portable execution model. Section IV describes the changes
that we had to make for this to work for A64FX. Section V
lists the performance measurements conducted with the real-
world application. Section VI focuses on the related work.
Finally, Section VII concludes the work and outlines the next
steps.
Fig. 1. Flow on the surface of the two stars. In the center we see the
aggregation belt with the mass transfer from the smaller star to the larger
star.
II. SCIENTIFIC APPLICATION: OCTO-TIGER
From the astrophysics perspective, our interest is in binary star systems and their eventual outcomes, especially white dwarf mergers and the contact binary V1309 Sco. For the latter, the emission of red light during the merger was observed. To better understand this process, self-consistent simulations are necessary, with a resolution high enough to resolve the stellar atmosphere and to extract a simulated light curve. This would allow for a direct comparison with the observational data. Furthermore, with a better understanding of the light curve of V1309 Sco, it will be possible to reliably simulate the light curves of other stellar mergers.
In the following, we give a brief overview of the solvers and the data structure Octo-Tiger uses to simulate these binary star systems.
a) Solvers: The star system is modeled as a self-
gravitating, astrophysical fluid. To solve the system, Octo-
Tiger uses a coupled hydro solver and gravity solver. The in-
viscid Navier-Stokes equations are solved using finite volumes
in the hydro solver [8]. In turn, Newtonian gravity is solved
using a modified Fast Multipole Method (FMM) in the gravity
solver [9].
The most important Octo-Tiger compute kernels are part of those two solvers. In the hydro solver, there is the reconstruct kernel, which uses the piecewise-parabolic method to compute the values of the evolved variables at 26 quadrature points. Furthermore, there is the flux kernel, which uses those values to compute the final flux using Newton-Cotes quadrature. In turn, the gravity solver contains multiple compute kernels calculating the same-level interactions between close-by cells (the most compute-intensive FMM step): the Monopole kernel for non-refined cells and the Multipole kernel (with its rho variant for angular momentum correction and its root variant, a specialization for the root sub-grid). For brevity, we refer to [1] for more details
on the solvers. For more details on gravity solver kernels in
particular, we refer to [10], [11]. Lastly, for a convergence
study of a production run and more details about the hydro
kernels, we refer to [12].
b) Data-Structure: Octo-Tiger uses adaptive mesh refine-
ment (AMR) to focus on the area of interest, the atmosphere
between the two stars with the aggregation belt, see Figure 1.
For efficient computations, an adaptive octree with a Cartesian
sub-grid in each leaf is used. By default, each sub-grid contains 512 cells within an 8×8×8 cube. Hence, each compute kernel invocation operates on only one sub-grid, using HPX to launch many such kernels concurrently for the available sub-grids. The cube size of the sub-grids can be configured at compile time. The performance benefits of
various sub-grid sizes were studied in [5].
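To make this layout concrete, the following minimal C++ sketch shows how such a leaf could be represented; the class and member names (and the hard-coded default size) are illustrative assumptions, not Octo-Tiger's actual data structures.

#include <array>
#include <memory>
#include <vector>

// Illustrative only: an octree node whose leaves hold one Cartesian sub-grid.
// Octo-Tiger's real classes differ; the 8x8x8 default is taken from the text
// above and is configurable at compile time there.
constexpr int subgrid_dim = 8;
constexpr int subgrid_cells = subgrid_dim * subgrid_dim * subgrid_dim; // 512

struct subgrid_leaf {
    // One value per cell for each evolved variable (density, momenta, ...).
    std::vector<std::array<double, subgrid_cells>> variables;
};

struct octree_node {
    // Refined nodes own eight children; unrefined (leaf) nodes own a sub-grid.
    std::array<std::unique_ptr<octree_node>, 8> children;
    std::unique_ptr<subgrid_leaf> leaf; // set only if this node is a leaf
};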
III. HPX, KOKKOS AND SIMD TYPES
Octo-Tiger is built using a combination of frameworks that
make up our execution model: We use HPX for task-based
parallelization and distributed computing. We use Kokkos for
writing compute kernels that are portable between various
CPU and GPU platforms. Lastly, we use SIMD types (pro-
vided by Kokkos) for explicit vectorization on CPUs, whilst
still supporting GPU execution (via instantiation of the SIMD
template type with scalar types).
In this section, we briefly cover these utilized frameworks
and how they can fit together to complement each other.
A. HPX
HPX is an Asynchronous Many-Task Runtime system that
is implemented in C++ [2], [13]. The library implements all
APIs related to concurrency and parallelism as mandated by
recent ISO standards C++20 and C++23 in a conforming
way. In this context, it implements all the (more than 100)
parallel algorithms as described in the C++ standard. It has
been described in detail in other publications, such as [14],
[15].
In the context of this paper, HPX has been used as an
underlying runtime platform for the Octo-Tiger astrophysics
application used as the domain science driver (see [4] for
more details), thus managing local and distributed parallelism
while observing all necessary data dependencies. Data and
task dependencies can be expressed with HPX futures, and
chained together in an execution graph. This graph can be
built asynchronously, with the HPX worker threads processing
the tasks when their dependencies are fulfilled. This task-
based programming model is particularly useful for parallel
implementations of adaptive, tree-based codes like Octo-Tiger,
as we can build the task graph quickly when traversing the tree
to make concurrent work available to the system.
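As a minimal sketch of this style, the following hypothetical example chains two stand-in tasks with hpx::async and a continuation; the task functions are made up, and only HPX's standard-conforming future API is assumed (header names can vary between HPX versions).

#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>

// Hypothetical per-sub-grid work items standing in for Octo-Tiger's solver tasks.
double compute_gravity(int subgrid_id) { return 1.0 * subgrid_id; }
double compute_hydro(double gravity_result) { return 2.0 * gravity_result; }

int main() {
    // Launch the gravity step as an asynchronous HPX task ...
    hpx::future<double> gravity = hpx::async(compute_gravity, 42);

    // ... and chain the hydro step onto it; an HPX worker thread runs it
    // as soon as its dependency (the gravity future) becomes ready.
    hpx::future<double> hydro = gravity.then(
        [](hpx::future<double>&& g) { return compute_hydro(g.get()); });

    // A tree traversal can keep building such chains; here we simply wait.
    return hydro.get() > 0.0 ? 0 : 1;
}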
We also use the performance monitoring library APEX [16]
that is well integrated with HPX. APEX can be applied to
capture a combination of task-based events with hardware
counter information for optimizing HPX on different hardware
platforms. APEX can further measure the runtime of annotated HPX tasks, yielding mean execution times that can be used to determine speedups for specific parts of the code (such as compute kernels with and without SIMD).
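For illustration, a task can be wrapped so that APEX attributes its runtime to a readable name; the sketch below assumes HPX's annotated_function utility (its exact namespace and header have moved between HPX versions) and a made-up kernel stand-in.

#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>

// Made-up stand-in for a CPU compute kernel we want to time via APEX.
double flux_kernel_stand_in() { return 3.14; }

int main() {
    // Wrapping the callable attaches the name "flux_kernel" to the task, so
    // APEX can report mean execution times under that name (assumption:
    // hpx::annotated_function, which has moved between headers/namespaces
    // across HPX versions).
    auto annotated = hpx::annotated_function(flux_kernel_stand_in, "flux_kernel");
    hpx::future<double> f = hpx::async(annotated);
    return f.get() > 0.0 ? 0 : 1;
}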
B. HPX with Kokkos
For portability, we use Kokkos’ [17] abstractions for mem-
ory and execution within Octo-Tiger. Kokkos allows us to
easily write compute kernels that work on both CPU and
GPU, allowing us to run the same kernel implementation on
NVIDIA® and AMD® GPUs, and even on the CPU if desired.
Usually, these kernels would be launched in a fork-join man-
ner using Kokkos, for example using an OpenMP execution
space on the CPU to achieve concurrency. However, there exist
two HPX-Kokkos integrations, which allow us to avoid this
and use Kokkos in a more task-based fashion.
The first one is the Kokkos HPX execution space. Kernels
launched within this execution space are split into HPX tasks,
which will be processed by the existing HPX worker threads
(hence there is no need for multiple competing thread pools).
However, this alone does not provide the functionality to
launch Kokkos functions asynchronously and integrate them
within HPX’s execution graph. For that, we use the second
integration: HPX-Kokkos [7]. This integration (and its execu-
tors) allow us to obtain hpx:: futures for asynchronous Kokkos
kernel (and function) launches, facilitating their integration
with HPX’s asynchronous execution graph.
Together, these integrations allow us to write portable
Kokkos kernels that we can launch from arbitrary HPX tasks
and treat as HPX tasks themselves. This way, we can express
dependencies between kernels (and other tasks) using HPX
futures, automatically triggering new tasks when an asyn-
chronously launched kernel finishes. Furthermore, this way the
Kokkos kernels make use of HPX resources for CPU execution
(namely the existing worker threads).
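The following sketch shows the intended usage pattern; the HPX-Kokkos entry point (parallel_for_async) and header name are reproduced from memory and may differ between versions, and the kernel itself is a made-up stand-in rather than one of Octo-Tiger's.

#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <hpx/kokkos.hpp>   // HPX-Kokkos integration; header name is an assumption
#include <Kokkos_Core.hpp>

int main() {
    Kokkos::initialize();
    {
        Kokkos::View<double*> data("data", 512);

        // Launch a Kokkos kernel and receive an HPX future for its completion
        // (parallel_for_async as provided by HPX-Kokkos [7]; the exact
        // signature may differ between versions).
        auto done = hpx::kokkos::parallel_for_async(
            Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(0, 512),
            KOKKOS_LAMBDA(int i) { data(i) = 2.0 * i; });

        // Chain follow-up work onto the kernel via the returned future.
        auto next = done.then([](auto&&) { /* e.g. launch the next kernel */ });
        next.get();
    }
    Kokkos::finalize();
    return 0;
}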
C. Explicit Vectorization with the Kokkos SIMD types
To gain truly portable kernels that run well on both the
CPU and GPU, we need to take SIMD vectorization into
account. Modern CPUs derive a large share of their potential floating-point performance from their ability to run instructions on multiple data items at once (SIMD). While the compiler
can use this automatically in certain circumstances, it is often
more reliable to use the appropriate instructions directly to
ensure SIMD usage. Since different hardware platforms often
use different instruction sets (AVX, AVX512, NEON and SVE
to name some examples), abstractions using C++ types have
been developed to increase portability and ease-of-use. Kokkos
itself includes such SIMD types1, notably capable of using
vectorization on the CPU by compiling to the correct SIMD
instructions, but also able to use scalar types in case the same
compute kernel is compiled for GPU usage [6].
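The general pattern looks roughly as follows; this sketch uses std::experimental::simd (which we integrate in Section IV) and a made-up kernel, not one of Octo-Tiger's, and assumes the array length is a multiple of the SIMD width.

#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Kernel body templated on the vector type: instantiating it with
// stdx::native_simd<double> vectorizes on the CPU (AVX512, SVE, ...), while
// a scalar instantiation (stdx::simd<double, stdx::simd_abi::scalar>) keeps
// the identical code usable when the kernel is compiled for the GPU.
template <typename simd_t>
void scale_kernel(const double* in, double* out, std::size_t n, double factor) {
    constexpr std::size_t width = simd_t::size();
    for (std::size_t i = 0; i < n; i += width) {
        simd_t v;
        v.copy_from(in + i, stdx::element_aligned); // load 'width' elements
        v = v * simd_t(factor);                     // one operation per lane group
        v.copy_to(out + i, stdx::element_aligned);  // store them back
    }
}

// CPU instantiation with the platform's native SIMD width.
template void scale_kernel<stdx::native_simd<double>>(const double*, double*,
                                                      std::size_t, double);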
D. Utilizing the Frameworks within Octo-Tiger
Octo-Tiger was built to rely on HPX for parallelization
and distributed computing, with each sub-grid in the oc-
tree representing an HPX component which can be placed
on arbitrary compute nodes. However, its compute kernels
changed significantly over the last two years. In the past,
Octo-Tiger used separate compute kernels for CPU execution
(using Vc for SIMD types) and for the GPU execution (using
NVIDIA® CUDA®) [4], [11]. In an effort to unify these
implementations into a single portable one for each kernel,
we started to slowly switch to Kokkos when we developed and introduced the general idea and framework of the HPX-Kokkos integrations [7].

1https://github.com/kokkos/simd-math

Fig. 2. The execution model used to launch compute kernels in Octo-Tiger: We launch Kokkos kernels concurrently from arbitrary HPX tasks (and thus threads), keeping track of the results via the HPX futures HPX-Kokkos returns. Kokkos in turn splits the kernel into HPX tasks (using the HPX execution space). We use SIMD types for explicit SIMD vectorization whilst keeping the kernel compatible with GPU execution.
As mentioned in Section II, we have multiple kernels for each solver: The gravity solver contains multipole and monopole kernels (and various specializations that include angular momentum corrections (rho variant) and a specialization for the tree root sub-grid to process remaining gravity interactions not handled by sub-grids further down the tree). The hydro solver includes the reconstruct and the flux kernels.
Each kernel only operates on one sub-grid (and its ghost-
layers) at a time. This means that during the solver execution,
we traverse the tree, launching a multitude of different Kokkos
compute kernels in the process. Using the HPX-Kokkos inte-
gration, these kernels can be launched (for different sub-grids)
from multiple tasks simultaneously, with their execution status
and results being integrated into the HPX execution graph
using the futures returned by these asynchronous launches.
Depending on which device the Kokkos kernels use for execution, they will internally use either scalar types (on GPU) or types compiling down to the appropriate SIMD instructions
(for example AVX512 for Intel) – provided the kernel has been
implemented with the SIMD types. Of course, we can control
which SIMD types are being used at compile time, meaning
we can easily use scalar types on the CPU as well if we want
to gauge the speedup we gain by using SIMD.
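One common way to realize such a compile-time switch is a type alias chosen by the build configuration; the macro and alias names in the following sketch are illustrative assumptions, not Octo-Tiger's actual ones.

#include <experimental/simd>

namespace stdx = std::experimental;

// Chosen once at compile time; kernels only ever refer to device_simd_t.
// The macro name is illustrative, not the one Octo-Tiger actually uses.
#if defined(USE_SCALAR_SIMD_BACKEND) // e.g. GPU builds or CPU baseline runs
using device_simd_t = stdx::simd<double, stdx::simd_abi::scalar>;
#else                                 // CPU builds: native width (AVX512, SVE, ...)
using device_simd_t = stdx::native_simd<double>;
#endif

static_assert(device_simd_t::size() >= 1, "a SIMD width is always available");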
This model of an HPX application launching and synchro-
nizing compute kernels using HPX / HPX-Kokkos, executing
them via Kokkos on the correct device in the correct block
configuration and, finally, having them adapted to the target
device with SIMD types is exemplified in Figure 2.
IV. SOFTWARE CHANGES AND ADDITIONS FOR THIS WORK
In preparation for running Octo-Tiger on Fugaku, we recently contributed multiple changes to improve performance on CPU platforms in general, and on A64FX CPUs in particular.
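As a point of reference for the A64FX-specific work described below, the following sketch shows the kind of predicated SVE code that an SVE SIMD backend ultimately maps to; it is purely illustrative and not Octo-Tiger's actual implementation.

#include <arm_sve.h>
#include <cstddef>

// Illustrative only, not Octo-Tiger's actual SVE backend: a predicated SVE
// loop of the kind the new SIMD wrapper types compile down to. It computes
// out[i] = (a[i] > 0.0) ? a[i] * b[i] : b[i] for all i < n.
void masked_multiply(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += svcntd()) {   // svcntd(): doubles per SVE vector
        svbool_t active = svwhilelt_b64(i, n);        // mask off lanes past the array end
        svfloat64_t va = svld1_f64(active, a + i);
        svfloat64_t vb = svld1_f64(active, b + i);
        svbool_t positive = svcmpgt_f64(active, va, svdup_f64(0.0));
        // Merging multiply: lanes where 'positive' is false keep the value of vb.
        svfloat64_t result = svmul_f64_m(positive, vb, va);
        svst1_f64(active, out + i, result);
    }
}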