
From Task-Based GPU Work Aggregation to Stellar
Mergers: Turning Fine-Grained CPU Tasks into
Portable GPU Kernels
Gregor Daiß, Patrick Diehl§, Dominic Marcello, Alireza Kheirkhahan, Hartmut Kaiser, Dirk Pflüger
LSU Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, 70803 U.S.A.
IPVS, University of Stuttgart, 70174 Stuttgart, Germany
Email: Gregor.Daiss@ipvs.uni-stuttgart.de
§Department of Physics and Astronomy, Louisiana State University, Baton Rouge, LA, 70803 U.S.A.
Abstract—Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers,
we approach this with existing solutions: We employ HPX to
obtain fine-grained tasks to easily distribute work and finely
overlap communication and computation. For the computations
themselves, we use Kokkos to turn these tasks into compute
kernels capable of running on hardware ranging from a few
CPU cores to powerful accelerators. There is a missing link,
however: while the fine-grained parallelism exposed by HPX
is useful for scalability, it can hinder GPU performance when
the tasks become too small to saturate the device, causing low
resource utilization. To bridge this gap, we investigate multiple
different GPU work aggregation strategies within Octo-Tiger,
adding one new strategy, and evaluate the node-level performance
impact on recent AMD and NVIDIA GPUs, achieving noticeable
speedups.
Index Terms—HPX, HIP, CUDA, Kokkos, Work Aggregation,
Performance Portability, Task-based Programming.
I. INTRODUCTION
Currently, developers of High-Performance-Computing (HPC) applications are faced with a diverse set of supercomputers: Machines like Fugaku contain a massive number of compute nodes (158,976), though each node is comparatively weak with just 48 CPU cores. In contrast, machines like Perlmutter have far fewer compute nodes (1,536), but each node is much more powerful, containing four NVIDIA® A100 GPUs. This heterogeneity will be further increased by future machines like Frontier (using AMD® GPUs) and Aurora (using Intel® GPUs), highlighting the importance of both scalability and performance-portability for any HPC application.
One such application is Octo-Tiger, a C++ astrophysics code used to study stellar mergers [1]. It is built upon HPX [2], [3], a distributed asynchronous many-task runtime system (AMT). HPX allows us to distribute work as fine-grained tasks, easily overlapping computations and communication by defining their dependencies and letting the runtime system handle the concurrency.
However, the very fine-grained tasks that help to scale
Octo-Tiger to thousands of nodes cause difficulties regarding
performance-portability when running the application on GPU
platforms. On the one hand, the fine-grained tasks (usually
suited for one CPU core) allow us to express the maximum
amount of parallelism with HPX, to best enable the overlap of
computation and communications for widely distributed runs,
and to more easily handle small tasks/workloads occurring due
to the simulation’s adaptive mesh refinement. On the other
hand, to properly utilize GPUs, we need enough work items
per GPU kernel to both scale to all compute units (for instance
120 on an MI100 GPU) and to have sufficient resident work
items per compute unit to properly hide latencies. In short,
we would like to have large enough tasks to turn into GPU
kernels that do not starve the device. Faced with the conflict
between both fine-grained and large tasks, it is not enough to
merely have compute kernels that are capable of running both
on CPUs and GPUs. To achieve optimal performance, it is instead necessary to dynamically adjust the workload per task depending on which device it runs on.
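As a rough illustration (assuming, say, four resident wavefronts per compute unit to hide latency, a generic figure rather than one measured in Octo-Tiger, and the 64-wide wavefronts of AMD's CDNA architecture), an MI100 would want roughly

    120 CUs × 4 wavefronts/CU × 64 work items/wavefront = 30,720 work items

in flight at once, far more than a task sized for a single CPU core typically provides.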
In this work, we investigate and compare three work aggregation strategies to increase the workload per compute
kernel within Octo-Tiger, turning fine-grained tasks designed
for distributed CPU scenarios into a good match for GPUs.
Firstly, one conventional approach is to simply subdivide
the overall discretization of an application into larger sub-
problems when building for GPUs than we would for a CPU
build, hence increasing the workload per kernel. In the context
of this work, we will refer to this approach as "strategy 1".
Secondly, one can use the ability of GPUs to run multiple
independent kernels concurrently, hence relying on the GPU’s
runtime to implicitly aggregate the kernels on the device
side to avoid low resource utilization ("strategy 2"). This
approach heavily depends on the abilities of the GPU runtime
to handle large amounts of small kernels, as well as the
ability of the application itself to launch said kernels with
little overhead. Lastly, instead of relying on the GPU runtime
to run independent kernels in parallel as for strategy 2, we
can use explicit work aggregation. Thus, we aim to combine similar but independent kernels on-the-fly into one larger compute kernel whenever the GPU is starved ("strategy 3").
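To make the idea behind strategy 3 concrete, the following standalone C++ sketch buffers small work items and launches them as one combined "kernel". All names here are hypothetical, and the actual implementation introduced later aggregates based on whether the GPU is currently busy rather than on a fixed buffer size:

    #include <cstdio>
    #include <functional>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in for an aggregation executor: fine-grained tasks enqueue
    // their small workloads here instead of each launching its own tiny GPU kernel.
    class aggregation_buffer {
    public:
        explicit aggregation_buffer(std::size_t max_slots) : max_slots_(max_slots) {}

        void enqueue(std::function<void()> work_item) {
            slots_.push_back(std::move(work_item));
            if (slots_.size() == max_slots_)  // enough work collected: launch combined kernel
                flush();
        }

        // Launches all buffered work items as a single, larger "kernel" (here: a plain loop).
        void flush() {
            if (slots_.empty()) return;
            std::printf("aggregated kernel with %zu work items\n", slots_.size());
            for (auto& item : slots_) item();
            slots_.clear();
        }

    private:
        std::size_t max_slots_;
        std::vector<std::function<void()>> slots_;
    };

    int main() {
        aggregation_buffer buf(4);  // combine up to 4 task workloads per launch
        for (int task = 0; task < 10; ++task)
            buf.enqueue([task] { std::printf("  work item of task %d\n", task); });
        buf.flush();  // drain the remainder once no further tasks arrive
    }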
Past work porting HPX-based codes such as Octo-Tiger
to GPU platforms relied on the second strategy: The GPU implementation of Octo-Tiger was first introduced in [4] for Octo-Tiger's gravity solver. It combines multiple NVIDIA® CUDA® streams with an HPX-CUDA integration treating CUDA kernels as HPX tasks, achieving
good performance.
In turn, in [5] a similar CUDA implementation of Octo-
Tiger’s second major module, the hydrodynamics solver, was
added. It used the same work aggregation strategy as the gravity solver; however, we showed that increasing the work size per kernel launch (by subdividing the grid into larger-than-usual sub-grids, reminiscent of strategy 1) yielded a further node-level speedup for this solver. This indicates that the workload per compute kernel is still problematic in the new hydro solver implementation when just using strategy 2. Increasing the sub-grid size using strategy 1 furthermore came at the expense of scalability and adaptive refinement, hence motivating us to look for alternative work aggregation strategies beyond these first two.
In this work, we improve upon the state-of-the-art and
introduce an implementation for the explicit work aggregation
strategy (strategy 3), building on HPX and its accelerator
support. This allows us to launch GPU kernels through a
special executor that enables the aggregation of kernels into
larger kernels on-the-fly.
Additionally, we are moving from CUDA to Kokkos, providing performance portability across different systems [6]. The existing HPX-Kokkos integration layer allows us to treat Kokkos kernels as HPX tasks and lets HPX worker threads execute Kokkos kernels, effectively moving away from the fork-join model [7]. This has already been used for an implementation of the gravity solver. To achieve strategy 3 with the help of both HPX and Kokkos, we extend the hydrodynamics solver with a similar Kokkos implementation in this work. For a fair comparison on AMD GPUs, we also added a HIP version by simply reusing the CUDA kernels with the appropriate HIP API calls.
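As an illustration of the pattern only (this is not Octo-Tiger's actual hydro kernel, merely a minimal sketch), a Kokkos kernel serving such a task could look as follows; the same source compiles against a Serial or OpenMP backend for CPUs or a CUDA or HIP backend for GPUs:

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::ScopeGuard guard(argc, argv);  // initializes and finalizes the chosen backend

        const int n_cells = 512;  // e.g. the cells of one small sub-grid
        Kokkos::View<double*> rho("density", n_cells);
        Kokkos::View<double*> flux("flux", n_cells);

        // One portable kernel: the lambda body runs on whatever device the backend targets.
        Kokkos::parallel_for("update_cells", n_cells, KOKKOS_LAMBDA(const int i) {
            rho(i) += 0.5 * flux(i);  // placeholder cell update
        });
        Kokkos::fence();  // wait for the (possibly asynchronous) kernel to finish
        return 0;
    }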
This means that in this work we compare three different
kernel implementations of the hydrodynamics solver: CUDA,
HIP, and Kokkos. For each implementation, we look at results
from all three aforementioned GPU work aggregation strategies. Testing Octo-Tiger's node-level performance on both an A100 and an MI100 GPU, we show that a combination of strategies is vastly superior to the current status quo of using just a single one: We obtain clear speedups on both devices for Octo-Tiger. We further show that the strategies exhibit different performance behavior depending on the GPU vendor, highlighting the need to have alternative strategies at hand for performance-portability if work aggregation is needed.
Overall, this work has three main contributions: 1) The
novel on-the-fly work aggregation executor (implementing
strategy 3), 2) the implementation of the hydrodynamics
module in Kokkos, and 3) a thorough comparison of the
new work aggregation strategy with the two existing ones
using both the new Kokkos hydro kernels and their previous
CUDA counterparts. For Octo-Tiger itself, our contributions
lead to a significant speedup. Beyond Octo-Tiger, both the new
aggregation executor and the insights gained by comparing
different GPU work aggregation strategies can be used to find
optimal strategies in other HPX applications.
The remainder of this paper is structured as follows:
Section II reviews related work on task-based programming frameworks with GPU support. Section III introduces HPX, Section IV the scientific application Octo-Tiger. The three work
aggregation strategies are introduced in Section V, followed
by results and their extensive comparison on different systems
in Section VI.
II. RELATED WORK
In this section, we restrict the overview to AMTs with
accelerator support, namely CUDA, HIP, and Kokkos [8].
For a more general overview of AMTs, we refer to [9].
Table I summarizes the accelerator support. All of the AMTs with accelerator support target NVIDIA GPUs using CUDA [4], [10]–[14]. Support for AMD GPUs via HIP is provided solely by HPX. Uintah supports AMD GPUs via the Kokkos backend [15]. In addition to CUDA and HIP, HPX provides Kokkos support [7]. Most AMTs support accelerators nowadays. Let us now look at support for work aggregation. For Legion, aggregation of the memory bandwidth of multiple GPUs was shown for graph processing [16]. In addition, a novel dynamic load-balancing strategy that is cheap and achieves good load balance across GPUs is presented there. For Chapel, a GPUIterator [17], which supports
hybrid execution of parallel loops across CPUs and GPUs, is
available. However, these solutions are unlike our work, as
we use a bottom-up approach, aggregating small HPX tasks
on-the-fly into larger GPU kernels.
III. C++ STANDARD LIBRARY FOR PARALLELISM AND
CONCURRENCY
One asynchronous many-task runtime system with distributed capabilities is the C++ standard library for parallelism and concurrency, HPX [2]. One major difference of HPX from other AMTs is that HPX's API fully conforms to the recent and upcoming C++ standards [27]–[30]. Note that other AMTs are also written in the C++ programming language, but HPX's API follows the C++ standard's definitions for asynchronous programming and the parallel algorithms. We refer to [2], [31]–[33] for more details about HPX. In this paper, we use HPX for the following two purposes: 1) the coordination of the asynchronous execution of a multitude of heterogeneous tasks (both on CPUs and GPUs), thus managing local and distributed parallelism while observing all necessary data dependencies; and 2) as the parallelization infrastructure for launching HIP/CUDA kernels on the GPUs via the asynchronous HPX backend.
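As a minimal sketch of this futurized style (the tasks shown are hypothetical; in Octo-Tiger, completions of GPU kernels are exposed as HPX futures in the same way), assuming a recent HPX installation:

    #include <hpx/hpx_main.hpp>  // lets HPX bootstrap around main()
    #include <hpx/future.hpp>

    #include <iostream>

    int main() {
        // A fine-grained task, e.g. preparing boundary data for one sub-grid.
        hpx::future<double> boundary = hpx::async([] { return 1.0; });

        // A dependent task: scheduled by the runtime only once 'boundary' is ready,
        // so other ready tasks can run (and overlap communication) in the meantime.
        hpx::future<double> result = boundary.then(
            [](hpx::future<double> f) { return 2.0 * f.get(); });

        std::cout << "result: " << result.get() << "\n";  // prints 2
        return 0;
    }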
IV. OCTO-TIGER
As a prototypical adaptive mesh refinement (AMR) code
with non-trivial physics, we consider Octo-Tiger, an astrophysics code modelling stellar mergers [1].

TABLE I
ACCELERATOR SUPPORT FOR VARIOUS AMTS. WE RESTRICTED OURSELVES TO AMTS WITH SUPPORT FOR NVIDIA AND AMD GPUS.
HPX [2] Chapel [18] Charm++ Legion [19] Uintah [20] ParSec [21] StarPU [22] X10 [23] UPC++ [24]
CUDA X [4] X [11] X [25] X X [12] X [26] X [10] X [13] X
HIP X [7] X X
Kokkos X [7] X [15]

Octo-Tiger uses
a fast-multipole method (FMM) to solve for gravity [34]. The
implemented FMM globally conserves both linear and angular
momenta up to machine precision. To model and discretize
the hydrodynamics components, a finite volume method using
AMR is employed. Coupling the FMM with the hydro solver
allows global conservation of energy and linear momentum up
to machine precision, a major strength of Octo-Tiger.
A. Scientific Application and Previous Results
Octo-Tiger is designed to model interacting binary star
systems. A binary star system consists of two stars, bound
to one another by gravity. When they are close enough
together, they interact by exchanging mass. Sometimes this
mass transfer is stable and long-lived over millions of years.
Sometimes it is unstable, leading to a catastrophic disruption
of one of the binary’s components. When this happens and
if the system is massive enough, a Type Ia supernova results.
Less massive systems result in the merger of the disrupted
star with its companion, leading to the formation of another
star. The helium-rich R Coronae Borealis stars are thought to
originate from a merger of two white dwarfs.
Octo-Tiger models such systems as self-gravitating fluids, governed by the laws of hydrodynamics and Newtonian
gravity. The code has been used to investigate the origins
of R Coronae Borealis stars (e.g. [35], [36]), the merger
of bipolytropic stars [37], and the possibility that the star
Betelgeuse is the outcome of a merger [38]. Presently, Octo-Tiger is used for the investigation of merging double white dwarfs as well as the merger of a contact binary, V1309 Sco.
B. Hydro Solver
Octo-Tiger solves the inviscid Euler equations. This set
of hyperbolic differential equations governs the conservation
of mass, momentum, and energy as a fluid evolves. Octo-
Tiger is a grid-based code, using Cartesian adaptive mesh refinement to discretize the fluid variables. Octo-Tiger uses the piecewise-parabolic method [39] to compute the values of the evolved variables at 26 quadrature points on the surface of each computational cell: one at the center of each of the six cell faces and twelve cell edges, and one at each of the eight cell vertices. With
the reconstructed variables, the fluxes are computed at these
points using the central upwind method as described by [40].
They are integrated using Newton-Cotes quadrature to obtain
the total flux through a cell face. The maximum allowed time-
step size is related to the “Courant condition”: The time-step
size has to be at most the minimum time it takes a signal to
cross a computational cell’s width. Exceeding this time-step
size results in errors in the solution that grow rapidly with time.
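In a standard form (the exact signal speed and safety factor used by Octo-Tiger may differ), this condition can be written as

    \Delta t \le C_{\mathrm{CFL}} \, \min_{\text{cells}} \frac{\Delta x}{|u| + c_s},

where \Delta x is the cell width, |u| the local fluid speed, c_s the sound speed, and C_{\mathrm{CFL}} \le 1 a safety factor.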
(Figure 1 panels: (a) a single GPU stream; (b) multiple GPU streams; (c) the aggregation executor, built on CPPuddle, feeding an underlying GPU stream.)
Fig. 1. Aggregation strategies: (a) Larger sub-problems: Increasing sub-grid
size, (b) Implicit work aggregation: Interleaving independent GPU kernels by
using multiple GPU executors (streams), and (c) Explicit work aggregation:
Marking compatible tasks (blue) which might be aggregated into a larger kernel if the hardware is currently busy.
If we double the resolution of the model without altering the
model’s size, this signal crossing time will be roughly cut in
half, reducing the allowed time-step size by the same factor.
The AMR feature of Octo-Tiger is designed to refine around
interesting areas of the binary. One level of refinement is
assigned to each component as a whole, allowing a smaller
component to be modelled with higher resolution than its
companion. The cores of stars with core/envelope structures
can be given additional levels of refinement. Recently, the use of gradient-based refinement has also been investigated to additionally refine the stars' atmospheres.
C. Previous Scalability/Performance
Performance on up to 5400 GPUs and 64,800 cores on
CSCS’s Piz Daint was shown in [4]. Performance on up to
658,784 Intel Knight’s Landing cores with a parallel efficiency
of 96.8% using billions of asynchronous tasks was demon-
strated in [41] on NERSC’s Cori. Performance on ORNLs
Summit was shown in [5].
V. AGGREGATION STRATEGIES AND GPU
IMPLEMENTATION DETAILS
In this section, we first give some details about the implementation of the hydro solver, especially regarding the workload per GPU kernel. As these numbers motivate our need for larger numbers of GPU work items, we continue with the introduction of three strategies for increasing the size of the GPU kernels. For each of the strategies, we first introduce the high-level idea, mention the strategy's requirements, then briefly talk about the implementation details and their respective benefits and challenges. While the first two strategies have already been used with Octo-Tiger, the last strategy (strategy 3) is a novel addition of this work.