Distributed, combined CPU and GPU profiling
within HPX using APEX
Patrick Diehl§, Gregor Daiß, Kevin Huck, Dominic Marcello, Sagiv Shiber§, Hartmut Kaiser,
Juhan Frank§, Geoffrey C. Clayton§, and Dirk Pflüger
LSU Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, 70803 U.S.A
Email: patrickdiehl@lsu.edu
IPVS, University of Stuttgart, Stuttgart, 70174 Stuttgart, Germany
OACISS, University of Oregon, Eugene, OR, U.S.A.
§Department of Physics and Astronomy, Louisiana State University, Baton Rouge, LA, 70803 U.S.A.
Abstract—Benchmarking and comparing performance of a
scientific simulation across hardware platforms is a complex task.
When the simulation in question is constructed with an asyn-
chronous, many-task (AMT) runtime offloading work to GPUs,
the task becomes even more complex. In this paper, we discuss
the use of a uniquely suited performance measurement library,
APEX, to capture the performance behavior of a simulation built
on HPX, a highly scalable, distributed AMT runtime. We examine
the performance of the astrophysics simulation carried-out by
Octo-Tiger on two different supercomputing architectures. We
analyze the results of scaling and measurement overheads. In
addition, we look in-depth at two similarly configured executions
on the two systems to study how architectural differences affect
performance and identify opportunities for optimization. As one
such opportunity, we optimize the communication for the hydro
solver and investigate its performance impact.
Index Terms—CUDA, HPX, Performance Measurements
I. INTRODUCTION
Whether in CPU code, in GPU kernels, or in the inter-
node communication – performance bottlenecks in High-
Performance Computing (HPC) applications may be hidden
in any part of the program. There have been many attempts
to ease the development of HPC applications and to identify
and avoid as many bottlenecks as possible. One example is
the asynchronous many-task (AMT) system HPX [1], which
aims to solve some of the more common problems. It makes it
easier to overlap communication and computation, to employ
both CPUs and GPUs, and to avoid overhead for fine-grained
parallelism using lightweight threading.
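As a heavily simplified illustration of this programming model, the sketch below uses HPX futures to overlap a stand-in for communication with independent local computation. It is a minimal sketch rather than Octo-Tiger code: the functions fetch_halo and local_work are hypothetical placeholders, and the header paths follow recent HPX releases and may differ for older versions.

```cpp
// Minimal sketch (not Octo-Tiger code): overlapping "communication" and
// computation with HPX futures and lightweight tasks. Header paths follow
// recent HPX releases and may differ for older versions.
#include <hpx/hpx_init.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

#include <iostream>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a remote data transfer (e.g., a halo exchange).
std::vector<double> fetch_halo()
{
    return std::vector<double>(1024, 1.0);
}

// Hypothetical stand-in for local computation that does not need the halo.
double local_work()
{
    std::vector<double> v(1 << 20, 0.5);
    return std::accumulate(v.begin(), v.end(), 0.0);
}

int hpx_main()
{
    // Launch the "communication" as a lightweight HPX task ...
    hpx::future<std::vector<double>> halo = hpx::async(fetch_halo);

    // ... and keep the worker threads busy with independent computation.
    double partial = local_work();

    // Attach a continuation: it runs as another lightweight task as soon as
    // the halo data is ready; no OS thread blocks while waiting.
    hpx::future<double> result = halo.then(
        [partial](hpx::future<std::vector<double>>&& f) {
            std::vector<double> data = f.get();
            return partial + std::accumulate(data.begin(), data.end(), 0.0);
        });

    std::cout << "result: " << result.get() << std::endl;
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    // Starts the HPX runtime, which then calls hpx_main().
    return hpx::init(argc, argv);
}
```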
Even with HPX, of course, it is still perfectly possible
to introduce performance bottlenecks into one’s application.
Being able to collect performance measurements to profile an
HPX application remains important. For AMT systems such as
HPX, it is beneficial to have a profiling tool that understands
the task-based nature of the runtime system – for example,
call stacks themselves are less useful: A system thread may
jump back and forth between various HPX tasks as they are
yielded and resumed, and the call stack itself may be dozens
of levels of runtime functions that are of no particular interest
to the application developer.
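One common way to recover meaningful names in such a setting is to annotate tasks at the application level. The sketch below is a minimal, hypothetical example (not taken from Octo-Tiger) using HPX's annotated_function wrapper, which attaches a label to a task so that a task-aware profiler such as APEX can report time per labeled task rather than per runtime stack frame; the exact header and namespace of the wrapper may differ between HPX versions.

```cpp
// Minimal sketch (not Octo-Tiger code): labeling HPX tasks so that a
// task-aware profiler such as APEX can attribute time to application-level
// tasks instead of opaque runtime call stacks. The solver lambdas are
// hypothetical placeholders; older HPX releases expose the wrapper as
// hpx::util::annotated_function and may use different header paths.
#include <hpx/hpx_init.hpp>
#include <hpx/include/async.hpp>
#include <hpx/functional.hpp>

#include <iostream>

int hpx_main()
{
    // Each task carries a human-readable label that shows up in the
    // recorded profile or trace instead of anonymous runtime frames.
    hpx::future<int> grav = hpx::async(hpx::annotated_function(
        [] { /* gravity solver work would run here */ return 1; },
        "gravity_solver"));

    hpx::future<int> hydro = hpx::async(hpx::annotated_function(
        [] { /* hydro solver work would run here */ return 2; },
        "hydro_solver"));

    std::cout << grav.get() + hydro.get() << std::endl;
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```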
Collecting these measurements in a profiling run is challenging, as one not only needs a profiling framework that supports
both CPU and GPU code as well as the distributed collection
of profiling data across many compute nodes, but one also needs
to keep any overheads introduced by the profiling itself to an
absolute minimum. Otherwise, it would not only distort the
collected measurements, but also make large, distributed runs
infeasible, rendering us unable to detect potential performance
bottlenecks that only appear at scale.
HPX is integrated with a performance measurement library,
APEX (Automatic Performance for Exascale), which was
designed specifically for the HPX runtime and the above
requirements. In previous work, APEX was used together
with the HPX performance counters to collect performance
measurements for Octo-Tiger [2], an astrophysics application
which is built upon HPX and contains optimized kernels for
both CPUs and GPUs [3], [4]. Octo-Tiger is capable of running
the same kernels on the CPUs and GPUs simultaneously (on
different data) depending on the load. In that work, profiling
data of Octo-Tiger was gathered with APEX in distributed
CPU-only runs where the energy usage, the idle rate, and
overhead of the HPX AGAS (Active Global Address Space)
was analyzed [5]. Furthermore, combined CPU-GPU profiling
runs have been performed on Summit, analyzing the perfor-
mance behavior of Octo-Tiger’s new CUDA hydro module
in different configurations for simple benchmark scenarios [6].
All these previous efforts inspire this new work, collecting
performance measurements on both CPU and GPU during a
full-scale production-scenario run on Piz Daint (Cray XC50
with one 12-core Intel® Xeon® E5-2690 + one NVIDIA®
Tesla® P100 per node) [7] and Summit (IBM® AC922 with
two 22-core Power9 + six NVIDIA® Tesla® V100 per node)
[8].
The purpose of this work is thus twofold: First, we collect
data that we can actually use to improve Octo-Tiger by identi-
fying potential bottlenecks. To do so, we collect measurements
running the production scenario for 40 time-steps both on
Summit and Piz Daint, using 48 compute nodes on Piz Daint
and 8 compute nodes on Summit (resulting in 48 GPUs in
either case). With those measurements, we can investigate the
specific parts of Octo-Tiger on two distinct architectures in a
distributed CPU/GPU run, providing insights into the different
runtime behavior of Octo-Tiger regarding GPU performance,
CPU performance, and communication. For example, the
communication seems to have a larger overhead on Piz Daint.
Second, we are showcasing the feasibility of APEX for
large-scale runs, collecting combined CPU and GPU perfor-
mance measurements, showing that the overhead introduced by
the profiling itself is small enough to handle large production-
scale scenarios. To this end, we are running the scenario both
with and without APEX profiling enabled for a scaling run
on each machine, to see both the overhead on a few compute
nodes, and the runtime behavior when scaling to more nodes
(with up to 2000 compute nodes on Piz Daint). Furthermore,
we repeat these overhead measurements on Piz Daint for a
CPU-only run to determine the performance impact of the
NVIDIA® CUDA Profiling Tools Interface (CUPTI), which
is used to collect the GPU performance data.
To highlight the need for low profiling overhead, we can
look at the short runtimes for each time-step of Octo-Tiger:
During the test runs on Summit, we gathered 5 GB of data across
all eight runs. For each run, 40 time-steps were executed.
Each time-step takes about 0.72 s on 128 Summit nodes and
consists of 6 iterations of the gravity solver, 3 iterations of the
hydro solver, and all required communication. On Piz Daint
we collected 55 GB of data in total. Here, each time-step
on 2000 nodes took 0.79 s. As time-steps are serial in nature,
these iterations are our smallest parallel unit. As each time-step
only runs for a few hundred milliseconds, overheads
introduced by the profiling can be very noticeable even if they
only amount to a few milliseconds in total.
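To put such numbers into perspective, the relative per-time-step profiling overhead can be expressed as the increase of the profiled step time over the unprofiled baseline; the values in the example below are purely illustrative and are not measurements from this work:
\[
\text{overhead} \;=\; \frac{t_{\text{profiled}} - t_{\text{baseline}}}{t_{\text{baseline}}},
\qquad \text{e.g.,} \quad \frac{0.74\,\text{s} - 0.72\,\text{s}}{0.72\,\text{s}} \approx 2.8\%.
\]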
The remainder of this work is structured as follows: In
Section II, we take a brief look at profiling solutions in other
AMT frameworks. We then introduce the scientific scenario
which we are simulating with Octo-Tiger in Section III.
This is the scenario we also used to collect the profiling
data by running it for 40 timesteps with APEX enabled.
Section IV in turn introduces Octo-Tiger itself, as well as the
utilized software stack. In Section V, we show and discuss the
collection of the performance measurements for Octo-Tiger. In
Section VI, we test communication optimization and analyze
the performance improvement. Finally, we conclude the paper
in Section VII.
II. RELATED WORK
For the related work, we focus on AMTs with distributed
capabilities which are: Legion [9], Charm++ [10], Chapel [11],
and UPC++ [12]. For a more detailed review, we refer to [13].
Legion [9] provides Legion Prof for combined CPU and
GPU profiling which is compiled into all builds. Enabling
the profiler produces log files which can be viewed using
the profiler. Charm++ [10] provides Charm debug [14] and
the Projections framework [15] for performance analysis
and visualization. Chapel [11] provides ChplBlamer [16]
for profiling. UPC++ does not seem to provide a dedicated
profiling tool; its documentation recommends using any
profiler that supports C++.
Like HPX, nearly all of these runtimes provide a special-
ized tool that has been designed to deal with the particular
challenges of AMTs in general, and the needs of the runtime
system in particular. APEX is the specialized tool in the case of
HPX and provides measurement and analysis capabilities
similar to those of the above tools, including flat profiling, tracing, sampling,
task graphs/trees, and concurrency graphs. In addition, APEX
provides support for several programming models/abstractions
with or without HPX, including CUDA, HIP, OpenMP,
OpenACC, Kokkos, POSIX threads, and C++ threads. APEX
does not provide analysis tools directly, but rather uses com-
monly accepted formats and targets both HPC performance
analysis tools (ParaProf [17], Vampir [18], Perfetto [19]) and
standard data analysis tools (Python, Graphviz [20]).
III. SCIENTIFIC APPLICATION
Stellar mergers are mysterious phenomena that pack a broad
range of physical processes into a small volume and a fleeting
time duration. With the proliferation of deep wide-field, time-
domain surveys, we have been catching on camera a vastly
increased number of outbursts, many of which have been
interpreted as stellar mergers. The best case, so far, of an ob-
served merger is V1309 Sco, a contact binary identified using
a recent survey database, the Optical Gravitational Lensing
Experiment [21]. Fortunately, not only was the merger itself
observed, but archival data from other observing programs
also enabled the reconstruction of the light curve years before
the merger. During the merger itself, the system brightness
increased by 4 magnitudes, with a peak luminosity in the
red visible light [22]. This complete record of observations
has led V1309 Sco to be termed the “Rosetta Stone” of mergers. Previous attempts to model this merger included semi-
analytical calculations [23] and hydrodynamic simulations
(e.g., [24]). However, the hydrodynamic simulations fail to adequately resolve the atmosphere, i.e., the rapid transition between
the optically thick merger fluid and the optically thin, nearly
empty surroundings of the simulated stellar material. To
overcome this barrier, computational scientists intend to use
the adaptive mesh-refinement hydrodynamics code Octo-Tiger.
Using Octo-Tiger’s dynamic mesh refinement, the simulations
are able to resolve the atmosphere at a higher resolution
than ever before. Simulating the V1309 merger at high
resolution provides greater insight into the nature of the mass
flow and the consequent angular momentum losses. In this
paper, we analyze the performance of Octo-Tiger to identify
potential bottlenecks in the combined CPU and GPU long-term
production runs, where the atmosphere is maximally resolved.
Analyzing the performance is essential at this stage since this
model will serve as the necessary baseline for extending Octo-Tiger to include radiation transport, both for the V1309 model
and for other binary merger models.
IV. SOFTWARE FRAMEWORK
A. C++ standard library for parallelism and concurrency
HPX is the C++ standard library for parallelism and concurrency [1] and one of the distributed asynchronous many-task
(AMT) runtime systems. Other notable AMTs with distributed
capabilities are: Uintah [25], Chapel [11], Charm++ [10],