Distributed, combined CPU and GPU profiling
within HPX using APEX
Patrick Diehl§, Gregor Daiß, Kevin Huck, Dominic Marcello, Sagiv Shiber§, Hartmut Kaiser,
Juhan Frank§, Geoffrey C. Clayton§, and Dirk Pflüger
LSU Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, 70803 U.S.A
Email: patrickdiehl@lsu.edu
IPVS, University of Stuttgart, Stuttgart, 70174 Stuttgart, Germany
OACISS, University of Oregon, Eugene, OR, U.S.A.
§Department of Physics and Astronomy, Louisiana State University, Baton Rouge, LA, 70803 U.S.A.
Abstract—Benchmarking and comparing performance of a
scientific simulation across hardware platforms is a complex task.
When the simulation in question is constructed with an asyn-
chronous, many-task (AMT) runtime offloading work to GPUs,
the task becomes even more complex. In this paper, we discuss
the use of a uniquely suited performance measurement library,
APEX, to capture the performance behavior of a simulation built
on HPX, a highly scalable, distributed AMT runtime. We examine
the performance of the astrophysics simulation carried-out by
Octo-Tiger on two different supercomputing architectures. We
analyze the results of scaling and measurement overheads. In
addition, we look in-depth at two similarly configured executions
on the two systems to study how architectural differences affect
performance and identify opportunities for optimization. As one
such opportunity, we optimize the communication for the hydro
solver and investigate its performance impact.
Index Terms—CUDA, HPX, Performance Measurements
I. INTRODUCTION
Whether in CPU code, in GPU kernels, or in the inter-
node communication – performance bottlenecks in High-
Performance Computing (HPC) applications may be hidden
in any part of the program. There have been many attempts
to ease the development of HPC applications and to identify
and avoid as many bottlenecks as possible. One example is
the asynchronous many-task (AMT) system HPX [1], which
aims to solve some of the more common problems. It makes it
easier to overlap communication and computation, to employ
both CPUs and GPUs, and to avoid overhead for fine-grained
parallelism using lightweight threading.
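As a heavily simplified illustration of this programming model, the sketch below uses HPX futures to overlap a stand-in for communication with independent local computation. It is a minimal sketch rather than Octo-Tiger code: the functions fetch_halo and local_work are hypothetical placeholders, and the header paths follow recent HPX releases and may differ for older versions.

```cpp
// Minimal sketch (not Octo-Tiger code): overlapping "communication" and
// computation with HPX futures and lightweight tasks. Header paths follow
// recent HPX releases and may differ for older versions.
#include <hpx/hpx_init.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

#include <iostream>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a remote data transfer (e.g., a halo exchange).
std::vector<double> fetch_halo()
{
    return std::vector<double>(1024, 1.0);
}

// Hypothetical stand-in for local computation that does not need the halo.
double local_work()
{
    std::vector<double> v(1 << 20, 0.5);
    return std::accumulate(v.begin(), v.end(), 0.0);
}

int hpx_main()
{
    // Launch the "communication" as a lightweight HPX task ...
    hpx::future<std::vector<double>> halo = hpx::async(fetch_halo);

    // ... and keep the worker threads busy with independent computation.
    double partial = local_work();

    // Attach a continuation: it runs as another lightweight task as soon as
    // the halo data is ready; no OS thread blocks while waiting.
    hpx::future<double> result = halo.then(
        [partial](hpx::future<std::vector<double>>&& f) {
            std::vector<double> data = f.get();
            return partial + std::accumulate(data.begin(), data.end(), 0.0);
        });

    std::cout << "result: " << result.get() << std::endl;
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    // Starts the HPX runtime, which then calls hpx_main().
    return hpx::init(argc, argv);
}
```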
Even with HPX, of course, it is still perfectly possible
to introduce performance bottlenecks into one’s application.
Being able to collect performance measurements to profile an
HPX application remains important. For AMT systems such as
HPX, it is beneficial to have a profiling tool that understands
the task-based nature of the runtime system – for example,
call stacks themselves are less useful: A system thread may
jump back and forth between various HPX tasks as they are
yielded and resumed, and the call stack itself may be dozens
of levels of runtime functions that are of no particular interest
to the application developer.
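One common way to recover meaningful names in such a setting is to annotate tasks at the application level. The sketch below is a minimal, hypothetical example (not taken from Octo-Tiger) using HPX's annotated_function wrapper, which attaches a label to a task so that a task-aware profiler such as APEX can report time per labeled task rather than per runtime stack frame; the exact header and namespace of the wrapper may differ between HPX versions.

```cpp
// Minimal sketch (not Octo-Tiger code): labeling HPX tasks so that a
// task-aware profiler such as APEX can attribute time to application-level
// tasks instead of opaque runtime call stacks. The solver lambdas are
// hypothetical placeholders; older HPX releases expose the wrapper as
// hpx::util::annotated_function and may use different header paths.
#include <hpx/hpx_init.hpp>
#include <hpx/include/async.hpp>
#include <hpx/functional.hpp>

#include <iostream>

int hpx_main()
{
    // Each task carries a human-readable label that shows up in the
    // recorded profile or trace instead of anonymous runtime frames.
    hpx::future<int> grav = hpx::async(hpx::annotated_function(
        [] { /* gravity solver work would run here */ return 1; },
        "gravity_solver"));

    hpx::future<int> hydro = hpx::async(hpx::annotated_function(
        [] { /* hydro solver work would run here */ return 2; },
        "hydro_solver"));

    std::cout << grav.get() + hydro.get() << std::endl;
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```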
Collecting these measurements in a profiling run is challenging, as one not only needs a profiling framework that supports
both CPU and GPU code as well as the distributed collection
of profiling data across many compute nodes, but one also needs
to keep any overheads introduced by the profiling itself to an
absolute minimum. Otherwise, it would not only distort the
collected measurements, but also make large, distributed runs
infeasible, rendering us unable to detect potential performance
bottlenecks that only appear at scale.
HPX is integrated with a performance measurement library,
APEX (Automatic Performance for Exascale), which was
designed specifically for the HPX runtime and the above
requirements. In previous work, APEX was used together
with the HPX performance counters to collect performance
measurements for Octo-Tiger [2], an astrophysics application
which is built upon HPX and contains optimized kernels for
both CPUs and GPUs [3], [4]. Octo-Tiger is capable of running
the same kernels on the CPUs and GPUs simultaneously (on
different data) depending on the load. In that work, profiling
data of Octo-Tiger was gathered with APEX in distributed
CPU-only runs where the energy usage, the idle rate, and
overhead of the HPX AGAS (Active Global Address Space)
was analyzed [5]. Furthermore, combined CPU-GPU profiling
runs have been performed on Summit, analyzing the perfor-
mance behavior of Octo-Tiger’s new CUDA hydro module
in different configurations for simple benchmark scenarios [6].
All these previous efforts inspire this new work, collecting
performance measurements on both CPU and GPU during a
full-scale production-scenario run on Piz Daint (Cray XC50
with one 12-core Intel® Xeon® E5-2690 + one NVIDIA®
Tesla® P100 per node) [7] and Summit (IBM® AC922 with
two 22-core Power9 + six NVIDIA® Tesla® V100 per node)
[8].
The purpose of this work is thus twofold: First, we collect
data that we can actually use to improve Octo-Tiger by identi-
fying potential bottlenecks. To do so, we collect measurements
running the production scenario for 40 time-steps both on
Summit and Piz Daint, using 48 compute nodes on Piz Daint
and 8 compute nodes on Summit (resulting in 48 GPUs in
either case). With those measurements, we can investigate the
specific parts of Octo-Tiger on two distinct architectures in a
distributed CPU/GPU run, providing insights into the different
runtime behavior of Octo-Tiger regarding GPU performance,
CPU performance, and communication. For example, the
communication seems to have a larger overhead on Piz Daint.
Second, we are showcasing the feasibility of APEX for
large-scale runs, collecting combined CPU and GPU perfor-
mance measurements, showing that the overhead introduced by
the profiling itself is small enough to handle large production-
scale scenarios. To this end, we are running the scenario both
with and without APEX profiling enabled for a scaling run
on each machine, to see both the overhead on a few compute
nodes, and the runtime behavior when scaling to more nodes
(with up to 2000 compute nodes on Piz Daint). Furthermore,
we repeat these overhead measurements on Piz Daint for a
CPU-only run to determine the performance impact of the
NVIDIA® CUDA Profiling Tools Interface (CUPTI), which
is used to collect the GPU performance data.
To highlight the need for low profiling overhead, we can
look at the short runtimes for each time-step of Octo-Tiger:
During the test runs on Summit, we gathered 5 GB of data across
all eight runs. For each run, 40 time-steps were executed.
Each time-step takes about 0.72 s on 128 Summit nodes and
consists of 6 iterations of the gravity solver, 3 iterations of the
hydro solver, and all required communication. On Piz Daint
we collected 55 GB of data in total. Here, each time-step
on 2000 nodes took 0.79 s. As time-steps are serial in nature,
these iterations are our smallest parallel unit. As each time-step
only runs for a few hundred milliseconds, overheads
introduced by the profiling can be very noticeable even if they
only amount to a few milliseconds in total.
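To put such numbers into perspective, the relative per-time-step profiling overhead can be expressed as the increase of the profiled step time over the unprofiled baseline; the values in the example below are purely illustrative and are not measurements from this work:
\[
\text{overhead} \;=\; \frac{t_{\text{profiled}} - t_{\text{baseline}}}{t_{\text{baseline}}},
\qquad \text{e.g.,} \quad \frac{0.74\,\text{s} - 0.72\,\text{s}}{0.72\,\text{s}} \approx 2.8\%.
\]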
The remainder of this work is structured as follows: In
Section II, we take a brief look at profiling solutions in other
AMT frameworks. We then introduce the scientific scenario
which we are simulating with Octo-Tiger in Section III.
This is the scenario we also used to collect the profiling
data by running it for 40 timesteps with APEX enabled.
Section IV in turn introduces Octo-Tiger itself, as well as the
utilized software stack. In Section V, we show and discuss the
collection of the performance measurements for Octo-Tiger. In
Section VI, we test communication optimization and analyze
the performance improvement. Finally, we conclude the paper
in Section VII.
II. RELATED WORK
For the related work, we focus on AMTs with distributed
capabilities which are: Legion [9], Charm++ [10], Chapel [11],
and UPC++ [12]. For a more detailed review, we refer to [13].
Legion [9] provides Legion Prof for combined CPU and
GPU profiling which is compiled into all builds. Enabling
the profiler produces log files which can be viewed using
the profiler. Charm++ [10] provides Charm debug [14] and
the Projections framework [15] for performance analysis
and visualization. Chapel [11] provides ChplBlamer [16]
for profiling. UPC++ does not seem to provide a dedicated
profiling tool; its documentation recommends using any
profiler that supports C++.
Like HPX, nearly all of these runtimes provide a special-
ized tool that has been designed to deal with the particular
challenges of AMTs in general, and the needs of the runtime
system in particular. APEX is the specialized tool in the case of
HPX and provides measurement and analysis capabilities
similar to those of the above tools, including flat profiling, tracing, sampling,
task graphs/trees, and concurrency graphs. In addition, APEX
provides support for several programming models/abstractions
with or without HPX, including CUDA, HIP, OpenMP,
OpenACC, Kokkos, POSIX threads, and C++ threads. APEX
does not provide analysis tools directly, but rather uses com-
monly accepted formats and targets both HPC performance
analysis tools (ParaProf [17], Vampir [18], Perfetto [19]) and
standard data analysis tools (Python, Graphviz [20]).
III. SCIENTIFIC APPLICATION
Stellar mergers are mysterious phenomena that pack a broad
range of physical processes into a small volume and a fleeting
time duration. With the proliferation of deep wide-field, time-
domain surveys, we have been catching on camera a vastly
increased number of outbursts, many of which have been
interpreted as stellar mergers. The best case, so far, of an ob-
served merger is V1309 Sco, a contact binary identified using
a recent survey database, the Optical Gravitational Lensing
Experiment [21]. Fortunately, not only was the merger itself
observed, but archival data from other observing programs
also enabled the reconstruction of the light curve years before
the merger. During the merger itself, the system brightness
increased by 4 magnitudes, with a peak luminosity in the
red visible light [22]. This complete record of observations
has led V1309 Sco to be termed the “Rosetta Stone” of mergers. Previous attempts to model this merger included semi-
analytical calculations [23] and hydrodynamic simulations
(e.g., [24]). However, the hydrodynamic simulations fail to adequately resolve the atmosphere, i.e., the rapid transition between
the optically thick merger fluid and the optically thin, nearly
empty surroundings of the simulated stellar material. To
overcome this barrier, computational scientists intend to use
the adaptive mesh-refinement hydrodynamics code Octo-Tiger.
Using Octo-Tiger’s dynamic mesh refinement, the simulations
are able to resolve the atmosphere at a higher resolution
than ever before. Simulating the V1309 merger at high
resolution provides greater insight into the nature of the mass
flow and the consequent angular momentum losses. In this
paper, we analyze the performance of Octo-Tiger to identify
potential bottlenecks in the combined CPU and GPU long-term
production runs, where the atmosphere is maximally resolved.
Analyzing the performance is essential at this stage since this
model will serve as the necessary baseline for extending Octo-Tiger to include radiation transport, both for the V1309 model
and for other binary merger models.
IV. SOFTWARE FRAMEWORK
A. C++ standard library for parallelism and concurrency
HPX is the C++ standard library for parallelism and concurrency [1] and one of the distributed asynchronous many-task
(AMT) runtime systems. Other notable AMTs with distributed
capabilities are: Uintah [25], Chapel [11], Charm++ [10],