CPU performance, and communication. For example, the
communication seems to have a larger overhead on Piz Daint.
Second, we showcase the feasibility of APEX for large-scale runs by collecting combined CPU and GPU performance measurements and demonstrating that the overhead introduced by the profiling itself is small enough for large production-scale scenarios. To this end, we run the scenario both with and without APEX profiling enabled for a scaling run on each machine, to see both the overhead on a few compute nodes and the runtime behavior when scaling to more nodes (with up to 2000 compute nodes on Piz Daint). Furthermore, we repeat these overhead measurements on Piz Daint for a CPU-only run to determine the performance impact of the NVIDIA® CUDA™ Profiling Tools Interface (CUPTI), which is used to collect the GPU performance data.
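Conceptually, the GPU measurement path works as follows (a minimal sketch based on the public CUPTI activity API, not on APEX's actual sources; the record type CUpti_ActivityKernel4 and the simplified buffer handling are assumptions for illustration):

#include <cupti.h>
#include <cstdio>
#include <cstdlib>

// CUPTI asks the tool for an empty buffer to fill with activity records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords) {
    *size = 1024 * 1024;
    *buffer = (uint8_t *) malloc(*size);
    *maxNumRecords = 0;  // no per-buffer record limit
}

// CUPTI hands back a filled buffer; the tool walks the records in it.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize) {
    CUpti_Activity *record = nullptr;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) ==
           CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
            auto *k = (CUpti_ActivityKernel4 *) record;
            printf("kernel %s: %llu ns\n", k->name,
                   (unsigned long long) (k->end - k->start));
        }
    }
    free(buffer);
}

void enableGpuMeasurement() {
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
}

void flushGpuMeasurement() {  // called once at shutdown
    cuptiActivityFlushAll(0);
}

Because the activity records are delivered asynchronously in batches, much of the measurement cost is paid off the critical path of the application.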
To highlight the need for low profiling overhead, consider the short runtime of each Octo-Tiger time-step: During the test runs on Summit, we gathered 5 GB of data across all eight runs; for each run, 40 time-steps were executed. Each time-step takes about 0.72 s on 128 Summit nodes and consists of 6 iterations of the gravity solver, 3 iterations of the hydro solver, and all required communication. On Piz Daint, we collected 55 GB of data in total; here, each time-step on 2000 nodes took 0.79 s. As consecutive time-steps are inherently serial, the solver iterations within a time-step are our smallest parallel unit. Because each time-step runs for well under a second, overheads introduced by the profiling can be very noticeable even if they amount to only a few milliseconds per step.
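As a back-of-the-envelope illustration (the profiling cost here is an assumed figure, not a measurement):
\[
\frac{\Delta t_{\text{profiling}}}{t_{\text{step}}} = \frac{7\,\mathrm{ms}}{720\,\mathrm{ms}} \approx 1\%,
\]
so every few milliseconds of per-step overhead translate directly into roughly one percent of lost end-to-end performance.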
The remainder of this work is structured as follows: In
Section II, we take a brief look at profiling solutions in other
AMT frameworks. We then introduce the scientific scenario
which we are simulating with Octo-Tiger in Section III.
This is also the scenario we used to collect the profiling data, running it for 40 time-steps with APEX enabled.
Section IV in turn introduces Octo-Tiger itself, as well as the
utilized software stack. In Section V, we show and discuss the
collection of the performance measurements for Octo-Tiger. In
Section VI, we test a communication optimization and analyze the resulting performance improvement. Finally, we conclude the paper
in Section VII.
II. RELATED WORK
For the related work, we focus on AMTs with distributed capabilities: Legion [9], Charm++ [10], Chapel [11], and UPC++ [12]. For a more detailed review, we refer to [13]. Legion [9] provides Legion Prof for combined CPU and GPU profiling, which is compiled into all builds; enabling the profiler produces log files that can then be viewed with Legion Prof. Charm++ [10] provides CharmDebug [14] and the Projections framework [15] for performance analysis and visualization. Chapel [11] provides ChplBlamer [16] for profiling. UPC++ does not appear to have a dedicated profiling tool; instead, its documentation recommends any profiler that supports C++.
Like HPX, nearly all of these runtimes provide a specialized tool designed to deal with the particular challenges of AMTs in general and with the needs of the respective runtime system in particular. APEX is this specialized tool in the case of HPX; it provides measurement and analysis capabilities similar to those of the above tools, including flat profiling, tracing, sampling, task graphs/trees, and concurrency graphs. In addition, APEX
provides support for several programming models/abstractions
with or without HPX, including CUDA™, HIP, OpenMP,
OpenACC, Kokkos, POSIX threads, and C++ threads. APEX
does not provide analysis tools directly, but rather uses com-
monly accepted formats and targets both HPC performance
analysis tools (ParaProf [17], Vampir [18], Perfetto [19]) and
standard data analysis tools (Python, Graphviz [20]).
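To give a flavor of the instrumentation side, here is a minimal sketch using APEX's C++ API (names as declared in apex_api.hpp to the best of our knowledge, so treat the exact signatures as assumptions; when running under HPX, equivalent timers are created implicitly for every task, and no manual annotation is needed):

#include <apex_api.hpp>

void gravity_iteration() {
    // Start a named timer; APEX aggregates all invocations under this name.
    apex::profiler *p = apex::start("gravity solver iteration");
    // ... numerical work to be measured ...
    apex::stop(p);
}

int main(int argc, char **argv) {
    apex::init("main thread", 0, 1);  // thread name, comm rank, comm size
    gravity_iteration();
    apex::finalize();
    return 0;
}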
III. SCIENTIFIC APPLICATION
Stellar mergers are mysterious phenomena that pack a broad
range of physical processes into a small volume and a fleeting
time duration. With the proliferation of deep wide-field, time-
domain surveys, we have been catching on camera a vastly
increased number of outbursts, many of which have been
interpreted as stellar mergers. The best-observed case of a merger so far is V1309 Sco, a contact binary identified using a recent survey database, the Optical Gravitational Lensing Experiment [21]. Fortunately, not only was the merger itself observed, but archival data from other observing programs also enabled the reconstruction of the light curve for years before the merger. During the merger itself, the system brightness increased by 4 magnitudes, with a peak luminosity in red visible light [22]. This complete record of observations has led V1309 Sco to be termed the “Rosetta Stone” of mergers. Previous attempts to model this merger included semi-analytical calculations [23] and hydrodynamic simulations
(e.g., [24]). However, the hydrodynamic simulations fail to adequately resolve the atmosphere, i.e., the rapid transition between the optically thick merger fluid and the optically thin, nearly empty surroundings of the simulated stellar material. To
overcome this barrier, computational scientists intend to use
the adaptive mesh-refinement hydrodynamics code Octo-Tiger.
Using Octo-Tiger’s dynamic mesh refinement, the simulations
are able to resolve the atmosphere at a higher resolution
than ever before. Simulating the V1309 merger at high resolution provides greater insight into the nature of the mass flow and the consequent angular momentum losses. In this
paper, we analyze the performance of Octo-Tiger to identify
potential bottlenecks in the long-term combined CPU and GPU production runs, where the atmosphere is maximally resolved. Analyzing the performance is essential at this stage, since this model will serve as the necessary baseline for extending Octo-Tiger with radiation transport, both for the V1309 model and for other binary merger models.
IV. SOFTWARE FRAMEWORK
A. C++ standard library for parallelism and concurrency
HPX is the C++ standard library for parallelism and concurrency [1] and one of the distributed asynchronous many-task (AMT) runtime systems. Other notable AMTs with distributed capabilities are: Uintah [25], Chapel [11], Charm++ [10],