size. Although easily reproducible, its design limits its broad
applicability [28]. HPC settings have historically been
structured around relatively stable technologies and practices
(e.g., monolithic parallel applications that use MPI).
Recent work [9] has proposed separating the system-specific
implementation from the specification of the benchmarks, so as
to target different runtime systems. This is also the philosophy
adopted in this work and our benchmarks could easily be
implemented within the framework in [9], which currently
does not include workflow-specific benchmarks.
Some researchers have investigated the automatic generation
of representative benchmarks. For instance, Logan et al. [29]
leverage the notion of skeletons to study the I/O behaviors of
real-world applications. Their approach consists of removing the
computational parts of parallel applications, so that only
communication and I/O operations remain. Users can then run
the resulting benchmarks, which exhibit the complex I/O and
communication patterns of real-world applications, without
having to experience long execution times. Similarly, Hao
et al. [30] leverage execution traces from real-world parallel
applications to automatically generate synthetic MPI programs
that mimic the I/O behaviors of these applications without
having to execute their computational segments.
In this work, we focus on scientific workflow applications.
Some studies have proposed to use particular domain-specific
workflows as benchmarks [14]–[17], [31]. For instance, Krishnan
et al. [31] propose a benchmark for complex clinical
diagnostic pipelines, in which a particular configuration of a
production pipeline is used as a benchmark. Although these
benchmarks are by definition representative of a real-world
application, they are limited to particular scientific domains
and application configurations. To address this limitation, Katz
et al. [18] and Coleman et al. [19] have proposed approaches
for generating synthetic workflow configurations based on rep-
resentative, commonly used workflow patterns. The limitation
of these works is that they only generate abstract specifications of
workflow task graphs, which are only one of the required
components of an executable workflow benchmark. To the
best of our knowledge, this study is the first to propose a
generic method for generating executable workflow benchmarks
that users can configure to be representative of a wide range
of relevant scientific workflow configurations.
III. APPROACH
Developing a workflow benchmark requires developing
(i) representative benchmarks of workflow tasks and (ii) rep-
resentative benchmarks of workflows that consist of multiple
tasks with data dependencies. We discuss our approach for
each of the above in the next two sections.
A. Developing Representative Workflow Task Benchmarks
Workflow tasks have different characteristics in terms of
compute-, memory-, and I/O-intensiveness [10], which impact
workflow performance differently on different architectures.
Consequently, a workflow benchmark generation tool should
be configurable by the user so that the generated benchmark
workflow tasks can exhibit any combination of these characteristics.
We have developed a generic benchmark (implemented in
Python) that, based on user-provided parameters, launches
instances of different I/O-, CPU-, and/or memory-intensive
operations. The benchmark executions proceeds in three con-
secutive phases1:
#1 Read input from disk: Given a binary file, this phase
of the benchmark simply opens the file with the “rb”
option and calls file.readlines() to read the file
content from disk, in a single thread.
#2 Compute: This phase is configured by a number of cores
(n), a total amount of CPU work (cpuwork) to perform,
a total amount of memory work to perform (memwork),
and the fraction of the computation’s instructions that
correspond to non-memory operations (f), which, for
now, must be a multiple of 0.1. This phase starts n groups
of 10 threads, where threads in the same group are pinned
to the same CPU core (using set_affinity). Within
each group, 10 × (1 − f) threads run a memory-intensive
executable (compiled C++) that performs random accesses
to positions in an array, adding one unit to each accessed
position, until the total amount of memory work (memwork)
has been performed; and 10 × f threads run a CPU-intensive
executable (compiled C++) that calculates an increasingly
precise value of π until the specified total amount of
computation (cpuwork) has been performed. In this manner,
our benchmark uses both CPU and memory resources, and
parameter f defines the relative use of these resources. For
both CPU and memory, the threads are instances of Python's
subprocess calling the benchmark executables.
#3 Write output to disk: Given a number of bytes, this
phase simply opens an empty binary file with the “wb”
option and calls file.write() to write random bytes
to disk in a single thread.
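For illustration, the following is a minimal Python sketch of these three phases. All file names, executable names, and parameter names are hypothetical; the pinning call shown (os.sched_setaffinity, Linux-specific) is only one possible way to realize the per-core pinning described above, and how the total cpuwork and memwork are divided among the processes of a group is omitted.

    import os
    import subprocess

    def read_input(path):
        # Phase #1: read the input file from disk in a single thread.
        with open(path, "rb") as f:
            f.readlines()

    def compute(n, cpuwork, memwork, frac,
                cpu_bin="./cpu_kernel", mem_bin="./mem_kernel"):
        # Phase #2: start n groups of 10 processes, one group per core.
        # In each group, 10 * frac processes run the CPU-intensive kernel
        # (pi computation) and 10 * (1 - frac) run the memory-intensive
        # kernel (random array accesses). Both kernels are assumed to be
        # compiled C++ executables that take a work amount as argument.
        procs = []
        n_cpu = int(round(10 * frac))  # frac must be a multiple of 0.1
        for core in range(n):
            for i in range(10):
                binary, work = (cpu_bin, cpuwork) if i < n_cpu else (mem_bin, memwork)
                p = subprocess.Popen([binary, str(work)])
                os.sched_setaffinity(p.pid, {core})  # pin to the group's core
                procs.append(p)
        for p in procs:
            p.wait()

    def write_output(path, nbytes):
        # Phase #3: write nbytes of random data to disk in a single thread.
        with open(path, "wb") as f:
            f.write(os.urandom(nbytes))

    if __name__ == "__main__":
        read_input("task_input.bin")
        compute(n=4, cpuwork=100, memwork=100, frac=0.8)
        write_output("task_output.bin", 10 ** 7)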
The approach described above is relatively simple and makes
several assumptions that do not necessarily hold for real-world
workflow tasks. For instance, I/O operations could overlap
with computation, and there could be many I/O and compute
phases. Furthermore, our implementation of the compute phase
(phase #2) on the CPU uses multiple threads that can have
complex interference in terms of resource usage (e.g., cache
vs. main memory use). Moreover, due to our use of 10
threads per core, there is context-switching overhead that likely
does not occur with real-world workflow tasks. Finally, because
each group contains only 10 threads, f can only take discrete
values (multiples of 0.1), which does not make it possible to
capture arbitrary mixes of non-memory and memory operations. Nevertheless,
we claim that this approach makes it possible to instantiate
1We do not use stress-test tools such as stress-ng (e.g., using
--vm-bytes or --vm-keep for creating memory pressure, or --hdd
or --hdd-bytes for performing I/O operations) because such tools do not
generate a precise number of memory operations or produce actual files
that could be used downstream in the workflow.