WfBench: Automated Generation of
Scientific Workflow Benchmarks
Tainã Coleman, Henri Casanova†, Ketan Maheshwari‡, Loïc Pottier, Sean R. Wilkinson‡,
Justin Wozniak§, Frédéric Suter‡, Mallikarjun Shankar‡, Rafael Ferreira da Silva‡
University of Southern California, Marina del Rey, CA, USA; †University of Hawaii, Honolulu, HI, USA;
§Argonne National Laboratory, Lemont, IL, USA; ‡Oak Ridge National Laboratory, Oak Ridge, TN, USA
Abstract—The prevalence of scientific workflows with high computational demands calls for their execution on various distributed computing platforms, including large-scale leadership-class high-performance computing (HPC) clusters. To handle the deployment, monitoring, and optimization of workflow executions, many workflow systems have been developed over the past decade. There is a need for workflow benchmarks that can be used to evaluate the performance of workflow systems on current and future software stacks and hardware platforms.
We present a generator of realistic workflow benchmark specifications that can be translated into benchmark code to be executed with current workflow systems. Our approach generates workflow tasks with arbitrary performance characteristics (CPU, memory, and I/O usage) and with realistic task dependency structures based on those seen in production workflows. We present experimental results that show that our approach generates benchmarks that are representative of production workflows, and conduct a case study to demonstrate the use and usefulness of our generated benchmarks to evaluate the performance of workflow systems under different configuration scenarios.
Index Terms—scientific workflows, workflow benchmarks, distributed computing
I. INTRODUCTION
Scientific workflows have supported some of the most
significant discoveries of the past several decades [1] and are
executed in production daily to serve a wealth of scientific
domains. Many workflows have high computational and I/O
demands that warrant execution on large-scale parallel and
distributed computing platforms. Because of the difficulties
involved in deploying, monitoring, and optimizing workflow
executions on these platforms, the past decade has seen a
dramatic surge of workflow systems [2].
Given the diversity of production workflows, the range
of execution platforms, and the proliferation of workflow
systems, it is crucial to quantify and compare the levels of
performance that can be delivered to workflows by different
platform configurations, workflow systems, and combinations
thereof. As a result, the workflows community has recently
recognized the need for workflow benchmarks [3]. In this paper, we present a generator of realistic workflow benchmark specifications that can be translated into benchmark code to be executed with current workflow systems.
This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a non-exclusive, paid up, irrevocable, world-wide license to publish or reproduce the published form of the manuscript, or allow others to do so, for U.S. Government purposes. The DOE will provide public access to these results in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
A. Motivation
Application benchmarks have long been developed for the
purpose of identifying performance bottlenecks and comparing
HPC platforms. Benchmarks have been developed that stress
various aspects of the platform (e.g., speed of integer and
floating point operations, memory, I/O, and network latency
and throughput), and several benchmark suites have become popular and are commonly used [4]–[9]. A few of
these benchmarks capture some, but not all, of the relevant
features of production workflow applications: (i) A workflow
typically comprises tasks of many different “types”, i.e., that
correspond to computations with different I/O, CPU, GPU,
and memory consumption [10], [11]. (ii) Even tasks of the
same type, i.e., that are invocations of the same program
or function, can have different resource consumption based
on the workflow configuration (e.g., input parameters, input
dataset). (iii) In practice, several workflow tasks are often ex-
ecuted concurrently on a single compute node, causing perfor-
mance interference and exacerbating points (i) and (ii) above,
which impacts workflow execution time. (iv) In production,
workflows are executed using systems that orchestrate their
execution and that can be configured in various ways (e.g.,
task scheduling decisions); Thus, it is crucial for workflow
benchmarks to be seamlessly executable using a wide range
of these systems rather than being implemented using one
particular runtime system (e.g., as is the case for classical
HPC benchmarks implemented with MPI).
To further motivate the need for workflow benchmarks, we
present results obtained from the execution of the benchmarks
proposed in this work on two small 4-node (48 cores per node)
platforms with 2.6GHz Skylake and 2.8GHz Cascadelake
processors, provided by Chameleon Cloud [12]. Benchmarks
are executed using the Pegasus workflow system [13] and con-
figured for 18 different benchmark scenarios. In all scenarios
the same amount of compute work is performed, but using
two different numbers of workflow tasks (500 and 5,000),
three different total amounts of data to read/write from disk
(1 GB, 50 GB, and 100 GB), and three different ratios of com-
pute to memory operations performed by the workflow tasks
(cpu-bound, memory-bound, balanced). For each scenario we
generated benchmarks for workflow configurations that are representative of four different scientific workflow application domains (two from bioinformatics, one for astronomy, and one for seismology). All details regarding benchmark generation and configuration are given in Section III. Figure 1 shows the ratio between execution times (or makespans) obtained on the Cascadelake and the Skylake nodes.
[Figure 1: Makespan Ratio vs. Scenario for the SoyKb, 1000Genome, Montage, and Seismology workflows.]
Fig. 1. Makespan ratio between workflow executions on 4-node Cascadelake and Skylake platforms. The horizontal axis shows experimental scenarios sorted by increasing Seismology makespan ratios. (T: number of tasks; D: data footprint in GB; C: cpu-bound; M: memory-bound; B: balanced.) Values above (resp. below) y = 1 correspond to cases in which the Cascadelake execution is faster (resp. slower) than the Skylake execution.
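To make the 18-scenario grid concrete, the short Python sketch below enumerates the configurations using the labeling of Fig. 1. The variable names are assumptions made for this sketch only and do not correspond to the generator's actual interface:

# Illustrative enumeration of the 18 benchmark scenarios of Fig. 1.
# Names are assumptions for this sketch, not the generator's interface.
from itertools import product

task_counts = [500, 5000]            # T: number of workflow tasks
data_footprints_gb = [1, 50, 100]    # D: total data read/written from disk (GB)
cpu_memory_mixes = ["C", "M", "B"]   # C: cpu-bound, M: memory-bound, B: balanced

scenarios = [
    {"label": f"T-{t} D-{d} {m}", "tasks": t, "data_gb": d, "mix": m}
    for t, d, m in product(task_counts, data_footprints_gb, cpu_memory_mixes)
]
assert len(scenarios) == 18          # 2 x 3 x 3 configurations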
Two key observations can be made from these results. First,
results differ significantly across workflow configurations,
as seen in the width of the envelope. Second, trends are
difficult to explain. For instance, considering the SoyKb and
1000Genome data points, we see that for many scenarios they
are close to each other, for many scenarios the SoyKb data
point is well above the 1000Genome data point, and for many
other scenarios the situation is reversed. Overall, we find that
it is difficult to explain, let alone predict, workflow (relative)
makespans based on platform and workflow configurations.
Another example of this difficulty is the fact that Skylake leads
to faster executions for 13 of the 72 benchmark executions.
This is because these particular Skylake nodes happen to
have higher bandwidth disks than the Cascadelake nodes.
However, for some high-data scenarios (e.g., scenarios T-5000-
D-100-M and T-500-D-100-M) Cascadelake executions are
significantly faster. Furthermore, for the Seismology workflow
configuration, Cascadelake is always preferable even for high-
data scenarios. Workflow performance being difficult to predict
is one of the motivations for developing workflow benchmarks.
B. Contributions
In the workflows community, most researchers and prac-
titioners have resorted to using workflow instances from
real-world applications as benchmarks, sometimes including
these instances as part of benchmark suites [14]–[17]. One
drawback is that the obtained results are not generalizable,
especially because specific workflow instances are not con-
figurable and thus may not expose all relevant performance
behaviors or bottlenecks. Another drawback is that executing
these benchmarks requires installing many scientific software
dependencies (since the benchmark code is actual application
code) and scientific datasets. “Application skeletons” have
been developed that are representative of commonly used
workflow patterns and can be composed to generate synthetic
workflow specifications [18], [19]. These works provide some
basis for constructing task-dependency structures in workflow
benchmarks (to this end, this work builds on [19]), but they do
not provide fully-specified, let alone executable, benchmarks.
The key insight in this work is that it is possible to automate
the generation of representative workflow benchmarks that
can be executed on real platforms. The main contribution
is an approach that implements this automation and has the
following capabilities: (i) configurable to be representative of
a wide range of performance characteristics and structures;
(ii) instantiable to be representative of the performance char-
acteristics and structures of real-world workflow applications;
(iii) automatically translatable into executable benchmarks for
execution with arbitrary workflow systems. This approach
not only generates realistic workflow tasks with arbitrary
I/O, CPU, and memory demands (i.e., so as to enable weak
and strong scaling experiments), but also realistic workflow
task graphs that are based on those of real-world workflow
applications, and are agnostic to the workflow management system and independent of the underlying platform.
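As an illustration of what such a system-agnostic benchmark specification can look like, the sketch below builds a small fork-join task graph as plain Python data, annotating each task with CPU, memory, and I/O demands. The structure and field names (cpu_work, mem_work, input_bytes, and so on) are hypothetical and chosen for this sketch only; they are not the specification format actually produced by our generator.

# Hypothetical, minimal sketch of a system-agnostic benchmark specification.
# Field names and structure are illustrative only.

def make_fork_join_spec(width, cpu_work, mem_work, data_bytes):
    """Build a fork-join benchmark: one source task, `width` parallel
    tasks, and one sink task, each annotated with performance demands."""
    tasks = [{"id": "source", "cpu_work": cpu_work, "mem_work": mem_work,
              "input_bytes": 0, "output_bytes": data_bytes,
              "children": [f"work_{i}" for i in range(width)]}]
    for i in range(width):
        tasks.append({"id": f"work_{i}", "cpu_work": cpu_work,
                      "mem_work": mem_work, "input_bytes": data_bytes,
                      "output_bytes": data_bytes, "children": ["sink"]})
    tasks.append({"id": "sink", "cpu_work": cpu_work, "mem_work": mem_work,
                  "input_bytes": width * data_bytes, "output_bytes": 0,
                  "children": []})
    return {"name": "fork_join_benchmark", "tasks": tasks}

spec = make_fork_join_spec(width=4, cpu_work=10**9, mem_work=10**8,
                           data_bytes=50 * 10**6)

A translator for a given workflow system could then walk such a structure and emit that system's native workflow description, binding each task to a generic task benchmark such as the one described in Section III.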
The experimental evaluation of our proposed approach is
twofold. First, we assess the ability of our generated work-
flow benchmarks to mimic the performance characteristics of
production workflow applications. We do so by demonstrating
that I/O, CPU, and memory utilization for the generated
workflow benchmark tasks corresponds to the performance
characteristics of tasks in real-world workflows, and that,
as a result, workflow benchmark executions have temporal
execution patterns similar to those of real-world workflows.
Second, we execute a set of workflow benchmarks generated
using our approach on the Summit leadership-class computing
system and compare measured performance to that derived
from analytical performance models.
The benefits of these benchmarks are manifold. Scientists
can compare the characteristics and performance of their
workflows to reference benchmark implementations; Workflow
systems developers can leverage these benchmarks as part of
their continuous integration processes; Computing facilities
can assess the performance of their systems beyond the
traditional HPC benchmark implementations; Workflow prac-
titioners can use these benchmarks to perform fair comparison
of competing workflow systems.
II. RELATED WORK
The field of HPC has seen the development of many bench-
mark suites [4]–[6], [20]–[27]. For instance, SPEC combines knowledge of performance evaluation with the resources to maintain a benchmarking effort by bringing together benchmarking and market experts, and customers' needs in HPC. In
1994, SPEC’s HPG emerged to extend the evaluation activities
by establishing and maintaining a benchmark suite represen-
tative of real-world HPC applications [27]. More recently,
the SPEChpc 2021 suite provides a group of strong-scaled
application-based benchmarks including metrics per workload
size. Although easily reproducible, its design limits its broad applicability [28]. HPC settings have been historically
structured around relatively stable technologies and practices
(e.g., monolithic parallel applications that use MPI).
Recent work [9] has proposed separating the system-specific
implementation from the specification of the benchmarks, so as
to target different runtime systems. This is also the philosophy
adopted in this work and our benchmarks could easily be
implemented within the framework in [9], which currently
does not include workflow-specific benchmarks.
Some researchers have investigated the automatic generation
of representative benchmarks. For instance, Logan et al. [29]
leverage the notion of skeletons to study the I/O behaviors of
real-world applications. Their approach consists in suppressing
computational parts of parallel applications, so that only
communication and I/O operations remain. Users can then run
the resulting benchmarks, which exhibit the complex I/O and
communication patterns of real-world applications, without
having to experience long execution times. Similarly, Hao
et al. [30] leverage execution traces from real-world parallel
applications to automatically generate synthetic MPI programs
that mimic the I/O behaviors of these applications without
having to execute their computational segments.
In this work, we focus on scientific workflow applications.
Some studies have proposed to use particular domain-specific
workflows as benchmarks [14]–[17], [31]. For instance, Kr-
ishnan et al. [31] propose a benchmark for complex clinical
diagnostic pipelines, in which a particular configuration of a
production pipeline is used as a benchmark. Although these
benchmarks are by definition representative of a real-world
application, they are limited to particular scientific domains
and application configurations. To address this limitation, Katz
et al. [18] and Coleman et al. [19] have proposed approaches
for generating synthetic workflow configurations based on rep-
resentative commonly used workflow patterns. The limitation is that these works only generate abstract specifications of workflow task graphs, which constitute only one of the required
components of an executable workflow benchmark. To the
best of our knowledge, this study is the first to propose a
generic workflow benchmark generation method that makes it
possible to generate executable workflow benchmarks that can
be configured by the users to be representative of a wide range
of relevant scientific workflow configurations.
III. APPROACH
Developing a workflow benchmark requires developing
(i) representative benchmarks of workflow tasks and (ii) rep-
resentative benchmarks of workflows that consist of multiple
tasks with data dependencies. We discuss our approach for
each of the above in the next two sections.
A. Developing Representative Workflow Task Benchmarks
Workflow tasks have different characteristics in terms of
compute-, memory-, and I/O-intensiveness [10], which impact
workflow performance differently on different architectures.
Consequently, a workflow benchmark generation tool should
be configurable, by the user, so that generated benchmark
workflow tasks can exhibit arbitrary such characteristics.
We have developed a generic benchmark (implemented in
Python) that, based on user-provided parameters, launches
instances of different I/O-, CPU-, and/or memory-intensive
operations. The benchmark execution proceeds in three consecutive phases¹ (a simplified sketch of the three phases is given after the list):
#1 Read input from disk: Given a binary file, this phase of the benchmark simply opens the file with the “rb” option and calls file.readlines() to read the file content from disk, in a single thread.
#2 Compute: This phase is configured by a number of cores (n), a total amount of CPU work to perform (cpuwork), a total amount of memory work to perform (memwork), and the fraction of the computation's instructions that correspond to non-memory operations (f), which, for now, must be a multiple of 0.1. This phase starts n groups of 10 threads, where threads in the same group are pinned to the same CPU core (using set_affinity). Within each group, 10 × (1 − f) threads run a memory-intensive executable (compiled C++) that performs random accesses to positions in an array, adding one unit to each accessed position, until the total amount of memory work (memwork) has been performed; and 10 × f threads run a CPU-intensive executable (compiled C++) that calculates an increasingly precise value of π until the specified total amount of computation (cpuwork) has been performed. In this manner, our benchmark uses both CPU and memory resources, and parameter f defines the relative use of these resources. For both CPU and memory, the threads are instances of Python's subprocess calling the benchmark executables.
#3 Write output to disk: Given a number of bytes, this phase simply opens an empty binary file with the “wb” option and calls file.write() to write random bytes to disk in a single thread.
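Below is the simplified sketch of the three phases referenced above. It assumes hypothetical compiled executables named ./cpu_bench and ./mem_bench for the C++ kernels, approximates core pinning with Linux's os.sched_setaffinity, launches the per-core children directly as subprocesses rather than through thread groups, and omits error handling; it is not our actual implementation.

# Simplified sketch of the three benchmark phases (illustrative, Linux-only).
# ./cpu_bench and ./mem_bench are hypothetical names for the compiled C++ kernels.
import os
import subprocess

def read_phase(path):
    # Phase 1: read the entire input file from disk in a single thread.
    with open(path, "rb") as f:
        f.readlines()

def compute_phase(n_cores, cpu_work, mem_work, f):
    # Phase 2: for each core, launch 10 pinned children:
    # 10*f CPU-bound instances and 10*(1 - f) memory-bound instances.
    children = []
    for core in range(n_cores):
        n_cpu = round(10 * f)                    # f is a multiple of 0.1
        for i in range(10):
            cmd = (["./cpu_bench", str(cpu_work)] if i < n_cpu
                   else ["./mem_bench", str(mem_work)])
            p = subprocess.Popen(cmd)
            os.sched_setaffinity(p.pid, {core})  # pin this child to `core`
            children.append(p)
    for p in children:
        p.wait()

def write_phase(path, num_bytes):
    # Phase 3: write random bytes to the output file in a single thread.
    with open(path, "wb") as f:
        f.write(os.urandom(num_bytes))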
The above approach is relatively simple and makes several
assumptions that do not necessarily hold true for real-world
workflow tasks. For instance, I/O operations could overlap
with computation, and there could be many I/O and compute
phases. Furthermore, our implementation of the compute phase
(phase #2) on the CPU uses multiple threads that can have
complex interference in terms of resource usage (e.g., cache
vs. main memory use). In addition, due to our use of 10
threads per core, there is context-switching overhead that likely
does not occur with real-world workflow tasks. Finally, due to
our use of only 10 threads, f can only take discrete values
(multiples of 0.1), which does not make it possible to capture
arbitrary non-memory/memory operation mixes. Nevertheless,
we claim that this approach makes it possible to instantiate
¹We do not use stress test tools such as stress-ng (e.g., using --vm-bytes or --vm-keep for creating memory pressure, or --hdd or --hdd-bytes for performing I/O operations) as it does not generate a precise amount of memory operations or actual files that could be used downstream in the workflow.