size. Although easily reproducible, its design limits its broad
applicability [28]. HPC settings have historically been
structured around relatively stable technologies and practices
(e.g., monolithic parallel applications that use MPI).
Recent work [9] has proposed separating the system-specific
implementation from the specification of the benchmarks, so as
to target different runtime systems. This is also the philosophy
adopted in this work and our benchmarks could easily be
implemented within the framework in [9], which currently
does not include workflow-specific benchmarks.
Some researchers have investigated the automatic generation
of representative benchmarks. For instance, Logan et al. [29]
leverage the notion of skeletons to study the I/O behaviors of
real-world applications. Their approach consists of removing the
computational parts of parallel applications, so that only
communication and I/O operations remain. Users can then run
the resulting benchmarks, which exhibit the complex I/O and
communication patterns of real-world applications, without
having to experience long execution times. Similarly, Hao
et al. [30] leverage execution traces from real-world parallel
applications to automatically generate synthetic MPI programs
that mimic the I/O behaviors of these applications without
having to execute their computational segments.
In this work, we focus on scientific workflow applications.
Some studies have proposed to use particular domain-specific
workflows as benchmarks [14]–[17], [31]. For instance, Krishnan
et al. [31] propose a benchmark for complex clinical
diagnostic pipelines, in which a particular configuration of a
production pipeline is used as a benchmark. Although these
benchmarks are by definition representative of a real-world
application, they are limited to particular scientific domains
and application configurations. To address this limitation, Katz
et al. [18] and Coleman et al. [19] have proposed approaches
for generating synthetic workflow configurations based on rep-
resentative, commonly used workflow patterns. The limitation
of these works is that they only generate abstract specifications of
workflow task graphs, which are only one of the required
components of an executable workflow benchmark. To the
best of our knowledge, this study is the first to propose a
generic method for generating executable workflow benchmarks
that users can configure to be representative of a wide range
of relevant scientific workflow configurations.
III. APPROACH
Developing a workflow benchmark requires developing
(i) representative benchmarks of workflow tasks and (ii) rep-
resentative benchmarks of workflows that consist of multiple
tasks with data dependencies. We discuss our approach for
each of the above in the next two sections.
A. Developing Representative Workflow Task Benchmarks
Workflow tasks have different characteristics in terms of
compute-, memory-, and I/O-intensiveness [10], which impact
workflow performance differently on different architectures.
Consequently, a workflow benchmark generation tool should
be configurable by the user so that the generated benchmark
workflow tasks can exhibit any combination of these characteristics.
We have developed a generic benchmark (implemented in
Python) that, based on user-provided parameters, launches
instances of different I/O-, CPU-, and/or memory-intensive
operations. The benchmark executions proceeds in three con-
secutive phases1:
#1 Read input from disk: Given a binary file, this phase
of the benchmark simply opens the file with the “rb”
option and calls file.readlines() to read the file
content from disk, in a single thread.
#2 Compute: This phase is configured by a number of cores
(n), a total amount of CPU work (cpuwork) to perform,
a total amount of memory work to perform (memwork),
and the fraction of the computation’s instructions that
correspond to non-memory operations (f), which, for
now, must be a multiple of 0.1. This phase starts n groups
of 10 threads, where threads in the same group are pinned
to the same CPU core (using set_affinity). Within
each group, 10 × (1 − f) threads run a memory-intensive
executable (compiled C++) that performs random accesses
to positions in an array, adding one unit to each accessed
position, until the total amount of memory work (memwork)
has been performed; and 10 × f threads run a CPU-intensive
executable (compiled C++) that calculates an increasingly
precise value of π until the specified total amount of
computation (cpuwork) has been performed. In this manner,
our benchmark uses both CPU and memory resources, and
parameter f defines the relative use of these resources. For
both CPU and memory, the threads are instances of Python's
subprocess calling the benchmark executables.
#3 Write output to disk: Given a number of bytes, this
phase simply opens an empty binary file with the “wb”
option and calls file.write() to write random bytes
to disk in a single thread.
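For illustration, the following is a minimal Python sketch of these three phases. All file names, executable names, and parameter names are hypothetical; the pinning call shown (os.sched_setaffinity, Linux-specific) is only one possible way to realize the per-core pinning described above, and how the total cpuwork and memwork are divided among the processes of a group is omitted.

    import os
    import subprocess

    def read_input(path):
        # Phase #1: read the input file from disk in a single thread.
        with open(path, "rb") as f:
            f.readlines()

    def compute(n, cpuwork, memwork, frac,
                cpu_bin="./cpu_kernel", mem_bin="./mem_kernel"):
        # Phase #2: start n groups of 10 processes, one group per core.
        # In each group, 10 * frac processes run the CPU-intensive kernel
        # (pi computation) and 10 * (1 - frac) run the memory-intensive
        # kernel (random array accesses). Both kernels are assumed to be
        # compiled C++ executables that take a work amount as argument.
        procs = []
        n_cpu = int(round(10 * frac))  # frac must be a multiple of 0.1
        for core in range(n):
            for i in range(10):
                binary, work = (cpu_bin, cpuwork) if i < n_cpu else (mem_bin, memwork)
                p = subprocess.Popen([binary, str(work)])
                os.sched_setaffinity(p.pid, {core})  # pin to the group's core
                procs.append(p)
        for p in procs:
            p.wait()

    def write_output(path, nbytes):
        # Phase #3: write nbytes of random data to disk in a single thread.
        with open(path, "wb") as f:
            f.write(os.urandom(nbytes))

    if __name__ == "__main__":
        read_input("task_input.bin")
        compute(n=4, cpuwork=100, memwork=100, frac=0.8)
        write_output("task_output.bin", 10 ** 7)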
The approach described above is relatively simple and makes
several assumptions that do not necessarily hold for real-world
workflow tasks. For instance, I/O operations could overlap
with computation, and there could be many I/O and compute
phases. Furthermore, our implementation of the compute phase
(phase #2) on the CPU uses multiple threads that can have
complex interference in terms of resource usage (e.g., cache
vs. main memory use). Moreover, due to our use of 10
threads per core, there is context-switching overhead that likely
does not occur with real-world workflow tasks. Finally, because
each group contains only 10 threads, f can only take discrete
values (multiples of 0.1), which does not make it possible to
capture arbitrary mixes of non-memory and memory operations. Nevertheless,
we claim that this approach makes it possible to instantiate
1We do not use stress-test tools such as stress-ng (e.g., using
--vm-bytes or --vm-keep for creating memory pressure, or --hdd
or --hdd-bytes for performing I/O operations) because such tools do not
generate a precise number of memory operations or produce actual files
that could be used downstream in the workflow.