HPC Storage Service Autotuning Using
Variational-Autoencoder-Guided
Asynchronous Bayesian Optimization
Matthieu Dorier§, Romain Egele§, Prasanna Balaprakash, Jaehoon Koo,
Sandeep Madireddy, Srinivasan Ramesh, Allen D. Malony, and Rob Ross
Argonne National Laboratory, Lemont, IL – {mdorier,pbalapra,jkoo,smadireddy,rross}@anl.gov
Université Paris-Saclay, France – romain.egele@universite-paris-saclay.fr
University of Oregon, Eugene, OR – {sramesh,malony}@cs.uoregon.edu
§These authors contributed equally to the work.
Abstract—Distributed data storage services tailored to specific
applications have grown popular in the high-performance com-
puting (HPC) community as a way to address I/O and storage
challenges. These services offer a variety of specific interfaces,
semantics, and data representations. They also expose many
tuning parameters, making it difficult for their users to find
the best configuration for a given workload and platform.
To address this issue, we develop a novel variational-
autoencoder-guided asynchronous Bayesian optimization method
to tune HPC storage service parameters. Our approach uses
transfer learning to leverage prior tuning results and uses a
dynamically updated surrogate model to explore the large pa-
rameter search space in a systematic way.
We implement our approach within the DeepHyper open-
source framework, and apply it to the autotuning of a high-
energy physics workflow on Argonne’s Theta supercomputer.
We show that our transfer-learning approach enables a more
than 40× search speedup over random search, compared with
a 2.5× to 10× speedup when not using transfer learning.
Additionally, we show that our approach is on par with state-of-
the-art autotuning frameworks in speed and outperforms them
in resource utilization and parallelization capabilities.
Index Terms—HPC, Autotuning, Storage, I/O, Transfer Learn-
ing, Bayesian Optimization, DeepHyper, Mochi
I. INTRODUCTION
Distributed data and input/output (I/O) services have be-
come popular in high-performance computing (HPC) to re-
place traditional parallel file systems [1]. They range from
multiuser, high-speed storage systems such as burst buffers [2],
[3], [4], to transient, application-specific services providing
processing capabilities such as in situ analysis [5], [6], [7].
These systems aim to improve I/O and storage performance
by moving away from file-based interfaces and from the
POSIX semantics, instead providing specific interfaces and op-
timizations that can be tailored to individual applications. An
example of such a distributed storage service is HEPnOS [8],
an in-memory object store for high-energy physics (HEP)
applications developed by Argonne National Laboratory and
Fermilab.
Like parallel file systems, data services can be incredibly
complicated to configure and tune. Contrary to parallel file
systems, however, they typically live in user-space and provide
more ways for the user to configure them for their specific
use-case. They consist of many inter-related software compo-
nents that handle different aspects of the service (threading,
networking, storage, scheduling), each providing a number of
parameters that can be tuned for best performance. From the
composition of these building blocks, there emerge even more
ways to configure the whole service, such as deciding how
they share common resources (CPUs, cores, memory, network,
storage devices) and how they are deployed on the physical
hardware. Applications that use these services also become
more difficult to configure, especially as optimizations such
as batching, collective I/O, prefetching, and asynchronous I/O
come into play and as these applications are chained together
to form workflows. A configuration of the storage service that
works well for one step of the workflow may perform poorly
for the next, or at a different scale, or on another platform.
Given the complexity of these data services, manual tuning
is cumbersome and time consuming at best and can easily
lead to missed opportunities for better configurations. Hence,
a critical need exists for tools and methods that automatically
tune not just data services but the entire workflows that use
them, searching for well-performing configurations in a given
context, and doing so while consuming little time and few
resources.
Empirical performance tuning, also known as autotuning, is
an active research area in software optimization and a promis-
ing approach for HPC storage service tuning. In this approach,
the user exposes the tunable parameters and defines the range
of values that each parameter can take; a search method
is then used to explore the parameter space by executing
different parameter configurations on the target platform. The
challenge of HPC storage service autotuning stems from
the complexity of the workflow and the search space. First,
several tunable parameters can be interdependent, requiring an
execution of the complete workflow on the target platform for
a given parameter configuration. Consequently, each parameter
evaluation can become expensive. Second, the large number of
parameters gives rise to a large search space, which requires
sophisticated search methods that can find high-performing
configurations in a reasonable search time. Third, given the
availability of HPC resources, search methods should leverage
them to scale and reduce search time and improve solution
quality. Fourth, HPC storage service tuning is not a one-
time campaign. Due to changes in workloads, software, and
platforms, one needs to run autotuning regularly. While au-
totuning as a whole is a computationally expensive process,
the similarity in the autotuning tasks presents opportunities
for leveraging the knowledge gained from one autotuning
campaign to the next. Examples include (1) the user deciding
to increase the budget for tuning, but seeking to use the results
from a number of smaller autotuning runs that were performed
previously; (2) the user seeking to leverage autotuning results
from a small scale to speed up the autotuning for large-
scale runs; or (3) the user introducing new parameters for
autotuning but seeking to reuse the results for old parameters
from previous runs.
From a mathematical optimization perspective, the autotun-
ing problem can be formulated as a mixed-integer nonlinear
optimization problem with a computationally expensive black-
box objective function, one of the hardest classes of optimiza-
tion problems to solve. Bayesian optimization is one of the most
promising approaches for solving this class of problems [9], [10],
[11]. Typically, Bayesian optimization is applied in a sequen-
tial setting, where the search proposes one parameter configu-
ration for evaluation at each iteration. Distributed Bayesian
optimization methods leverage HPC resources to perform
simultaneous parameter evaluations to find high-performing
configurations in a short wall-clock time. Based on the way in
which the simultaneous parameter evaluations are performed,
distributed Bayesian optimization can be grouped into batch
synchronous and asynchronous methods. In the former, the
search selects a batch of parameter configurations and waits
until the evaluations are completed before proceeding to the
next iteration. This approach, however, wastes HPC resources
when evaluation times vary: the HPC nodes that complete their
evaluations faster must wait until all other evaluations have
completed. The asynchronous method overcomes this issue:
as soon as an evaluation is completed, the search method uses
the evaluation result and suggests a new parameter configura-
tion for evaluation. In our setting, given that we are optimizing
the run time, the configurations that complete their evaluation
faster will update the model more frequently and thus increase
the chance of sampling more high-performing configurations.
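For concreteness, the listing below sketches such an asynchronous loop in Python. It is not the DeepHyper implementation: the objective run_workflow, the parameter bounds, the random-forest surrogate, and the greedy acquisition are illustrative placeholders; the point is only that whenever any evaluation finishes, the surrogate is refit and a new configuration is submitted, so no worker waits for a batch to complete.

# Minimal sketch of an asynchronous Bayesian optimization loop (illustrative,
# not the DeepHyper implementation). Whenever any evaluation finishes, the
# surrogate model is refit and a new configuration is submitted immediately.
import numpy as np
from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait
from sklearn.ensemble import RandomForestRegressor

BOUNDS = [(1, 16), (1, 64), (1, 1024)]  # hypothetical integer parameter ranges

def run_workflow(config):
    """Placeholder: run the workflow with `config` and return its runtime."""
    raise NotImplementedError

def random_config(rng):
    return [int(rng.integers(lo, hi + 1)) for lo, hi in BOUNDS]

def suggest(model, rng, n_candidates=256):
    # Greedy acquisition: sample candidates, keep the lowest predicted runtime.
    cands = [random_config(rng) for _ in range(n_candidates)]
    return cands[int(np.argmin(model.predict(np.array(cands))))]

def async_bo(n_workers=8, max_evals=100, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # Start with one random configuration per worker.
        futures = {pool.submit(run_workflow, c): c
                   for c in (random_config(rng) for _ in range(n_workers))}
        while len(y) < max_evals:
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            for fut in done:                      # record finished evaluations
                X.append(futures.pop(fut))
                y.append(fut.result())
            model = RandomForestRegressor().fit(np.array(X), np.array(y))
            for _ in range(len(done)):            # keep every worker busy
                cfg = suggest(model, rng)
                futures[pool.submit(run_workflow, cfg)] = cfg
    return X[int(np.argmin(y))], min(y)

A tree-based surrogate is used in this sketch because it naturally handles the mixed integer and categorical parameters typical of such search spaces.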
In this paper we develop and apply a new transfer-learning-
based search method to tune the parameters of HPC storage
services. Our approach adopts a distributed, asynchronous
Bayesian optimization method that (1) transfers the results
from related autotuning runs; (2) relies on a dynamically
updated and computationally cheap surrogate model to learn
the relationship between input configurations and the observed
performance; (3) uses the surrogate model to navigate the
search space by simultaneously evaluating multiple input
configurations; and (4) finds high-performing configurations
by evaluating fewer input configurations on the platform.
Frameworks such as DeepHyper [12], which we rely on in
this work, provide distributed asynchronous Bayesian optimi-
zation (BO) capabilities.
The key novelty of our proposed method lies in the way
it performs systematic transfer learning that leverages the
results from similar or related tuning runs. This is achieved
by using a variational autoencoder, a neural-network-based
generative modeling approach, to capture the cross-correlations
of the high-performing configurations of prior autotuning
runs. The joint probability distribution of the high-performing
configurations is then used to guide the search within Bayesian
optimization. This enables the search to focus on promising
regions of the search space from the start and significantly
reduces the time required to find high-performing
configurations in new, related autotuning runs. Our
key contributions in this paper are the following.
We propose variational-autoencoder-guided asynchronous
Bayesian optimization (VAE-ABO), a new approach for tun-
ing the parameters of HPC storage services. We implement
VAE-ABO within the DeepHyper framework and make our
approach available as open-source software.
We demonstrate the efficacy of the proposed method in a
number of settings. Using a high-energy physics workflow,
we show that transfer learning speeds up the search for high-
performing configurations and enables faster convergence
towards the best configurations, increased resource utiliza-
tion, and an increased number of evaluations.
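As an illustration of the transfer-learning step described above, the sketch below fits a small variational autoencoder on the best configurations gathered from prior autotuning runs and samples new candidate configurations from its decoder. It is a minimal PyTorch sketch under simplifying assumptions (parameters rescaled to [0, 1], illustrative network sizes and training loop), not the exact VAE-ABO settings.

# Minimal PyTorch sketch of the VAE-based transfer-learning step (illustrative
# settings; assumes each configuration is rescaled to a vector in [0, 1]^dim).
import torch
import torch.nn as nn

class ConfigVAE(nn.Module):
    def __init__(self, dim, latent=4, hidden=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def fit_vae(top_configs, epochs=500, lr=1e-3):
    """top_configs: (n, dim) tensor of high-performing configurations
    collected from prior autotuning runs."""
    model = ConfigVAE(top_configs.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        recon, mu, logvar = model(top_configs)
        rec = nn.functional.mse_loss(recon, top_configs)
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = rec + kld
        opt.zero_grad(); loss.backward(); opt.step()
    return model

def sample_candidates(model, n):
    """Draw n candidates from the learned distribution of good configurations;
    they can be used to seed or bias a new Bayesian optimization run."""
    with torch.no_grad():
        z = torch.randn(n, model.mu.out_features)
        return model.dec(z)  # still in [0, 1]; rescale before evaluation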
II. MOTIVATING USE-CASE: HEPNOS AND THE NOVA
EVENT-SELECTION WORKFLOW
Since 2018, the SciDAC “HEP-on-HPC” [13] project has
brought together physicists and computer scientists from a
number of institutions to address the challenges of leveraging
HPC resources to solve large-scale HEP problems. One out-
come of this project was HEPnOS, a storage service designed
by Argonne National Laboratory to store and provide access
to billions of “event” records produced by particle accelerators.
HEPnOS serves as a centerpiece for a number of HEP analytic
workflows, one of which is the event-selection workflow [14]
for the NOvA experiment [15], [16].
As a highly configurable data service, HEPnOS can be
finely tuned for specific use cases. However, finding an op-
timal configuration for a particular application on a particular
platform remains a challenge. Researchers at Argonne and
Fermilab have spent the past few years trying to understand
how each parameter influences performance, relying on trial
and error, manual tuning, benchmarking, and profiling. In this
paper we aim to automate this tuning process by using
parameter-space exploration and Bayesian optimization.
This section dives into some technical details of HEPnOS
and the event-selection workflow relevant to our endeavor.
A. HEPnOS storage system
HEPnOS uses components from the Mochi project [1]. Thus
it relies on Mercury [17] for remote procedure calls and remote
direct memory access (RDMA), and on Argobots [18] for
thread management. The Margo [19] library binds Mercury
and Argobots together to provide a simple abstraction for
developing remotely accessible microservices.
On top of Margo, HEPnOS uses the Yokan microser-
vice [20] to provide key/value storage capabilities and the
Bedrock microservice [21] to provide bootstrapping and con-
figuration capabilities.
HEPnOS stores HEP data in the form of a hierarchy of
datasets, runs, subruns, events, and products, with the products
carrying most of the payload in the form of serialized C++
objects. These constructs are mapped onto a flat key/value
namespace in a distributed set of Yokan database instances.
For more technical details on how this mapping is done, we
refer the reader to HEPnOS's extensive online documentation
(https://hepnos.readthedocs.io).
Of interest to the present work is the fact that some
configuration parameters of HEPnOS are critical to its per-
formance, including its number of database instances, how
these databases map to threads, how it schedules its operations,
and low-level decisions such as whether to use blocking
epoll or busy spinning for network progress.
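As an illustration, such parameters can be exposed to a search method through a search-space description like the Python sketch below; the parameter names, ranges, and categorical values are hypothetical stand-ins, not the exact HEPnOS options.

# Hypothetical search-space description for HEPnOS-side parameters; the names,
# ranges, and categorical values below are illustrative, not the exact options.
hepnos_search_space = {
    "num_databases":        (1, 32),          # Yokan database instances
    "num_rpc_threads":      (0, 63),          # threads handling RPCs
    "databases_per_thread": (1, 8),           # how databases map to threads
    "scheduling_policy":    ["fifo", "fifo_wait", "prio_wait"],
    "busy_spinning":        [True, False],    # busy spin vs. blocking epoll
}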
Thanks to the Mochi Bedrock component, which provides
configuration and bootstrapping capabilities to Mochi services,
all these parameters can easily be provided from a single JSON
file that describes which components form the service and
how they should be configured. This extensive configurability
is critical to the work presented in this paper, and is what
distinguishes a storage service such as HEPnOS from more
traditional storage systems such as a parallel file system.
B. NOvA event-selection workflow
As shown in Figure 1, HEPnOS is used as a distributed, in-
memory storage system for HEP workflows. In this work we
focus on the NOvA event-selection workflow, which consists
of two steps: data loading and parallel event processing. A
set of HDF5 files containing tables of event data is read by
a parallel application, the data loader, which converts them
into arrays of C++ objects that are then stored in HEPnOS as
products associated with events. In the event-selection step, all
the events contained in a given dataset are read and processed
in parallel to search for events matching specific criteria.
1) Data loading
In practice, the event selection workflow does not operate
directly on data as it is produced by the particle accelerator.
The raw data is first stored into files, either in HDF5 [22]
or in ROOT [23] format. While these files can be shared
across institutions easily, they produce an I/O bottleneck when
it comes to reading them from a large number of processes.
Hence in HEPnOS-based workflows they need to be loaded
into HEPnOS prior to their data being processed.
The dataloader is in charge of this task. It is a parallel, MPI-
based application that takes a list of HDF5 files, converts them
into C++ objects, and stores them into HEPnOS. Since the
amount of data differs across HDF5 files, the dataloader does
not distribute the work in a static manner across its processes.
Instead, a list of files is maintained in one process, and all the
processes pull work from this shared list of files until all the
files have been loaded.
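The sketch below illustrates this dynamic work-pulling scheme with mpi4py; it is not the actual dataloader code, and load_hdf5_file and store_into_hepnos are hypothetical placeholders for the real file-reading and HEPnOS-storing steps.

# Minimal sketch of dynamic work distribution (illustrative, not the actual
# dataloader): rank 0 holds the shared list of HDF5 files and hands one file
# at a time to whichever worker asks for more work.
from mpi4py import MPI

TAG_REQUEST, TAG_WORK, TAG_DONE = 1, 2, 3

def load_hdf5_file(path):      # placeholder: read tables, build objects
    raise NotImplementedError

def store_into_hepnos(objs):   # placeholder: store events/products in HEPnOS
    raise NotImplementedError

def run(files):
    comm = MPI.COMM_WORLD
    rank, nworkers = comm.Get_rank(), comm.Get_size() - 1
    if rank == 0:
        # Master: serve files until the list is exhausted, then stop each worker.
        pending, done = list(files), 0
        while done < nworkers:
            status = MPI.Status()
            comm.recv(source=MPI.ANY_SOURCE, tag=TAG_REQUEST, status=status)
            if pending:
                comm.send(pending.pop(), dest=status.Get_source(), tag=TAG_WORK)
            else:
                comm.send(None, dest=status.Get_source(), tag=TAG_DONE)
                done += 1
    else:
        # Worker: keep requesting files until the master says none are left.
        while True:
            comm.send(None, dest=0, tag=TAG_REQUEST)
            status = MPI.Status()
            path = comm.recv(source=0, status=status)
            if status.Get_tag() == TAG_DONE:
                break
            store_into_hepnos(load_hdf5_file(path))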
Several optimizations are available in the dataloader, includ-
ing batching of events and products (the mapping of events
and products to HDF5 files on the one hand and to databases
in HEPnOS on the other hand is such that all the events
coming from the same file will end up in the same database,
and similarly for products) and overlapping the loading of
a file with the storage of data from a previous file into
HEPnOS. These optimizations can be turned on or off and
configured in various ways. Along with job-related parameters
(number of processes, number of threads, mapping to CPUs), the
dataloader offers many configuration parameters that can be
tuned to achieve good performance.
2) Parallel event processing
The second step of the workflow, parallel event processing
(PEP), consists of reading the events and some products
associated with them, and performing some computation on
the data to determine events of interest. If events are stored
in N databases in HEPnOS, N processes of the PEP appli-
cation will list them, each accessing one database. They will
end up filling a local list of events (<dataset id, run
number, subrun number, event number> tuples).
All the processes pull events either from their local queue
or by requesting batches from other processes. Each event is
processed first by loading the data products associated with it,
then by performing computation on these products.
The PEP application provides a benchmark of its I/O part
(loading events and products) that simulates computation. We
use this benchmark in place of the real PEP application in
this paper since we are interested in autotuning only the I/O
aspects of this workflow.
Just as in HEPnOS and the data loader, optimizations are
in place to improve I/O performance in the PEP application
and benchmark: look-ahead prefetching when reading from
HEPnOS, batching of events when they are loaded and when
they are sent from one process to another, batching of data
products, and multithreaded processing of events inside each
process. All these optimizations come with their own set of
tunable parameters that can influence the overall performance
of the workflow.
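To illustrate one of these optimizations, the sketch below shows a minimal look-ahead prefetcher in Python (the real PEP code is a C++ application): a background thread loads the products of upcoming events into a bounded queue while the main thread processes the current event, with the queue depth playing the role of a tunable prefetching parameter. fetch_products and process are hypothetical placeholders.

# Minimal sketch of look-ahead prefetching (illustrative only). A background
# thread fetches products of upcoming events into a bounded queue while the
# main thread processes the current event.
import queue
import threading

def fetch_products(event):     # placeholder: load data products from HEPnOS
    raise NotImplementedError

def process(event, products):  # placeholder: compute on the products
    raise NotImplementedError

def process_events(events, prefetch_depth=8):
    ready = queue.Queue(maxsize=prefetch_depth)  # bounded look-ahead window

    def prefetcher():
        for ev in events:
            ready.put((ev, fetch_products(ev)))  # blocks when the window is full
        ready.put(None)                          # sentinel: no more events

    threading.Thread(target=prefetcher, daemon=True).start()
    while (item := ready.get()) is not None:
        event, products = item
        process(event, products)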
C. Challenges of (auto)tuning the workflow
The parameters of a single workflow component cannot
be tuned independently of one another, nor can they
be tuned independently of the parameters of other
components. As an example, what could seem like an optimal
number of threads in the PEP application could in turn
influence the optimal number of databases in HEPnOS, which
could influence the batch size used by the dataloader when
storing events into HEPnOS. Manually tuning such a workflow
becomes rapidly intractable, in particular as new optimizations
(hence new parameters) are implemented, as new steps are
added to the workflow, as the workflow scales up, or as it
is ported to a new platform. This situation motivated us to
investigate ways of automatically tuning such a workflow
using parameter space exploration and machine learning.
Parameter-space exploration enables defining the list of
tunable parameters and the range of values that each parameter
can take, so that a search method can then explore the resulting
space automatically.