HPC Storage Service Autotuning Using
Variational-Autoencoder-Guided
Asynchronous Bayesian Optimization
Matthieu Dorier§, Romain Egele§, Prasanna Balaprakash, Jaehoon Koo,
Sandeep Madireddy, Srinivasan Ramesh, Allen D. Malony, and Rob Ross
Argonne National Laboratory, Lemont, IL – {mdorier,pbalapra,jkoo,smadireddy,rross}@anl.gov
Université Paris-Saclay, France – romain.egele@universite-paris-saclay.fr
University of Oregon, Eugene, OR – {sramesh,malony}@cs.uoregon.edu
§These authors contributed equally to the work.
Abstract—Distributed data storage services tailored to specific
applications have grown popular in the high-performance com-
puting (HPC) community as a way to address I/O and storage
challenges. These services offer a variety of specific interfaces,
semantics, and data representations. They also expose many
tuning parameters, making it difficult for their users to find
the best configuration for a given workload and platform.
To address this issue, we develop a novel variational-
autoencoder-guided asynchronous Bayesian optimization method
to tune HPC storage service parameters. Our approach uses
transfer learning to leverage prior tuning results and uses a
dynamically updated surrogate model to explore the large pa-
rameter search space in a systematic way.
We implement our approach within the DeepHyper open-
source framework, and apply it to the autotuning of a high-
energy physics workflow on Argonne’s Theta supercomputer.
We show that our transfer-learning approach enables a more
than 40× search speedup over random search, compared with
a 2.5× to 10× speedup when not using transfer learning.
Additionally, we show that our approach is on par with state-of-
the-art autotuning frameworks in speed and outperforms them
in resource utilization and parallelization capabilities.
Index Terms—HPC, Autotuning, Storage, I/O, Transfer Learn-
ing, Bayesian Optimization, DeepHyper, Mochi
I. INTRODUCTION
Distributed data and input/output (I/O) services have be-
come popular in high-performance computing (HPC) to re-
place traditional parallel file systems [1]. They range from
multiuser, high-speed storage systems such as burst buffers [2],
[3], [4], to transient, application-specific services providing
processing capabilities such as in situ analysis [5], [6], [7].
These systems aim to improve I/O and storage performance
by moving away from file-based interfaces and from the
POSIX semantics, instead providing specific interfaces and op-
timizations that can be tailored to individual applications. An
example of such a distributed storage service is HEPnOS [8],
an in-memory object store for high-energy physics (HEP)
applications developed by Argonne National Laboratory and
Fermilab.
Like parallel file systems, data services can be incredibly
complicated to configure and tune. Contrary to parallel file
systems, however, they typically live in user-space and provide
more ways for the user to configure them for their specific
use-case. They consist of many inter-related software compo-
nents that handle different aspects of the service (threading,
networking, storage, scheduling), each providing a number of
parameters that can be tuned for best performance. From the
composition of these building blocks, there emerge even more
ways to configure the whole service, such as deciding how
they share common resources (CPUs, cores, memory, network,
storage devices) and how they are deployed on the physical
hardware. Applications that use these services also become
more difficult to configure, especially as optimizations such
as batching, collective I/O, prefetching, and asynchronous I/O
come into play and as these applications are chained together
to form workflows. A configuration of the storage service that
works well for one step of the workflow may perform poorly
for the next, or at a different scale, or on another platform.
Given the complexity of these data services, manual tuning
is cumbersome and time consuming at best and can easily
lead to missed opportunities for better configurations. Hence,
a critical need exists for tools and methods that automatically
tune not just data services but the entire workflows that use
them, searching for well-performing configurations in a given
context, and doing so while consuming little time and few
resources.
Empirical performance tuning, also known as autotuning, is
an active research area in software optimization and a promis-
ing approach for HPC storage service tuning. In this approach,
the user exposes the tunable parameters and defines the range
of values that each parameter can take; a search method
is then used to explore the parameter space by executing
different parameter configurations on the target platform. The
challenge of HPC storage service autotuning stems from
the complexity of the workflow and the search space. First,
several tunable parameters can be interdependent, requiring an
execution of the complete workflow on the target platform for
a given parameter configuration. Consequently, each parameter
evaluation can become expensive. Second, the large number of
parameters gives rise to a large search space, which requires
sophisticated search methods that can find high-performing
configurations in a reasonable search time. Third, given the
availability of HPC resources, search methods should leverage
them to scale and reduce search time and improve solution
quality. Fourth, HPC storage service tuning is not a one-
time campaign. Due to changes in workloads, software, and
platforms, one needs to run autotuning regularly. While au-
totuning as a whole is a computationally expensive process,
the similarity in the autotuning tasks presents opportunities
for leveraging the knowledge gained from one autotuning
campaign to the next. Examples include (1) the user deciding
to increase the budget for tuning, but seeking to use the results
from a number of smaller autotuning runs that were performed
previously; (2) the user seeking to leverage autotuning results
from a small scale to speed up the autotuning for large-
scale runs; or (3) the user introducing new parameters for
autotuning but seeking to reuse the results for old parameters
from previous runs.
From a mathematical optimization perspective, the autotun-
ing problem can be formulated as a mixed-integer nonlinear
optimization problem with a computationally expensive black-
box objective function, one of the hardest classes of optimiza-
tion problems to solve. Bayesian optimization is one of the most
promising approaches for solving this class of problems [9], [10],
[11]. Typically, Bayesian optimization is applied in a sequen-
tial setting, where the search proposes one parameter configu-
ration for evaluation at each iteration. Distributed Bayesian
optimization methods leverage HPC resources to perform
simultaneous parameter evaluations to find high-performing
configurations in a short wall-clock time. Based on the way in
which the simultaneous parameter evaluations are performed,
distributed Bayesian optimization can be grouped into batch
synchronous and asynchronous methods. In the former, the
search selects a batch of parameter configurations and waits
until the evaluations are completed before proceeding to the
next iteration. This approach, however, wastes HPC resources
when evaluation times vary: the HPC nodes that complete their
evaluations faster must wait until all other evaluations have
completed. The asynchronous method overcomes this issue:
as soon as an evaluation is completed, the search method uses
the evaluation result and suggests a new parameter configura-
tion for evaluation. In our setting, given that we are optimizing
the run time, the configurations that complete their evaluation
faster will update the model more frequently and thus increase
the chance of sampling more high-performing configurations.
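For concreteness, the listing below sketches such an asynchronous loop in Python. It is not the DeepHyper implementation: the objective run_workflow, the parameter bounds, the random-forest surrogate, and the greedy acquisition are illustrative placeholders; the point is only that whenever any evaluation finishes, the surrogate is refit and a new configuration is submitted, so no worker waits for a batch to complete.

# Minimal sketch of an asynchronous Bayesian optimization loop (illustrative,
# not the DeepHyper implementation). Whenever any evaluation finishes, the
# surrogate model is refit and a new configuration is submitted immediately.
import numpy as np
from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait
from sklearn.ensemble import RandomForestRegressor

BOUNDS = [(1, 16), (1, 64), (1, 1024)]  # hypothetical integer parameter ranges

def run_workflow(config):
    """Placeholder: run the workflow with `config` and return its runtime."""
    raise NotImplementedError

def random_config(rng):
    return [int(rng.integers(lo, hi + 1)) for lo, hi in BOUNDS]

def suggest(model, rng, n_candidates=256):
    # Greedy acquisition: sample candidates, keep the lowest predicted runtime.
    cands = [random_config(rng) for _ in range(n_candidates)]
    return cands[int(np.argmin(model.predict(np.array(cands))))]

def async_bo(n_workers=8, max_evals=100, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # Start with one random configuration per worker.
        futures = {pool.submit(run_workflow, c): c
                   for c in (random_config(rng) for _ in range(n_workers))}
        while len(y) < max_evals:
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            for fut in done:                      # record finished evaluations
                X.append(futures.pop(fut))
                y.append(fut.result())
            model = RandomForestRegressor().fit(np.array(X), np.array(y))
            for _ in range(len(done)):            # keep every worker busy
                cfg = suggest(model, rng)
                futures[pool.submit(run_workflow, cfg)] = cfg
    return X[int(np.argmin(y))], min(y)

A tree-based surrogate is used in this sketch because it naturally handles the mixed integer and categorical parameters typical of such search spaces.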
In this paper we develop and apply a new transfer-learning-
based search method to tune the parameters of HPC storage
services. Our approach adopts a distributed, asynchronous
Bayesian optimization method that (1) transfers the results
from related autotuning runs; (2) relies on a dynamically
updated and computationally cheap surrogate model to learn
the relationship between input configurations and the observed
performance; (3) uses the surrogate model to navigate the
search space by simultaneously evaluating multiple input
configurations; and (4) finds high-performing configurations
by evaluating fewer input configurations on the platform.
Frameworks such as DeepHyper [12], which we rely on in
this work, provide distributed asynchronous Bayesian optimi-
zation (BO) capabilities.
The key novelty of our proposed method lies in the way
it performs systematic transfer learning that leverages the
results from similar or related tuning runs. This is achieved
by using a variational autoencoder, a neural-network-based
generative modeling approach, to capture the cross-correlations
of the high-performing configurations of prior autotuning
runs. The joint probability distribution of the high-performing
configurations is then used to guide the search within Bayesian
optimization. This enables the search to focus on promising
regions of the search space from the start and significantly
reduces the time required to find high-performing
configurations in new, related autotuning runs. Our
key contributions in this paper are the following.
We propose variational-autoencoder-guided asynchronous
Bayesian optimization (VAE-ABO), a new approach for tun-
ing the parameters of HPC storage services. We implement
VAE-ABO within the DeepHyper framework and make our
approach available as open-source software.
We demonstrate the efficacy of the proposed method in a
number of settings. Using a high-energy physics workflow,
we show that transfer learning speeds up the search for high-
performing configurations and enables faster convergence
towards the best configurations, increased resource utiliza-
tion, and an increased number of evaluations.
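As an illustration of the transfer-learning step described above, the sketch below fits a small variational autoencoder on the best configurations gathered from prior autotuning runs and samples new candidate configurations from its decoder. It is a minimal PyTorch sketch under simplifying assumptions (parameters rescaled to [0, 1], illustrative network sizes and training loop), not the exact VAE-ABO settings.

# Minimal PyTorch sketch of the VAE-based transfer-learning step (illustrative
# settings; assumes each configuration is rescaled to a vector in [0, 1]^dim).
import torch
import torch.nn as nn

class ConfigVAE(nn.Module):
    def __init__(self, dim, latent=4, hidden=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def fit_vae(top_configs, epochs=500, lr=1e-3):
    """top_configs: (n, dim) tensor of high-performing configurations
    collected from prior autotuning runs."""
    model = ConfigVAE(top_configs.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        recon, mu, logvar = model(top_configs)
        rec = nn.functional.mse_loss(recon, top_configs)
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = rec + kld
        opt.zero_grad(); loss.backward(); opt.step()
    return model

def sample_candidates(model, n):
    """Draw n candidates from the learned distribution of good configurations;
    they can be used to seed or bias a new Bayesian optimization run."""
    with torch.no_grad():
        z = torch.randn(n, model.mu.out_features)
        return model.dec(z)  # still in [0, 1]; rescale before evaluation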
II. MOTIVATING USE-CASE: HEPNOS AND THE NOVA
EVENT-SELECTION WORKFLOW
Since 2018, the SciDAC “HEP-on-HPC” [13] project has
brought together physicists and computer scientists from a
number of institutions to address the challenges of leveraging
HPC resources to solve large-scale HEP problems. One out-
come of this project was HEPnOS, a storage service designed
by Argonne National Laboratory to store and provide access
to billions of “event” records produced by particle accelerators.
HEPnOS serves as a centerpiece for a number of HEP analytic
workflows, one of which is the event-selection workflow [14]
for the NOvA experiment [15], [16].
As a highly configurable data service, HEPnOS can be
finely tuned for specific use cases. However, finding an op-
timal configuration for a particular application on a particular
platform remains a challenge. Researchers at Argonne and
Fermilab have spent the past few years trying to understand
how each parameter influences performance, relying on trial
and error, manual tuning, benchmarking, and profiling. In this
paper we aim to automate this tuning process by using
parameter-space exploration and Bayesian optimization.
This section dives into some technical details of HEPnOS
and the event-selection workflow relevant to our endeavor.
A. HEPnOS storage system
HEPnOS uses components from the Mochi project [1]. Thus
it relies on Mercury [17] for remote procedure calls and remote
direct memory access (RDMA), and on Argobots [18] for
thread management. The Margo [19] library binds Mercury
and Argobots together to provide a simple abstraction for
developing remotely accessible microservices.
On top of Margo, HEPnOS uses the Yokan microser-
vice [20] to provide key/value storage capabilities and the
Bedrock microservice [21] to provide bootstrapping and con-
figuration capabilities.
HEPnOS stores HEP data in the form of a hierarchy of
datasets, runs, subruns, events, and products, with the products
carrying most of the payload in the form of serialized C++
objects. These constructs are mapped onto a flat key/value
namespace in a distributed set of Yokan database instances.
For more technical details on how this mapping is done, we
refer the reader to HEPnOS's extensive online documentation
(https://hepnos.readthedocs.io).
Of interest to the present work is the fact that some
configuration parameters of HEPnOS are critical to its per-
formance, including its number of database instances, how
these databases map to threads, how it schedules its operations,
and low-level decisions such as whether to use blocking
epoll or busy spinning for network progress.
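As an illustration, such parameters can be exposed to a search method through a search-space description like the Python sketch below; the parameter names, ranges, and categorical values are hypothetical stand-ins, not the exact HEPnOS options.

# Hypothetical search-space description for HEPnOS-side parameters; the names,
# ranges, and categorical values below are illustrative, not the exact options.
hepnos_search_space = {
    "num_databases":        (1, 32),          # Yokan database instances
    "num_rpc_threads":      (0, 63),          # threads handling RPCs
    "databases_per_thread": (1, 8),           # how databases map to threads
    "scheduling_policy":    ["fifo", "fifo_wait", "prio_wait"],
    "busy_spinning":        [True, False],    # busy spin vs. blocking epoll
}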
Thanks to the Mochi Bedrock component, which provides
configuration and bootstrapping capabilities to Mochi services,
all these parameters can easily be provided from a single JSON
file that describes which components form the service and
how they should be configured. This extensive configurability
is critical to the work presented in this paper, and is what
distinguishes a storage service such as HEPnOS from more
traditional storage systems such as a parallel file system.
B. NOvA event-selection workflow
As shown in Figure 1, HEPnOS is used as a distributed, in-
memory storage system for HEP workflows. In this work we
focus on the NOvA event-selection workflow, which consists
of two steps: data loading and parallel event processing. A
set of HDF5 files containing tables of event data is read by
a parallel application, the data loader, which converts them
into arrays of C++ objects that are then stored in HEPnOS as
products associated with events. In the event-selection step, all
the events contained in a given dataset are read and processed
in parallel to search for events matching specific criteria.
1) Data loading
In practice, the event selection workflow does not operate
directly on data as it is produced by the particle accelerator.
The raw data is first stored into files, either in HDF5 [22]
or in ROOT [23] format. While these files can be shared
across institutions easily, they produce an I/O bottleneck when
it comes to reading them from a large number of processes.
Hence in HEPnOS-based workflows they need to be loaded
into HEPnOS prior to their data being processed.
The dataloader is in charge of this task. It is a parallel, MPI-
based application that takes a list of HDF5 files, converts them
into C++ objects, and stores them into HEPnOS. Since the
amount of data differs across HDF5 files, the dataloader does
not distribute the work in a static manner across its processes.
Instead, a list of files is maintained in one process, and all the
processes pull work from this shared list of files until all the
files have been loaded.
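The sketch below illustrates this dynamic work-pulling scheme with mpi4py; it is not the actual dataloader code, and load_hdf5_file and store_into_hepnos are hypothetical placeholders for the real file-reading and HEPnOS-storing steps.

# Minimal sketch of dynamic work distribution (illustrative, not the actual
# dataloader): rank 0 holds the shared list of HDF5 files and hands one file
# at a time to whichever worker asks for more work.
from mpi4py import MPI

TAG_REQUEST, TAG_WORK, TAG_DONE = 1, 2, 3

def load_hdf5_file(path):      # placeholder: read tables, build objects
    raise NotImplementedError

def store_into_hepnos(objs):   # placeholder: store events/products in HEPnOS
    raise NotImplementedError

def run(files):
    comm = MPI.COMM_WORLD
    rank, nworkers = comm.Get_rank(), comm.Get_size() - 1
    if rank == 0:
        # Master: serve files until the list is exhausted, then stop each worker.
        pending, done = list(files), 0
        while done < nworkers:
            status = MPI.Status()
            comm.recv(source=MPI.ANY_SOURCE, tag=TAG_REQUEST, status=status)
            if pending:
                comm.send(pending.pop(), dest=status.Get_source(), tag=TAG_WORK)
            else:
                comm.send(None, dest=status.Get_source(), tag=TAG_DONE)
                done += 1
    else:
        # Worker: keep requesting files until the master says none are left.
        while True:
            comm.send(None, dest=0, tag=TAG_REQUEST)
            status = MPI.Status()
            path = comm.recv(source=0, status=status)
            if status.Get_tag() == TAG_DONE:
                break
            store_into_hepnos(load_hdf5_file(path))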
Several optimizations are available in the dataloader, includ-
ing batching of events and products (the mapping of events
and products to HDF5 files on the one hand and to databases
in HEPnOS on the other hand is such that all the events
coming from the same file will end up in the same database,
and similarly for products) and overlapping the loading of
a file with the storage of data from a previous file into
HEPnOS. These optimizations can be turned on or off and
configured in various ways. Along with job-related parameters
(number of processes, number of threads, mapping to CPUs), the
dataloader offers many configuration parameters that can be
tuned to achieve good performance.
2) Parallel event processing
The second step of the workflow, parallel event processing
(PEP), consists of reading the events and some products
associated with them, and performing some computation on
the data to determine events of interest. If events are stored
in N databases in HEPnOS, N processes of the PEP appli-
cation will list them, each accessing one database. They will
end up filling a local list of events (<dataset id, run
number, subrun number, event number> tuples).
All the processes pull events either from their local queue
or by requesting batches from other processes. Each event is
processed first by loading the data products associated with it,
then by performing computation on these products.
The PEP application provides a benchmark of its I/O part
(loading events and products) that simulates computation. We
use this benchmark in place of the real PEP application in
this paper since we are interested in autotuning only the I/O
aspects of this workflow.
Just as in HEPnOS and the data loader, optimizations are
in place to improve I/O performance in the PEP application
and benchmark: look-ahead prefetching when reading from
HEPnOS, batching of events when they are loaded and when
they are sent from one process to another, batching of data
products, and multithreaded processing of events inside each
process. All these optimizations come with their own set of
tunable parameters that can influence the overall performance
of the workflow.
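To illustrate one of these optimizations, the sketch below shows a minimal look-ahead prefetcher in Python (the real PEP code is a C++ application): a background thread loads the products of upcoming events into a bounded queue while the main thread processes the current event, with the queue depth playing the role of a tunable prefetching parameter. fetch_products and process are hypothetical placeholders.

# Minimal sketch of look-ahead prefetching (illustrative only). A background
# thread fetches products of upcoming events into a bounded queue while the
# main thread processes the current event.
import queue
import threading

def fetch_products(event):     # placeholder: load data products from HEPnOS
    raise NotImplementedError

def process(event, products):  # placeholder: compute on the products
    raise NotImplementedError

def process_events(events, prefetch_depth=8):
    ready = queue.Queue(maxsize=prefetch_depth)  # bounded look-ahead window

    def prefetcher():
        for ev in events:
            ready.put((ev, fetch_products(ev)))  # blocks when the window is full
        ready.put(None)                          # sentinel: no more events

    threading.Thread(target=prefetcher, daemon=True).start()
    while (item := ready.get()) is not None:
        event, products = item
        process(event, products)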
C. Challenges of (auto)tuning the workflow
The parameters of a single workflow component cannot
be tuned independently of one another, nor can they
be tuned independently of the parameters of other
components. As an example, what could seem like an optimal
number of threads in the PEP application could in turn
influence the optimal number of databases in HEPnOS, which
could influence the batch size used by the dataloader when
storing events into HEPnOS. Manually tuning such a workflow
becomes rapidly intractable, in particular as new optimizations
(hence new parameters) are implemented, as new steps are
added to the workflow, as the workflow scales up, or as it
is ported to a new platform. This situation motivated us to
investigate ways of automatically tuning such a workflow
using parameter space exploration and machine learning.
Parameter-space exploration enables defining the list of
tunable parameters and the range of values that each parameter
can take, so that a search method can then explore the resulting
space automatically.