Spot-on: A Checkpointing Framework for Fault-Tolerant Long-running Workloads on Cloud Spot Instances

Ashley Tung†, Haiyan Wang†, Yue Li†, Zhong Wang∗, and Jingchao Sun†

†MemVerge Inc., Milpitas, CA

∗Department of Energy Joint Genome Institute, Berkeley, CA

yue.li@memverge.com

Abstract—Spot instances offer a cost-effective solution for applications running in the cloud computing environment. However, it is challenging to run long-running jobs on spot instances because they are subject to unpredictable evictions. Here, we present Spot-on, a generic software framework that supports fault-tolerant long-running workloads on spot instances through checkpoint and restart. Spot-on leverages existing checkpointing packages and is compatible with the major cloud vendors. Using a genomics application as a test case, we demonstrated that Spot-on supports both application-specific and transparent checkpointing methods. Compared to running applications on on-demand instances, it allows these workloads to complete at a significantly reduced computing cost. Compared to running applications using application-specific checkpoint mechanisms, transparent checkpoint-protected applications reduce runtime by up to 40%, leading to further cost savings of up to 86%.

I. INTRODUCTION

Major cloud vendors offer “spot virtual machine (VM) instances” that utilize spare computing resources at steep discounts [1] [2] [3]. However, a spot instance can be reclaimed during a resource shortage with only a short notice, seconds or minutes before the reclamation. Upon reclamation, all workloads running on the instance are terminated, and the instance is destroyed. This unpredictable nature makes it challenging to run long-running workloads on spot instances without checkpointing. Azure spot instances are not unlike Amazon EC2’s spot market used in Proteus [14] and Tributary [15]. What sets Azure spot instances apart is that there is no need to bid for any new resources. Rather, the user chooses a VM size and simply has the option to turn it into a spot instance.

Checkpoint solutions developed in high-performance computing systems can be adapted for the cloud environment [5] [8] [6]. Both application-specific and transparent checkpointing technologies may be leveraged so that checkpoints can be made on one spot instance and moved to another for restart when the previous instance is reclaimed. However, implementing a practical, user-friendly solution requires careful integration with the cloud platforms and schedulers to properly schedule, store, transfer, and restart checkpoints. In this work, we implemented a practical framework called “Spot-on” by integrating with the major cloud vendors’ spot instance schedulers to evaluate the impact of checkpointing mechanisms on the running time and cost of long-running workloads. As a case study, we used a long-running metagenome assembly workload (metaSPAdes [9]) to compare the checkpointing methods on Azure on-demand and spot instances. We found that both checkpointing methods enable fault-tolerant metaSPAdes workloads on spot instances and reduce cost. Compared to using application-specific checkpointing mechanisms on spot instances, metaSPAdes protected by transparent checkpointing takes less time to finish, which leads to further cost reductions.

II. ARCHITECTURE AND DESIGN

Fig. 1. The Spot-on Checkpoint and Restart Workflow across spot instances.

The Spot-on checkpoint and restart workflow framework is illustrated in Fig. 1. When a workload is launched on the spot instance, a checkpoint coordinator, Spot-on, is launched simultaneously. Running the coordinator does not incur additional monetary cost to the user, as it is essentially a script running in parallel to metaSPAdes. The coordinator is responsible for checkpointing and restoration: it schedules periodic checkpointing and monitors VM eviction events using APIs provided by the cloud. Upon detecting an eviction event, the coordinator creates a “termination checkpoint” in addition to the periodic checkpoints. Unlike the periodic checkpoints, termination checkpoints are opportunistic, because they may fail given the short eviction notification (e.g.

arXiv:2210.02589v1 [cs.DC] 5 Oct 2022
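
The coordinator behavior described in Section II (periodic checkpoints, eviction monitoring, and an opportunistic termination checkpoint) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `run_coordinator`, `eviction_pending`, `take_checkpoint`, and `workload_done` are hypothetical, and polling the Azure Scheduled Events metadata endpoint for a `Preempt` event is an assumption about which cloud API is used; the text only says "APIs provided by the cloud".

```python
import json
import time
import urllib.request
from typing import Callable, Dict, List

# Azure Instance Metadata Service "Scheduled Events" endpoint. A spot
# eviction is reported there with EventType "Preempt". Using this endpoint
# is an assumption of this sketch, not a detail taken from the paper.
AZURE_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def poll_azure_scheduled_events(timeout_s: float = 2.0) -> Dict:
    """Fetch the scheduled-events document (only reachable inside an Azure VM)."""
    req = urllib.request.Request(AZURE_EVENTS_URL, headers={"Metadata": "true"})
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)


def eviction_pending(events: Dict) -> bool:
    """Return True if the events document announces a spot eviction."""
    return any(e.get("EventType") == "Preempt" for e in events.get("Events", []))


def run_coordinator(
    poll_events: Callable[[], Dict],
    take_checkpoint: Callable[[str], None],  # hypothetical checkpoint hook
    workload_done: Callable[[], bool],       # hypothetical completion check
    interval_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run until the workload finishes or an eviction is announced.

    Returns the sequence of checkpoint attempts, for inspection.
    """
    log: List[str] = []
    last = clock()
    while not workload_done():
        if eviction_pending(poll_events()):
            # Termination checkpoints are opportunistic: the eviction
            # notice may be too short for the checkpoint to complete.
            try:
                take_checkpoint("termination")
                log.append("termination")
            except Exception:
                log.append("termination-failed")
            break
        if clock() - last >= interval_s:
            take_checkpoint("periodic")
            log.append("periodic")
            last = clock()
        sleep(poll_s)
    return log
```

On an Azure VM, one might call `run_coordinator(poll_azure_scheduled_events, take_checkpoint, workload_done, interval_s=1800)`, where the 30-minute interval is an arbitrary illustrative value. Passing the polling function, clock, and sleep as parameters keeps the loop independent of any one cloud and lets it be exercised without a real VM. If the termination checkpoint fails, a restart would fall back to the most recent periodic checkpoint.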
