Spot-on: A Checkpointing Framework for
Fault-Tolerant Long-running Workloads on Cloud
Spot Instances
Ashley Tung, Haiyan Wang, Yue Li, Zhong Wang, and Jingchao Sun
MemVerge Inc., Milpitas, CA
Department of Energy Joint Genome Institute, Berkeley, CA
yue.li@memverge.com
Abstract—Spot instances offer a cost-effective solution for ap-
plications running in the cloud computing environment. However,
it is challenging to run long-running jobs on spot instances
because they are subject to unpredictable evictions. Here, we
present Spot-on, a generic software framework that supports
fault-tolerant long-running workloads on spot instances through
checkpoint and restart. Spot-on leverages existing checkpointing
packages and is compatible with the major cloud vendors. Using a
genomics application as a test case, we demonstrated that Spot-on
supports both application-specific and transparent checkpointing
methods. Compared to running applications on on-demand
instances, Spot-on allows these workloads to complete at a
significantly reduced computing cost. Compared to running
applications using application-specific checkpoint mechanisms,
transparent checkpoint-protected applications reduce runtime by
up to 40%, leading to further cost savings of up to 86%.
I. INTRODUCTION
Major cloud vendors offer “spot virtual machine (VM) instances”
that utilize spare computing resources at steep discounts [1] [2] [3].
However, a spot instance can be reclaimed during a resource shortage,
with a notice that arrives only seconds or minutes before the
reclamation. Upon reclamation, all workloads running on the instance
are terminated, and the instance is destroyed. This unpredictability
makes it challenging to run long-running workloads on spot instances
without checkpoints. Azure spot instances are not unlike Amazon EC2’s
spot market used in Proteus [14] and Tributary [15]. What sets Azure
spot instances apart is that there is no need to bid for new
resources. Rather, the user chooses a VM size and simply has the
option to turn it into a spot instance.
Checkpoint solutions developed in high-performance computing systems
can be adapted for the cloud environment [5] [8] [6]. Both
application-specific and transparent checkpointing technologies may
be leveraged so that checkpoints can be made on one spot instance and
moved to another, where the workload restarts after the previous
instance is reclaimed. However, implementing a practical,
user-friendly solution requires careful integration with the cloud
platforms and schedulers to properly schedule, store, transfer, and
restart checkpoints. In this work, we implemented a practical
framework called “Spot-on” by integrating with a major cloud vendor’s
spot instance scheduler, and used it to evaluate the impact of
checkpointing mechanisms on the running time and cost of long-running
workloads. As a case study, we used a long-running metagenome
assembly workload (metaSPAdes [9]) to compare the checkpointing
methods on Azure on-demand and spot instances. We found that both
checkpointing methods enable fault-tolerant metaSPAdes workloads on
spot instances and reduce cost. Compared to using
application-specific checkpointing mechanisms on spot instances,
metaSPAdes protected by transparent checkpointing takes less time to
finish, which leads to further cost reductions.
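To make the idea of checkpointing on one instance and restarting on another concrete, the following is a minimal Python sketch of an application-specific checkpoint. It is an illustration only and is not taken from Spot-on or metaSPAdes: the function names (save_checkpoint, load_checkpoint, run_job), the JSON state layout, and the evict_at parameter that simulates a reclamation are all hypothetical, and the checkpoint path is assumed to live on storage that outlives the instance.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so that an eviction in the middle of a
    write cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run_job(path, num_steps, evict_at=None):
    """Toy long-running job: sums 0..num_steps-1, one step at a time.

    evict_at (hypothetical) simulates the instance being reclaimed just
    before that step. Returns the total, or None if evicted.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], num_steps):
        if evict_at is not None and step == evict_at:
            return None  # instance reclaimed; progress survives on disk
        state["total"] += step
        state["next_step"] = step + 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling run_job a second time with the same path stands in for the restart on a new instance: the job picks up at the saved step instead of starting over. A transparent checkpointing package would capture the process state without any such changes to the application code.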
II. ARCHITECTURE AND DESIGN
Fig. 1. The Spot-on Checkpoint and Restart Workflow across spot instances.
The Spot-on checkpoint and restart workflow is illustrated in Fig. 1.
When a workload is launched on the spot instance, a checkpoint
coordinator, Spot-on, is launched simultaneously. Running the
coordinator does not incur additional monetary cost for the user, as
it is essentially a script running in parallel with metaSPAdes. The
coordinator is responsible for checkpointing and restoration: it
schedules periodic checkpoints and monitors VM eviction events using
APIs provided by the cloud. Upon detecting an eviction event, the
coordinator creates a “termination checkpoint” in addition to the
periodic checkpoints. Unlike periodic checkpoints, termination
checkpoints are opportunistic, because they may fail to complete
within the short eviction notification (e.g.
arXiv:2210.02589v1 [cs.DC] 5 Oct 2022
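The coordinator behaviour described in Section II can be sketched as a simple polling loop. This is a hedged illustration under stated assumptions and not the authors' implementation: the function run_coordinator, its parameters, and the three injected callables (job_done, eviction_pending, take_checkpoint) are hypothetical names, and the default intervals are arbitrary placeholders. On Azure, an eviction probe of this kind would plausibly query the instance metadata Scheduled Events endpoint and look for a Preempt event, but that call is deliberately left outside the sketch.

```python
import time


def run_coordinator(job_done, eviction_pending, take_checkpoint,
                    checkpoint_interval=1800.0, poll_interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until the job finishes or an eviction notice arrives.

    job_done():          True once the protected workload has exited.
    eviction_pending():  True once the cloud has announced an eviction.
    take_checkpoint(k):  creates a checkpoint of kind "periodic" or
                         "termination".
    All three callables are assumptions supplied by the caller.
    Returns "finished" or "evicted".
    """
    last_checkpoint = clock()
    while not job_done():
        if eviction_pending():
            try:
                # Opportunistic: the notice may be too short to finish.
                take_checkpoint("termination")
            except Exception:
                # Fall back to the most recent periodic checkpoint.
                pass
            return "evicted"
        if clock() - last_checkpoint >= checkpoint_interval:
            take_checkpoint("periodic")
            last_checkpoint = clock()
        sleep(poll_interval)
    return "finished"
```

Injecting the clock and sleep functions keeps the loop testable without waiting in real time. The eviction check comes before the periodic check on every iteration, so a pending eviction is never delayed by a periodic checkpoint that is about to start.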