
Spot-on: A Checkpointing Framework for
Fault-Tolerant Long-running Workloads on Cloud
Spot Instances
Ashley Tung†, Haiyan Wang†, Yue Li†, Zhong Wang∗, and Jingchao Sun†
†MemVerge Inc., Milpitas, CA
∗Department of Energy Joint Genome Institute, Berkeley, CA
yue.li@memverge.com
Abstract—Spot instances offer a cost-effective solution for ap-
plications running in the cloud computing environment. However,
it is challenging to run long-running jobs on spot instances
because they are subject to unpredictable evictions. Here, we
present Spot-on, a generic software framework that supports
fault-tolerant long-running workloads on spot instances through
checkpoint and restart. Spot-on leverages existing checkpointing
packages and is compatible with the major cloud vendors. Using a
genomics application as a test case, we demonstrated that Spot-on
supports both application-specific and transparent checkpointing
methods. Compared to running applications on on-demand
instances, Spot-on allows these workloads to complete at a
significantly lower computing cost. Compared to running
applications using application-specific checkpoint mechanisms,
transparent checkpoint-protected applications reduce runtime by
up to 40%, leading to further cost savings of up to 86%.
I. INTRODUCTION
Major cloud vendors offer “spot virtual machine (VM)
instances” that utilize spare computing resources at steep
discounts [1] [2] [3]. However, a spot instance can be reclaimed
during a resource shortage, with notice given only seconds or
minutes before the reclamation. Upon reclamation, all workloads
running on the instance are terminated, and the instance
is destroyed. This unpredictability makes it challenging
to run long-running workloads on spot instances without
checkpoints. This model is not unlike Amazon EC2’s spot market
used in Proteus [14] and Tributary [15]. What sets Azure
spot instances apart is that there is no need to bid for
resources. Rather, the user chooses a VM size and simply
has the option to run it as a spot instance.
Checkpoint solutions developed in high-performance com-
puting systems can be adapted for the cloud environment [5]
[8] [6]. Both application-specific and transparent checkpoint-
ing technologies may be leveraged so that checkpoints can be
made on one spot instance and moved to restart on another
when the previous instance is reclaimed. However, implementing
a practical, user-friendly solution requires careful
integration with the cloud platforms and schedulers to properly
schedule, store, transfer, and restart checkpoints. In this
work, we implemented a practical framework called “Spot-on”
by integrating with the major cloud vendor’s spot instance
scheduler, and used it to evaluate the impact of checkpointing
mechanisms on the running time and cost of long-running
workloads. As a case study, we used a long-running metagenome
assembly workload (metaSPAdes [9]) to compare the checkpointing methods on
Azure on-demand and spot instances. We found that both
checkpointing methods enable fault-tolerant metaSPAdes
workloads on spot instances, thereby reducing cost. Compared to
using application-specific checkpointing mechanisms on spot
instances, metaSPAdes protected by transparent checkpointing
takes less time to finish, which leads to further cost reductions.
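The checkpoint-and-restart idea described above can be illustrated with a minimal Python sketch. It is not the Spot-on implementation: the class name, the callback parameters, the polling period, and the simulated clock are all hypothetical, and the eviction check is a stand-in for whatever eviction-notice API a cloud vendor exposes. The sketch only shows the control flow of taking checkpoints at a fixed interval and attempting one last checkpoint when an eviction notice appears; restarting from the latest checkpoint on a replacement instance is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CheckpointCoordinator:
    """Hypothetical coordinator: periodic checkpoints plus a best-effort final one."""
    take_checkpoint: Callable[[str], bool]  # returns True if the checkpoint completed
    eviction_pending: Callable[[], bool]    # stand-in for a cloud eviction-notice API
    interval_s: float                       # assumed period between checkpoints
    poll_s: float = 1.0                     # assumed polling period
    log: List[Tuple[float, str, bool]] = field(default_factory=list)

    def run(self, job_done, now, sleep) -> str:
        last = now()
        while not job_done():
            if self.eviction_pending():
                # Best effort: the instance may disappear before this completes.
                ok = self.take_checkpoint("termination")
                self.log.append((now(), "termination", ok))
                return "evicted"
            if now() - last >= self.interval_s:
                ok = self.take_checkpoint("periodic")
                self.log.append((now(), "periodic", ok))
                last = now()
            sleep(self.poll_s)
        return "finished"


class FakeClock:
    """Simulated time so the sketch runs instantly and deterministically."""
    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


def simulate(evict_at: float, job_length: float, interval_s: float):
    """Run the coordinator against a simulated job and eviction time."""
    clock = FakeClock()
    coord = CheckpointCoordinator(
        take_checkpoint=lambda kind: True,  # assume every checkpoint succeeds
        eviction_pending=lambda: clock.now() >= evict_at,
        interval_s=interval_s,
    )
    outcome = coord.run(
        job_done=lambda: clock.now() >= job_length,
        now=clock.now,
        sleep=clock.sleep,
    )
    return outcome, coord.log


if __name__ == "__main__":
    print(simulate(evict_at=25, job_length=100, interval_s=10))
```

In a real deployment the callbacks would wrap a checkpointing package and the vendor's eviction-notice endpoint, and a scheduler would restart the workload from the most recent successful checkpoint on a new instance.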
II. ARCHITECTURE AND DESIGN
Fig. 1. The Spot-on Checkpoint and Restart Workflow across spot instances.
The Spot-on checkpoint and restart workflow is
illustrated in Fig. 1. When a workload is launched on a
spot instance, a checkpoint coordinator, Spot-on, is launched
simultaneously. Running the coordinator does not incur
additional monetary cost for the user, as it is essentially a script
running in parallel to metaSPAdes. The coordinator is
responsible for checkpointing and restoration: it schedules
periodic checkpointing and monitors VM eviction events using
APIs provided by the cloud vendor. Upon detecting an eviction
event, the coordinator creates a “termination checkpoint” in
addition to the periodic checkpoints. Unlike periodic
checkpoints, termination checkpoints are opportunistic: they
may fail because of the short eviction notification (e.g.
arXiv:2210.02589v1 [cs.DC] 5 Oct 2022