国外技术干货:amazon_dynamo_sosp2007

VIP免费
2024-12-10 1 0 879.36KB 16 页 5.9玖币
侵权投诉
Dynamo: Amazon’s Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall
and Werner Vogels
Amazon.com
ABSTRACT
Reliability at massive scale is one of the biggest challenges we
face at Amazon.com, one of the largest e-commerce operations in
the world; even the slightest outage has significant financial
consequences and impacts customer trust. The Amazon.com
platform, which provides services for many web sites worldwide,
is implemented on top of an infrastructure of tens of thousands of
servers and network components located in many datacenters
around the world. At this scale, small and large components fail
continuously and the way persistent state is managed in the face
of these failures drives the reliability and scalability of the
software systems.
This paper presents the design and implementation of Dynamo, a
highly available key-value storage system that some of Amazon’s
core services use to provide an “always-on” experience. To
achieve this level of availability, Dynamo sacrifices consistency
under certain failure scenarios. It makes extensive use of object
versioning and application-assisted conflict resolution in a manner
that provides a novel interface for developers to use.
Categories and Subject Descriptors
D.4.2 [Operating Systems]: Storage Management; D.4.5
[Operating Systems]: Reliability; D.4.2 [Operating Systems]:
Performance;
General Terms
Algorithms, Management, Measurement, Performance, Design,
Reliability.
1. INTRODUCTION
Amazon runs a world-wide e-commerce platform that serves tens
of millions customers at peak times using tens of thousands of
servers located in many data centers around the world. There are
strict operational requirements on Amazon’s platform in terms of
performance, reliability and efficiency, and to support continuous
growth the platform needs to be highly scalable. Reliability is one
of the most important requirements because even the slightest
outage has significant financial consequences and impacts
customer trust. In addition, to support continuous growth, the
platform needs to be highly scalable.
One of the lessons our organization has learned from operating
Amazon’s platform is that the reliability and scalability of a
system is dependent on how its application state is managed.
Amazon uses a highly decentralized, loosely coupled, service
oriented architecture consisting of hundreds of services. In this
environment there is a particular need for storage technologies
that are always available. For example, customers should be able
to view and add items to their shopping cart even if disks are
failing, network routes are flapping, or data centers are being
destroyed by tornados. Therefore, the service responsible for
managing shopping carts requires that it can always write to and
read from its data store, and that its data needs to be available
across multiple data centers.
Dealing with failures in an infrastructure comprised of millions of
components is our standard mode of operation; there are always a
small but significant number of server and network components
that are failing at any given time. As such Amazon’s software
systems need to be constructed in a manner that treats failure
handling as the normal case without impacting availability or
performance.
To meet the reliability and scaling needs, Amazon has developed
a number of storage technologies, of which the Amazon Simple
Storage Service (also available outside of Amazon and known as
Amazon S3), is probably the best known. This paper presents the
design and implementation of Dynamo, another highly available
and scalable distributed data store built for Amazon’s platform.
Dynamo is used to manage the state of services that have very
high reliability requirements and need tight control over the
tradeoffs between availability, consistency, cost-effectiveness and
performance. Amazon’s platform has a very diverse set of
applications with different storage requirements. A select set of
applications requires a storage technology that is flexible enough
to let application designers configure their data store appropriately
based on these tradeoffs to achieve high availability and
guaranteed performance in the most cost effective manner.
There are many services on Amazon’s platform that only need
primary-key access to a data store. For many services, such as
those that provide best seller lists, shopping carts, customer
preferences, session management, sales rank, and product catalog,
the common pattern of using a relational database would lead to
inefficiencies and limit scale and availability. Dynamo provides a
simple primary-key only interface to meet the requirements of
these applications.
Dynamo uses a synthesis of well known techniques to achieve
scalability and availability: Data is partitioned and replicated
using consistent hashing [10], and consistency is facilitated by
object versioning [12]. The consistency among replicas during
updates is maintained by a quorum-like technique and a
decentralized replica synchronization protocol. Dynamo employs
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SOSP’07, October 14–17, 2007, Stevenson, Washington, USA.
Copyright 2007 ACM 978-1-59593-591-5/07/0010...$5.00.
195205
a gossip based distributed failure detection and membership
protocol. Dynamo is a completely decentralized system with
minimal need for manual administration. Storage nodes can be
added and removed from Dynamo without requiring any manual
partitioning or redistribution.
In the past year, Dynamo has been the underlying storage
technology for a number of the core services in Amazon’s e-
commerce platform. It was able to scale to extreme peak loads
efficiently without any downtime during the busy holiday
shopping season. For example, the service that maintains
shopping cart (Shopping Cart Service) served tens of millions
requests that resulted in well over 3 million checkouts in a single
day and the service that manages session state handled hundreds
of thousands of concurrently active sessions.
The main contribution of this work for the research community is
the evaluation of how different techniques can be combined to
provide a single highly-available system. It demonstrates that an
eventually-consistent storage system can be used in production
with demanding applications. It also provides insight into the
tuning of these techniques to meet the requirements of production
systems with very strict performance demands.
The paper is structured as follows. Section 2 presents the
background and Section 3 presents the related work. Section 4
presents the system design and Section 5 describes the
implementation. Section 6 details the experiences and insights
gained by running Dynamo in production and Section 7 concludes
the paper. There are a number of places in this paper where
additional information may have been appropriate but where
protecting Amazon’s business interests require us to reduce some
level of detail. For this reason, the intra- and inter-datacenter
latencies in section 6, the absolute request rates in section 6.2 and
outage lengths and workloads in section 6.3 are provided through
aggregate measures instead of absolute details.
2. BACKGROUND
Amazon’s e-commerce platform is composed of hundreds of
services that work in concert to deliver functionality ranging from
recommendations to order fulfillment to fraud detection. Each
service is exposed through a well defined interface and is
accessible over the network. These services are hosted in an
infrastructure that consists of tens of thousands of servers located
across many data centers world-wide. Some of these services are
stateless (i.e., services which aggregate responses from other
services) and some are stateful (i.e., a service that generates its
response by executing business logic on its state stored in
persistent store).
Traditionally production systems store their state in relational
databases. For many of the more common usage patterns of state
persistence, however, a relational database is a solution that is far
from ideal. Most of these services only store and retrieve data by
primary key and do not require the complex querying and
management functionality offered by an RDBMS. This excess
functionality requires expensive hardware and highly skilled
personnel for its operation, making it a very inefficient solution.
In addition, the available replication technologies are limited and
typically choose consistency over availability. Although many
advances have been made in the recent years, it is still not easy to
scale-out databases or use smart partitioning schemes for load
balancing.
This paper describes Dynamo, a highly available data storage
technology that addresses the needs of these important classes of
services. Dynamo has a simple key/value interface, is highly
available with a clearly defined consistency window, is efficient
in its resource usage, and has a simple scale out scheme to address
growth in data set size or request rates. Each service that uses
Dynamo runs its own Dynamo instances.
2.1 System Assumptions and Requirements
The storage system for this class of services has the following
requirements:
Query Model: simple read and write operations to a data item that
is uniquely identified by a key. State is stored as binary objects
(i.e., blobs) identified by unique keys. No operations span
multiple data items and there is no need for relational schema.
This requirement is based on the observation that a significant
portion of Amazon’s services can work with this simple query
model and do not need any relational schema. Dynamo targets
applications that need to store objects that are relatively small
(usually less than 1 MB).
ACID Properties: ACID (Atomicity, Consistency, Isolation,
Durability) is a set of properties that guarantee that database
transactions are processed reliably. In the context of databases, a
single logical operation on the data is called a transaction.
Experience at Amazon has shown that data stores that provide
ACID guarantees tend to have poor availability. This has been
widely acknowledged by both the industry and academia [5].
Dynamo targets applications that operate with weaker consistency
(the “C” in ACID) if this results in high availability. Dynamo
does not provide any isolation guarantees and permits only single
key updates.
Efficiency: The system needs to function on a commodity
hardware infrastructure. In Amazon’s platform, services have
stringent latency requirements which are in general measured at
the 99.9th percentile of the distribution. Given that state access
plays a crucial role in service operation the storage system must
be capable of meeting such stringent SLAs (see Section 2.2
below). Services must be able to configure Dynamo such that they
consistently achieve their latency and throughput requirements.
The tradeoffs are in performance, cost efficiency, availability, and
durability guarantees.
Other Assumptions: Dynamo is used only by Amazon’s internal
services. Its operation environment is assumed to be non-hostile
and there are no security related requirements such as
authentication and authorization. Moreover, since each service
uses its distinct instance of Dynamo, its initial design targets a
scale of up to hundreds of storage hosts. We will discuss the
scalability limitations of Dynamo and possible scalability related
extensions in later sections.
2.2 Service Level Agreements (SLA)
To guarantee that the application can deliver its functionality in a
bounded time, each and every dependency in the platform needs
to deliver its functionality with even tighter bounds. Clients and
services engage in a Service Level Agreement (SLA), a formally
negotiated contract where a client and a service agree on several
system-related characteristics, which most prominently include
the client’s expected request rate distribution for a particular API
and the expected service latency under those conditions. An
example of a simple SLA is a service guaranteeing that it will
196206
provide a response within 300ms for 99.9% of its requests for a
peak client load of 500 requests per second.
In Amazon’s decentralized service oriented infrastructure, SLAs
play an important role. For example a page request to one of the
e-commerce sites typically requires the rendering engine to
construct its response by sending requests to over 150 services.
These services often have multiple dependencies, which
frequently are other services, and as such it is not uncommon for
the call graph of an application to have more than one level. To
ensure that the page rendering engine can maintain a clear bound
on page delivery each service within the call chain must obey its
performance contract.
Figure 1 shows an abstract view of the architecture of Amazon’s
platform, where dynamic web content is generated by page
rendering components which in turn query many other services. A
service can use different data stores to manage its state and these
data stores are only accessible within its service boundaries. Some
services act as aggregators by using several other services to
produce a composite response. Typically, the aggregator services
are stateless, although they use extensive caching.
A common approach in the industry for forming a performance
oriented SLA is to describe it using average, median and expected
variance. At Amazon we have found that these metrics are not
good enough if the goal is to build a system where all customers
have a good experience, rather than just the majority. For
example if extensive personalization techniques are used then
customers with longer histories require more processing which
impacts performance at the high-end of the distribution. An SLA
stated in terms of mean or median response times will not address
the performance of this important customer segment. To address
this issue, at Amazon, SLAs are expressed and measured at the
99.9th percentile of the distribution. The choice for 99.9% over an
even higher percentile has been made based on a cost-benefit
analysis which demonstrated a significant increase in cost to
improve performance that much. Experiences with Amazon’s
production systems have shown that this approach provides a
better overall experience compared to those systems that meet
SLAs defined based on the mean or median.
In this paper there are many references to this 99.9th percentile of
distributions, which reflects Amazon engineers’ relentless focus
on performance from the perspective of the customers’
experience. Many papers report on averages, so these are included
where it makes sense for comparison purposes. Nevertheless,
Amazon’s engineering and optimization efforts are not focused on
averages. Several techniques, such as the load balanced selection
of write coordinators, are purely targeted at controlling
performance at the 99.9th percentile.
Storage systems often play an important role in establishing a
service’s SLA, especially if the business logic is relatively
lightweight, as is the case for many Amazon services. State
management then becomes the main component of a service’s
SLA. One of the main design considerations for Dynamo is to
give services control over their system properties, such as
durability and consistency, and to let services make their own
tradeoffs between functionality, performance and cost-
effectiveness.
2.3 Design Considerations
Data replication algorithms used in commercial systems
traditionally perform synchronous replica coordination in order to
provide a strongly consistent data access interface. To achieve this
level of consistency, these algorithms are forced to tradeoff the
availability of the data under certain failure scenarios. For
instance, rather than dealing with the uncertainty of the
correctness of an answer, the data is made unavailable until it is
absolutely certain that it is correct. From the very early replicated
database works, it is well known that when dealing with the
possibility of network failures, strong consistency and high data
availability cannot be achieved simultaneously [2, 11]. As such
systems and applications need to be aware which properties can
be achieved under which conditions.
For systems prone to server and network failures, availability can
be increased by using optimistic replication techniques, where
changes are allowed to propagate to replicas in the background,
and concurrent, disconnected work is tolerated. The challenge
with this approach is that it can lead to conflicting changes which
must be detected and resolved. This process of conflict resolution
introduces two problems: when to resolve them and who resolves
them. Dynamo is designed to be an eventually consistent data
store; that is all updates reach all replicas eventually.
An important design consideration is to decide when to perform
the process of resolving update conflicts, i.e., whether conflicts
should be resolved during reads or writes. Many traditional data
stores execute conflict resolution during writes and keep the read
complexity simple [7]. In such systems, writes may be rejected if
the data store cannot reach all (or a majority of) the replicas at a
given time. On the other hand, Dynamo targets the design space
of an “always writeable” data store (i.e., a data store that is highly
available for writes). For a number of Amazon services, rejecting
customer updates could result in a poor customer experience. For
instance, the shopping cart service must allow customers to add
and remove items from their shopping cart even amidst network
and server failures. This requirement forces us to push the
complexity of conflict resolution to the reads in order to ensure
that writes are never rejected.
Figure 1: Service-oriented architecture of Amazon’s
platform
197207
The next design choice is who performs the process of conflict
resolution. This can be done by the data store or the application. If
conflict resolution is done by the data store, its choices are rather
limited. In such cases, the data store can only use simple policies,
such as “last write wins” [22], to resolve conflicting updates. On
the other hand, since the application is aware of the data schema it
can decide on the conflict resolution method that is best suited for
its client’s experience. For instance, the application that maintains
customer shopping carts can choose to “merge” the conflicting
versions and return a single unified shopping cart. Despite this
flexibility, some application developers may not want to write
their own conflict resolution mechanisms and choose to push it
down to the data store, which in turn chooses a simple policy such
as “last write wins”.
Other key principles embraced in the design are:
Incremental scalability: Dynamo should be able to scale out one
storage host (henceforth, referred to as “node”) at a time, with
minimal impact on both operators of the system and the system
itself.
Symmetry: Every node in Dynamo should have the same set of
responsibilities as its peers; there should be no distinguished node
or nodes that take special roles or extra set of responsibilities. In
our experience, symmetry simplifies the process of system
provisioning and maintenance.
Decentralization: An extension of symmetry, the design should
favor decentralized peer-to-peer techniques over centralized
control. In the past, centralized control has resulted in outages and
the goal is to avoid it as much as possible. This leads to a simpler,
more scalable, and more available system.
Heterogeneity: The system needs to be able to exploit
heterogeneity in the infrastructure it runs on. e.g. the work
distribution must be proportional to the capabilities of the
individual servers. This is essential in adding new nodes with
higher capacity without having to upgrade all hosts at once.
3. RELATED WORK
3.1 Peer to Peer Systems
There are several peer-to-peer (P2P) systems that have looked at
the problem of data storage and distribution. The first generation
of P2P systems, such as Freenet and Gnutella1, were
predominantly used as file sharing systems. These were examples
of unstructured P2P networks where the overlay links between
peers were established arbitrarily. In these networks, a search
query is usually flooded through the network to find as many
peers as possible that share the data. P2P systems evolved to the
next generation into what is widely known as structured P2P
networks. These networks employ a globally consistent protocol
to ensure that any node can efficiently route a search query to
some peer that has the desired data. Systems like Pastry [16] and
Chord [20] use routing mechanisms to ensure that queries can be
answered within a bounded number of hops. To reduce the
additional latency introduced by multi-hop routing, some P2P
systems (e.g., [14]) employ O(1) routing where each peer
maintains enough routing information locally so that it can route
requests (to access a data item) to the appropriate peer within a
constant number of hops.
Various storage systems, such as Oceanstore [9] and PAST [17]
were built on top of these routing overlays. Oceanstore provides a
global, transactional, persistent storage service that supports
serialized updates on widely replicated data. To allow for
concurrent updates while avoiding many of the problems inherent
with wide-area locking, it uses an update model based on conflict
resolution. Conflict resolution was introduced in [21] to reduce
the number of transaction aborts. Oceanstore resolves conflicts by
processing a series of updates, choosing a total order among them,
and then applying them atomically in that order. It is built for an
environment where the data is replicated on an untrusted
infrastructure. By comparison, PAST provides a simple
abstraction layer on top of Pastry for persistent and immutable
objects. It assumes that the application can build the necessary
storage semantics (such as mutable files) on top of it.
3.2 Distributed File Systems and Databases
Distributing data for performance, availability and durability has
been widely studied in the file system and database systems
community. Compared to P2P storage systems that only support
flat namespaces, distributed file systems typically support
hierarchical namespaces. Systems like Ficus [15] and Coda [19]
replicate files for high availability at the expense of consistency.
Update conflicts are typically managed using specialized conflict
resolution procedures. The Farsite system [1] is a distributed file
system that does not use any centralized server like NFS. Farsite
achieves high availability and scalability using replication. The
Google File System [6] is another distributed file system built for
hosting the state of Google’s internal applications. GFS uses a
simple design with a single master server for hosting the entire
metadata and where the data is split into chunks and stored in
chunkservers. Bayou is a distributed relational database system
that allows disconnected operations and provides eventual data
consistency [21].
Among these systems, Bayou, Coda and Ficus allow disconnected
operations and are resilient to issues such as network partitions
and outages. These systems differ on their conflict resolution
procedures. For instance, Coda and Ficus perform system level
conflict resolution and Bayou allows application level resolution.
All of them, however, guarantee eventual consistency. Similar to
these systems, Dynamo allows read and write operations to
continue even during network partitions and resolves updated
conflicts using different conflict resolution mechanisms.
Distributed block storage systems like FAB [18] split large size
objects into smaller blocks and stores each block in a highly
available manner. In comparison to these systems, a key-value
store is more suitable in this case because: (a) it is intended to
store relatively small objects (size < 1M) and (b) key-value stores
are easier to configure on a per-application basis. Antiquity is a
wide-area distributed storage system designed to handle multiple
server failures [23]. It uses a secure log to preserve data integrity,
replicates each log on multiple servers for durability, and uses
Byzantine fault tolerance protocols to ensure data consistency. In
contrast to Antiquity, Dynamo does not focus on the problem of
data integrity and security and is built for a trusted environment.
Bigtable is a distributed storage system for managing structured
data. It maintains a sparse, multi-dimensional sorted map and
allows applications to access their data using multiple attributes
[2]. Compared to Bigtable, Dynamo targets applications that
require only key/value access with primary focus on high
availability where updates are not rejected even in the wake of
network partitions or server failures.
1 http://freenetproject.org/, http://www.gnutella.org
198208
摘要:

Dynamo:Amazon’sHighlyAvailableKey-valueStoreGiuseppeDeCandia,DenizHastorun,MadanJampani,GunavardhanKakulapati,AvinashLakshman,AlexPilchin,SwaminathanSivasubramanian,PeterVosshallandWernerVogelsAmazon.comABSTRACTReliabilityatmassivescaleisoneofthebiggestchallengeswefaceatAmazon.com,oneofthelargeste-c...

展开>> 收起<<
国外技术干货:amazon_dynamo_sosp2007.pdf

共16页,预览4页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:计算机 价格:5.9玖币 属性:16 页 大小:879.36KB 格式:PDF 时间:2024-12-10

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 16
客服
关注