国外技术干货：amazon_dynamo_sosp2007

VIP免费

2024-12-10 1 0 879.36KB 16 页 5.9玖币

侵权投诉

Dynamo: Amazon’s Highly Available Key-value Store

Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,

Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall

and Werner Vogels

Amazon.com

ABSTRACT

Reliability at massive scale is one of the biggest challenges we

face at Amazon.com, one of the largest e-commerce operations in

the world; even the slightest outage has significant financial

consequences and impacts customer trust. The Amazon.com

platform, which provides services for many web sites worldwide,

is implemented on top of an infrastructure of tens of thousands of

servers and network components located in many datacenters

around the world. At this scale, small and large components fail

continuously and the way persistent state is managed in the face

of these failures drives the reliability and scalability of the

software systems.

This paper presents the design and implementation of Dynamo, a

highly available key-value storage system that some of Amazon’s

core services use to provide an “always-on” experience. To

achieve this level of availability, Dynamo sacrifices consistency

under certain failure scenarios. It makes extensive use of object

versioning and application-assisted conflict resolution in a manner

that provides a novel interface for developers to use.

Categories and Subject Descriptors

D.4.2 [Operating Systems]: Storage Management; D.4.5

[Operating Systems]: Reliability; D.4.2 [Operating Systems]:

Performance;

General Terms

Algorithms, Management, Measurement, Performance, Design,

Reliability.

1. INTRODUCTION

Amazon runs a world-wide e-commerce platform that serves tens

of millions customers at peak times using tens of thousands of

servers located in many data centers around the world. There are

strict operational requirements on Amazon’s platform in terms of

performance, reliability and efficiency, and to support continuous

growth the platform needs to be highly scalable. Reliability is one

of the most important requirements because even the slightest

outage has significant financial consequences and impacts

customer trust. In addition, to support continuous growth, the

platform needs to be highly scalable.

One of the lessons our organization has learned from operating

Amazon’s platform is that the reliability and scalability of a

system is dependent on how its application state is managed.

Amazon uses a highly decentralized, loosely coupled, service

oriented architecture consisting of hundreds of services. In this

environment there is a particular need for storage technologies

that are always available. For example, customers should be able

to view and add items to their shopping cart even if disks are

failing, network routes are flapping, or data centers are being

destroyed by tornados. Therefore, the service responsible for

managing shopping carts requires that it can always write to and

read from its data store, and that its data needs to be available

across multiple data centers.

Dealing with failures in an infrastructure comprised of millions of

components is our standard mode of operation; there are always a

small but significant number of server and network components

that are failing at any given time. As such Amazon’s software

systems need to be constructed in a manner that treats failure

handling as the normal case without impacting availability or

performance.

To meet the reliability and scaling needs, Amazon has developed

a number of storage technologies, of which the Amazon Simple

Storage Service (also available outside of Amazon and known as

Amazon S3), is probably the best known. This paper presents the

design and implementation of Dynamo, another highly available

and scalable distributed data store built for Amazon’s platform.

Dynamo is used to manage the state of services that have very

high reliability requirements and need tight control over the

tradeoffs between availability, consistency, cost-effectiveness and

performance. Amazon’s platform has a very diverse set of

applications with different storage requirements. A select set of

applications requires a storage technology that is flexible enough

to let application designers configure their data store appropriately

based on these tradeoffs to achieve high availability and

guaranteed performance in the most cost effective manner.

There are many services on Amazon’s platform that only need

primary-key access to a data store. For many services, such as

those that provide best seller lists, shopping carts, customer

preferences, session management, sales rank, and product catalog,

the common pattern of using a relational database would lead to

inefficiencies and limit scale and availability. Dynamo provides a

simple primary-key only interface to meet the requirements of

these applications.

Dynamo uses a synthesis of well known techniques to achieve

scalability and availability: Data is partitioned and replicated

using consistent hashing [10], and consistency is facilitated by

object versioning [12]. The consistency among replicas during

updates is maintained by a quorum-like technique and a

decentralized replica synchronization protocol. Dynamo employs

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that

copies bear this notice and the full citation on the first page. To copy

otherwise, or republish, to post on servers or to redistribute to lists,

requires prior specific permission and/or a fee.

SOSP’07, October 14–17, 2007, Stevenson, Washington, USA.

195205

a gossip based distributed failure detection and membership

protocol. Dynamo is a completely decentralized system with

minimal need for manual administration. Storage nodes can be

added and removed from Dynamo without requiring any manual

partitioning or redistribution.

In the past year, Dynamo has been the underlying storage

technology for a number of the core services in Amazon’s e-

commerce platform. It was able to scale to extreme peak loads

efficiently without any downtime during the busy holiday

shopping season. For example, the service that maintains

shopping cart (Shopping Cart Service) served tens of millions

requests that resulted in well over 3 million checkouts in a single

day and the service that manages session state handled hundreds

of thousands of concurrently active sessions.

The main contribution of this work for the research community is

the evaluation of how different techniques can be combined to

provide a single highly-available system. It demonstrates that an

eventually-consistent storage system can be used in production

with demanding applications. It also provides insight into the

tuning of these techniques to meet the requirements of production

systems with very strict performance demands.

The paper is structured as follows. Section 2 presents the

background and Section 3 presents the related work. Section 4

presents the system design and Section 5 describes the

implementation. Section 6 details the experiences and insights

gained by running Dynamo in production and Section 7 concludes

the paper. There are a number of places in this paper where

additional information may have been appropriate but where

protecting Amazon’s business interests require us to reduce some

level of detail. For this reason, the intra- and inter-datacenter

latencies in section 6, the absolute request rates in section 6.2 and

outage lengths and workloads in section 6.3 are provided through

aggregate measures instead of absolute details.

2. BACKGROUND

Amazon’s e-commerce platform is composed of hundreds of

services that work in concert to deliver functionality ranging from

recommendations to order fulfillment to fraud detection. Each

service is exposed through a well defined interface and is

accessible over the network. These services are hosted in an

infrastructure that consists of tens of thousands of servers located

across many data centers world-wide. Some of these services are

stateless (i.e., services which aggregate responses from other

services) and some are stateful (i.e., a service that generates its

response by executing business logic on its state stored in

persistent store).

Traditionally production systems store their state in relational

databases. For many of the more common usage patterns of state

persistence, however, a relational database is a solution that is far

from ideal. Most of these services only store and retrieve data by

primary key and do not require the complex querying and

management functionality offered by an RDBMS. This excess

functionality requires expensive hardware and highly skilled

personnel for its operation, making it a very inefficient solution.

In addition, the available replication technologies are limited and

typically choose consistency over availability. Although many

advances have been made in the recent years, it is still not easy to

scale-out databases or use smart partitioning schemes for load

balancing.

This paper describes Dynamo, a highly available data storage

technology that addresses the needs of these important classes of

services. Dynamo has a simple key/value interface, is highly

available with a clearly defined consistency window, is efficient

in its resource usage, and has a simple scale out scheme to address

growth in data set size or request rates. Each service that uses

Dynamo runs its own Dynamo instances.

2.1 System Assumptions and Requirements

The storage system for this class of services has the following

requirements:

Query Model: simple read and write operations to a data item that

is uniquely identified by a key. State is stored as binary objects

(i.e., blobs) identified by unique keys. No operations span

multiple data items and there is no need for relational schema.

This requirement is based on the observation that a significant

portion of Amazon’s services can work with this simple query

model and do not need any relational schema. Dynamo targets

applications that need to store objects that are relatively small

(usually less than 1 MB).

ACID Properties: ACID (Atomicity, Consistency, Isolation,

Durability) is a set of properties that guarantee that database

transactions are processed reliably. In the context of databases, a

single logical operation on the data is called a transaction.

Experience at Amazon has shown that data stores that provide

ACID guarantees tend to have poor availability. This has been

widely acknowledged by both the industry and academia [5].

Dynamo targets applications that operate with weaker consistency

(the “C” in ACID) if this results in high availability. Dynamo

does not provide any isolation guarantees and permits only single

key updates.

Efficiency: The system needs to function on a commodity

hardware infrastructure. In Amazon’s platform, services have

stringent latency requirements which are in general measured at

the 99.9th percentile of the distribution. Given that state access

plays a crucial role in service operation the storage system must

be capable of meeting such stringent SLAs (see Section 2.2

below). Services must be able to configure Dynamo such that they

consistently achieve their latency and throughput requirements.

The tradeoffs are in performance, cost efficiency, availability, and

durability guarantees.

Other Assumptions: Dynamo is used only by Amazon’s internal

services. Its operation environment is assumed to be non-hostile

and there are no security related requirements such as

authentication and authorization. Moreover, since each service

uses its distinct instance of Dynamo, its initial design targets a

scale of up to hundreds of storage hosts. We will discuss the

scalability limitations of Dynamo and possible scalability related

extensions in later sections.

2.2 Service Level Agreements (SLA)

To guarantee that the application can deliver its functionality in a

bounded time, each and every dependency in the platform needs

to deliver its functionality with even tighter bounds. Clients and

services engage in a Service Level Agreement (SLA), a formally

negotiated contract where a client and a service agree on several

system-related characteristics, which most prominently include

the client’s expected request rate distribution for a particular API

and the expected service latency under those conditions. An

example of a simple SLA is a service guaranteeing that it will

196206

provide a response within 300ms for 99.9% of its requests for a

peak client load of 500 requests per second.

In Amazon’s decentralized service oriented infrastructure, SLAs

play an important role. For example a page request to one of the

e-commerce sites typically requires the rendering engine to

construct its response by sending requests to over 150 services.

These services often have multiple dependencies, which

frequently are other services, and as such it is not uncommon for

the call graph of an application to have more than one level. To

ensure that the page rendering engine can maintain a clear bound

on page delivery each service within the call chain must obey its

performance contract.

Figure 1 shows an abstract view of the architecture of Amazon’s

platform, where dynamic web content is generated by page

rendering components which in turn query many other services. A

service can use different data stores to manage its state and these

data stores are only accessible within its service boundaries. Some

services act as aggregators by using several other services to

produce a composite response. Typically, the aggregator services

are stateless, although they use extensive caching.

A common approach in the industry for forming a performance

oriented SLA is to describe it using average, median and expected

variance. At Amazon we have found that these metrics are not

good enough if the goal is to build a system where all customers

have a good experience, rather than just the majority. For

example if extensive personalization techniques are used then

customers with longer histories require more processing which

impacts performance at the high-end of the distribution. An SLA

stated in terms of mean or median response times will not address

the performance of this important customer segment. To address

this issue, at Amazon, SLAs are expressed and measured at the

99.9th percentile of the distribution. The choice for 99.9% over an

even higher percentile has been made based on a cost-benefit

analysis which demonstrated a significant increase in cost to

improve performance that much. Experiences with Amazon’s

production systems have shown that this approach provides a

better overall experience compared to those systems that meet

SLAs defined based on the mean or median.

In this paper there are many references to this 99.9th percentile of

distributions, which reflects Amazon engineers’ relentless focus

on performance from the perspective of the customers’

experience. Many papers report on averages, so these are included

where it makes sense for comparison purposes. Nevertheless,

Amazon’s engineering and optimization efforts are not focused on

averages. Several techniques, such as the load balanced selection

of write coordinators, are purely targeted at controlling

performance at the 99.9th percentile.

Storage systems often play an important role in establishing a

service’s SLA, especially if the business logic is relatively

lightweight, as is the case for many Amazon services. State

management then becomes the main component of a service’s

SLA. One of the main design considerations for Dynamo is to

give services control over their system properties, such as

durability and consistency, and to let services make their own

tradeoffs between functionality, performance and cost-

effectiveness.

2.3 Design Considerations

Data replication algorithms used in commercial systems

traditionally perform synchronous replica coordination in order to

provide a strongly consistent data access interface. To achieve this

level of consistency, these algorithms are forced to tradeoff the

availability of the data under certain failure scenarios. For

instance, rather than dealing with the uncertainty of the

correctness of an answer, the data is made unavailable until it is

absolutely certain that it is correct. From the very early replicated

database works, it is well known that when dealing with the

possibility of network failures, strong consistency and high data

availability cannot be achieved simultaneously [2, 11]. As such

systems and applications need to be aware which properties can

be achieved under which conditions.

For systems prone to server and network failures, availability can

be increased by using optimistic replication techniques, where

changes are allowed to propagate to replicas in the background,

and concurrent, disconnected work is tolerated. The challenge

with this approach is that it can lead to conflicting changes which

must be detected and resolved. This process of conflict resolution

introduces two problems: when to resolve them and who resolves

them. Dynamo is designed to be an eventually consistent data

store; that is all updates reach all replicas eventually.

An important design consideration is to decide when to perform

the process of resolving update conflicts, i.e., whether conflicts

should be resolved during reads or writes. Many traditional data

stores execute conflict resolution during writes and keep the read

complexity simple [7]. In such systems, writes may be rejected if

the data store cannot reach all (or a majority of) the replicas at a

given time. On the other hand, Dynamo targets the design space

of an “always writeable” data store (i.e., a data store that is highly

available for writes). For a number of Amazon services, rejecting

customer updates could result in a poor customer experience. For

instance, the shopping cart service must allow customers to add

and remove items from their shopping cart even amidst network

and server failures. This requirement forces us to push the

complexity of conflict resolution to the reads in order to ensure

that writes are never rejected.

Figure 1: Service-oriented architecture of Amazon’s

platform

197207

The next design choice is who performs the process of conflict

resolution. This can be done by the data store or the application. If

conflict resolution is done by the data store, its choices are rather

limited. In such cases, the data store can only use simple policies,

such as “last write wins” [22], to resolve conflicting updates. On

the other hand, since the application is aware of the data schema it

can decide on the conflict resolution method that is best suited for

its client’s experience. For instance, the application that maintains

customer shopping carts can choose to “merge” the conflicting

versions and return a single unified shopping cart. Despite this

flexibility, some application developers may not want to write

their own conflict resolution mechanisms and choose to push it

down to the data store, which in turn chooses a simple policy such

as “last write wins”.

Other key principles embraced in the design are:

Incremental scalability: Dynamo should be able to scale out one

storage host (henceforth, referred to as “node”) at a time, with

minimal impact on both operators of the system and the system

itself.

Symmetry: Every node in Dynamo should have the same set of

responsibilities as its peers; there should be no distinguished node

or nodes that take special roles or extra set of responsibilities. In

our experience, symmetry simplifies the process of system

provisioning and maintenance.

Decentralization: An extension of symmetry, the design should

favor decentralized peer-to-peer techniques over centralized

control. In the past, centralized control has resulted in outages and

the goal is to avoid it as much as possible. This leads to a simpler,

more scalable, and more available system.

Heterogeneity: The system needs to be able to exploit

heterogeneity in the infrastructure it runs on. e.g. the work

distribution must be proportional to the capabilities of the

individual servers. This is essential in adding new nodes with

higher capacity without having to upgrade all hosts at once.

3. RELATED WORK

3.1 Peer to Peer Systems

There are several peer-to-peer (P2P) systems that have looked at

the problem of data storage and distribution. The first generation

of P2P systems, such as Freenet and Gnutella1, were

predominantly used as file sharing systems. These were examples

of unstructured P2P networks where the overlay links between

peers were established arbitrarily. In these networks, a search

query is usually flooded through the network to find as many

peers as possible that share the data. P2P systems evolved to the

next generation into what is widely known as structured P2P

networks. These networks employ a globally consistent protocol

to ensure that any node can efficiently route a search query to

some peer that has the desired data. Systems like Pastry [16] and

Chord [20] use routing mechanisms to ensure that queries can be

answered within a bounded number of hops. To reduce the

additional latency introduced by multi-hop routing, some P2P

systems (e.g., [14]) employ O(1) routing where each peer

maintains enough routing information locally so that it can route

requests (to access a data item) to the appropriate peer within a

constant number of hops.

Various storage systems, such as Oceanstore [9] and PAST [17]

were built on top of these routing overlays. Oceanstore provides a

global, transactional, persistent storage service that supports

serialized updates on widely replicated data. To allow for

concurrent updates while avoiding many of the problems inherent

with wide-area locking, it uses an update model based on conflict

resolution. Conflict resolution was introduced in [21] to reduce

the number of transaction aborts. Oceanstore resolves conflicts by

processing a series of updates, choosing a total order among them,

and then applying them atomically in that order. It is built for an

environment where the data is replicated on an untrusted

infrastructure. By comparison, PAST provides a simple

abstraction layer on top of Pastry for persistent and immutable

objects. It assumes that the application can build the necessary

storage semantics (such as mutable files) on top of it.

3.2 Distributed File Systems and Databases

Distributing data for performance, availability and durability has

been widely studied in the file system and database systems

community. Compared to P2P storage systems that only support

flat namespaces, distributed file systems typically support

hierarchical namespaces. Systems like Ficus [15] and Coda [19]

replicate files for high availability at the expense of consistency.

Update conflicts are typically managed using specialized conflict

resolution procedures. The Farsite system [1] is a distributed file

system that does not use any centralized server like NFS. Farsite

achieves high availability and scalability using replication. The

Google File System [6] is another distributed file system built for

hosting the state of Google’s internal applications. GFS uses a

simple design with a single master server for hosting the entire

metadata and where the data is split into chunks and stored in

chunkservers. Bayou is a distributed relational database system

that allows disconnected operations and provides eventual data

consistency [21].

Among these systems, Bayou, Coda and Ficus allow disconnected

operations and are resilient to issues such as network partitions

and outages. These systems differ on their conflict resolution

procedures. For instance, Coda and Ficus perform system level

conflict resolution and Bayou allows application level resolution.

All of them, however, guarantee eventual consistency. Similar to

these systems, Dynamo allows read and write operations to

continue even during network partitions and resolves updated

conflicts using different conflict resolution mechanisms.

Distributed block storage systems like FAB [18] split large size

objects into smaller blocks and stores each block in a highly

available manner. In comparison to these systems, a key-value

store is more suitable in this case because: (a) it is intended to

store relatively small objects (size < 1M) and (b) key-value stores

are easier to configure on a per-application basis. Antiquity is a

wide-area distributed storage system designed to handle multiple

server failures [23]. It uses a secure log to preserve data integrity,

replicates each log on multiple servers for durability, and uses

Byzantine fault tolerance protocols to ensure data consistency. In

contrast to Antiquity, Dynamo does not focus on the problem of

data integrity and security and is built for a trusted environment.

Bigtable is a distributed storage system for managing structured

data. It maintains a sparse, multi-dimensional sorted map and

allows applications to access their data using multiple attributes

[2]. Compared to Bigtable, Dynamo targets applications that

require only key/value access with primary focus on high

availability where updates are not rejected even in the wake of

network partitions or server failures.

1 http://freenetproject.org/, http://www.gnutella.org

198208

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

5.9 玖币 0人已下载

立即下载 VIP免费下载

摘要：

Dynamo:Amazon’sHighlyAvailableKey-valueStoreGiuseppeDeCandia,DenizHastorun,MadanJampani,GunavardhanKakulapati,AvinashLakshman,AlexPilchin,SwaminathanSivasubramanian,PeterVosshallandWernerVogelsAmazon.comABSTRACTReliabilityatmassivescaleisoneofthebiggestchallengeswefaceatAmazon.com,oneofthelargeste-c...

展开>> 收起<<

国外技术干货：amazon_dynamo_sosp2007.pdf

共16页,预览4页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

国外技术干货：amazon_dynamo_sosp2007

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: