Understanding Inconsistency in Azure Cosmos DB with TLA A. Finn Hackett

2025-05-06 1 0 402.77KB 11 页 10玖币

侵权投诉

Understanding Inconsistency in Azure Cosmos DB

with TLA+

A. Finn Hackett

University of British Columbia

Vancouver, Canada

fhackett@cs.ubc.ca

Joshua Rowe

Microsoft

Redmond, USA

joshua.rowe@microsoft.com

Markus Alexander Kuppe

Microsoft Research

Redmond, USA

makuppe@microsoft.com

Abstract—Beyond implementation correctness of a distributed

system, it is equally important to understand exactly what users

should expect to see from that system. Even if the system itself

works as designed, insufﬁcient understanding of its user-visible

semantics can cause bugs in its dependencies. By focusing a

formal speciﬁcation effort on precisely deﬁning the expected user-

facing behaviors of the Azure Cosmos DB service at Microsoft,

we were able to write a formal speciﬁcation of the database that

was signiﬁcantly smaller and conceptually simpler than any other

speciﬁcation of Cosmos DB, while representing a wider range

of valid user-observable behaviors than existing more detailed

speciﬁcations. Many of the additional behaviors we documented

were previously poorly understood outside of the Cosmos DB

development team, even informally, leading to data consistency

errors in Microsoft products that depend on it. Using this model,

we were able to raise two key issues in Cosmos DB’s public-facing

documentation, which have since been addressed. We were also

able to offer a fundamental solution to a previous high-impact

outage within another Azure service that depends on Cosmos

DB.

Index Terms—cloud computing, formal methods, model check-

ing, documentation

I. INTRODUCTION

Consistency guarantees for distributed databases are noto-

riously hard to understand. Not only can distributed systems

inherently behave in unexpected and counter-intuitive ways

due to internal concurrency and failures, but they can also

lull their users into a false sense of functional correctness:

most of the time, users of a distributed database will witness

a much simpler and more consistent set of behaviors than

what is actually possible. Only timeouts, fail-overs, or other

rare events will expose the true set of behaviors a user

might witness [1]. Testing for these scenarios is difﬁcult at

best: reproducing them reliably requires controlling complex

concurrency factors, latency variations, and network behaviors.

Even just producing usable documentation for developers

is fundamentally challenging [2], [3], [4], and explaining

these subtle consistency issues via documentation comes as

an additional burden to distributed system developers and

technical writers alike. Formal methods have long been applied

to the design of distributed systems, including in industry [5],

[6], [7], [8], but these are years-long high-effort projects

that focus on implementation correctness, not explaining the

system to users. Rather than focus on this difﬁcult task, we

address a simpler and more fundamental question: ignoring

the implementation, what kind of behavior should a client be

able to witness while interacting with a service?

We use TLA+to answer this simpler question for Cosmos

DB, a planet-scale key-value store. In practice, Cosmos DB

offers a rich interface featuring multiple query APIs, and

has complex operational behaviors involving georeplication

and partitioning of data. As our focus is on data consistency

from a client perspective, we model only the core read and

write operations underlying the system’s semantics relating

to their 5 conﬁgurable consistency levels. We show that this

minimal client-focused speciﬁcation of a large-scale service

offers important design- and documentation-level insights,

while keeping buy-in cost low.

We document the 2 person-month development process

of our speciﬁcation, which consists of iterative prototyping

using the public documentation [9], feedback from author 2,

a Cosmos DB developer, and the speciﬁcation and model

checking of a collection of formal properties based on our

understanding. Aside from the speciﬁcation itself, we discuss

a pair of key issues it helped us discover within Cosmos DB’s

documentation, and how both have since been addressed. We

also use our speciﬁcation to explain the previously-unclear

root cause of a 28-day high-priority outage within Microsoft

Azure.

We describe the following results: (1) a concise (390

LOC) client-focused speciﬁcation of Cosmos DB, a large-

scale distributed system; (2) a pair of key documentation bugs

we found by developing our speciﬁcation — statements in

Cosmos DB’s public documentation [9] that have now been

corrected; and (3) using our speciﬁcation, a novel and concise

mechanized explanation of a high-severity Cosmos DB-related

outage within Azure that took 28 days to identify and mitigate.

Beyond our work so far, we expect our model to be

useful in future design work as Cosmos DB’s implementation

evolves, aided by its ability to precisely and abstractly state a

client’s expectations of system behavior. Services depending

on Cosmos DB may also beneﬁt from incorporating our work

into TLA+speciﬁcations of their own processes, in which case

our work may be used to prevent future outages similar to the

one we describe in this paper.

II. BACKGROUND

Our work uses the TLA+speciﬁcation language [10], which

can be used to describe state machines using set-theoretic

arXiv:2210.13661v1 [cs.SE] 24 Oct 2022

constructs and temporal logic. Models written in TLA+have

no direct correspondence to implementations, with users fo-

cusing instead on analyzing design decisions and verifying

model-level correctness properties. This philosophy allows

speciﬁcation writers to leave out irrelevant details and focus

on expressing a speciﬁcation’s core semantics as simply as

possible.

In addition to plain TLA+, model developers can also write

models in PlusCal [11], a high-level imperative language that

is more like contemporary programming languages.

It is possible to check model properties using the explicit-

state model checker TLC [12], [13], the symbolic model

checker Apalache [14], and the manual proof assistant

TLAPS [15]. In this work, we relied on the TLC model

checker to analyze our speciﬁcation.

As well as model checking temporal properties, it is also

possible to express and check reﬁnements [16] in TLA+. A

reﬁnement proves that one speciﬁcation implements another

– meaning that one speciﬁcation exhibits every behavior that

another speciﬁcation exhibits, given an appropriate translation

between the two speciﬁcations’ state spaces. We use this tech-

nique to show that our new speciﬁcation produces a superset

of the behaviors produced by existing TLA+speciﬁcations of

Cosmos DB.

III. A SIMPLE MODEL OF COSMOS DB

To fully illustrate our claim to simplicity, this section

describes our full formalization of Cosmos DB’s semantics

in a few pages, including most of the core TLA+deﬁni-

tions in-text. Our modeling process was based on iterative

discussion with author 2, a principal engineer working on the

Cosmos DB implementation. We followed existing user-facing

documentation, asked for feedback, learned more about the

realities of Cosmos DB’s design, and incorporated that new

knowledge into our speciﬁcation. We repeated this feedback

loop until we found no more corrections, when our model

began to predict counter-intuitive but possible behaviors of

the real system. See Subsection IV-D for in-depth analysis of

such behaviors. See https://github.com/tlaplus/azure-cosmos-

tla/tree/master/simple-model for the full TLA+deﬁnitions.

A. Consistency Levels

Cosmos DB offers 5 consistency levels that affect read

and write behavior. A system administrator must conﬁgure

all writes to follow a single consistency level per Cosmos DB

deployment. Read operations may either match the conﬁgured

write consistency level or weaken it according to the hierarchy

deﬁned in Figure 1. We discuss high-level prose descriptions

of these consistency levels, which we complement with precise

TLA+descriptions later on.

Strong consistency. Reads and writes are linearizable [17],

as long as the operation does not fail. See Section IV-D for

possible behaviors given failures.

Bounded staleness. Writes older than a given time bound

are durable and consistently readable, whereas writes younger

than the given bound are not. The time bound is deﬁned two

Bounded Staleness

Session

Consistent Prefix

Eventual

Strong

Fig. 1. Hierarchy of consistency levels in Cosmos DB, with strongest at the

top and weakest guarantees at the bottom.

ways: a bound in wall-clock time, and a maximum bound on

the number of eventually-consistent writes allowed at once.

If the bounds are in danger of being exceeded, additional

writes will be refused for replication to catch up. For modeling

simplicity, we ignore the wall-clock temporal aspect of this

mode’s semantics, and consider only operation count.

Session consistency. Reads and writes are tagged with

session tokens. Operations with the same session token are

linearizable relative to one another, but no guarantees are

provided across different session tokens. Session consistency

writes are not guaranteed durable, and tokens may be invali-

dated by data loss.

Consistent preﬁx. Reads are monotonic: a client may only

read newer values than it has already read. Section IV-A

describes why we ﬁnd this is ultimately equivalent to eventual

consistency.

Eventual consistency. This level offers no ordering guaran-

tees, but does provide a notion of eventual convergence over

an arbitrary period of time.

B. Data Deﬁnitions

Our model of Cosmos DB is deﬁned to have 4 state

variables, and allows them to evolve over time via some simple

actions. Each variable relates to a speciﬁc aspect of the system

being modeled.

log. The log is a sequence of writes, represented as key-

value pairs. For example, [key 7→ k1,value 7→ v1]pairs

key k1 with value v1. The sequence lists all writes that are

stored anywhere in Cosmos DB’s implementation, irrespective

of replication or durability.

readIndex. The readIndex marks either a position in

the log or 0. For any element of the log, if its index

is less than or equal to readIndex, then it is replicated

universally across the current Cosmos DB deployment. Rep-

resenting eventually-complete propagation of writes, the log

preﬁx deﬁned by readIndex behaves identically to a single

key-value store.

commitIndex. The “commit index” marks a position in

the log or 0. For any element of the log, if its in-

dex is less than or equal to commitIndex, then it is

replicated at a global majority of replicas, and is therefore

durable due to consensus. It follows from this deﬁnition that

readIndex ≤commitIndex must always hold.

epoch. The epoch is a monotonically increasing counter

of fail-overs. If epoch remains constant, fail-overs behavior

such as data loss may not be observed. If it increases, some

data loss may be observed at the point of increase.

The speciﬁcation has six constants: Keys and Values,

which are the sets of keys and values respectively. These

sets can be redeﬁned based on the use case — they can

be generalized to inﬁnite sets like “all strings”, or restricted

to a small ﬁnite set of constant values in order to simplify

model checking. NoValue is a constant indicating the ab-

sence of a value. VersionBound and StalenessBound

are natural numbers that affect when writes are allowed.

WriteConsistencyLevel represents the currently con-

ﬁgured consistency level for write operations, one of the 5

consistency levels.

We chose to base our speciﬁcation on a sequential log

because Cosmos DB, like any consensus-based system, de-

termines a total order in which clients should consider their

requests to have occurred. This is why many parts of our

speciﬁcation, including several state variables, identify writes

by log index.

Building on these deﬁnitions, we can express our ﬁrst two

fundamental actions.

IncreaseReadIndexAndOrCommitIndex ∆

∧commitIndex 0∈commitIndex . . Len(log)

∧readIndex 0∈readIndex . . commitIndex 0

∧UNCHANGED hlog,epochi

IncreaseReadIndexAndOrCommitIndex models the

concept of data replication, that is, readIndex

and/or commitIndex advancing. readIndex and

commitIndex may non-deterministically gain new

values between readIndex and commitIndex0, and

commitIndex and Len(log), respectively. Neither log

nor epoch may change. This ensures that both values may

only grow, that they never point beyond end of the log, and

that readIndex ≤commitIndex remains true.

TruncateLog ∆

∃i∈(commitIndex + 1) . . Len(log) :

∧log0=SubSeq(log,1,i−1)

∧epoch0=epoch + 1

∧UNCHANGED hreadIndex ,commitIndex i

TruncateLog models the concept of data loss: if there exists

any index isuch that commitIndex <i, then log may be

truncated non-deterministically such that its new length is i−

1. In-progress operations may watch for changes in epoch’s

value to detect and respond to failures, meaning epoch acts

as a failure detector.

Because these actions may happen non-deterministically,

any combination of replication and fail-over may occur at any

time, interleaved with other actions. A short sequence of such

actions can represent a complex series of implementation-level

possibilities.

C. Write Operations

In Cosmos DB, write operations are not atomic. They

may sometimes appear atomic under certain conﬁgurations1,

but their underlying structure needs to be broken down into

multiple steps.

As a consequence of writes’ multi-step nature, we need

to record the state of in-progress writes. For portability, we

don’t require any particular state retention mechanism, as the

speciﬁcs might vary depending on how our core speciﬁcation

is used. Instead, we break up the two conceptual stages of

a Cosmos DB write into re-usable parts that we describe

individually.

WritesAccepted ∆

∧Len(logv )−readIndex <VersionBound

∧((WriteConsistencyLevel =BoundedStaleness)⇒

Len(logv )−commitIndex <StalenessBound )

1) Beginning a Write Operation: WritesAccepted deter-

mines whether a write may be attempted at all. It con-

strains writes based on two factors: VersionBoundand

StalenessBound.VersionBound is a global limit on

how many partially-replicated writes may exist in a Cosmos

DB instance at any one time. StalenessBound is a global

limit on how many non-durable writes may exist in a Cosmos

DB instance at any one time, used to enforce bounded staleness

consistency.

WriteInit(key,value)∆

log0=Append(log,[key 7→ key,value 7→ value])

WriteInit deﬁnes the initial stage of any permitted write

operation, appending a new key-value pair to the log. This

means that at least one replica now holds the new key-value

pair. The lack of distinction between incomplete and complete

writes is intentional here: Cosmos DB replicas unconditionally

begin serving writes as soon as they accept them. The Cosmos

DB client libraries are the ones that enforce Cosmos DB’s

read semantics, and they may perform multiple read requests

against multiple replicas until they get a consistent answer that

can be returned to an end-user. Cosmos DB replicas require

no additional logic restricting which writes should be visible

to which read requests.

WriteInitToken ∆

[epoch 7→ epoch,checkpoint 7→ Len(log) + 1]

WriteInitToken deﬁnes a unique identiﬁer, or token, with

which we can keep track of a write’s progress. This token

is structurally identical to a session token, the data used to

identify a client’s session at session consistency. Note that in

practice, these tokens represent the ﬂow of a request from

1For instance, a client performing only strongly consistent reads and

strongly consistent writes will never witness an in-progress write. Weaker

consistency levels do not provide any such guarantees, however. See Section

IV-D for speciﬁc examples.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

UnderstandingInconsistencyinAzureCosmosDBwithTLA+A.FinnHackettUniversityofBritishColumbiaVancouver,Canadafhackett@cs.ubc.caJoshuaRoweMicrosoftRedmond,USAjoshua.rowe@microsoft.comMarkusAlexanderKuppeMicrosoftResearchRedmond,USAmakuppe@microsoft.comAbstractBeyondimplementationcorrectnessofadistribute...

收起<<

Understanding Inconsistency in Azure Cosmos DB with TLA A. Finn Hackett.pdf

共11页,预览3页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Understanding Inconsistency in Azure Cosmos DB with TLA A. Finn Hackett

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: