Understanding Inconsistency in Azure Cosmos DB with TLA A. Finn Hackett

2025-05-06 1 0 402.77KB 11 页 10玖币
侵权投诉
Understanding Inconsistency in Azure Cosmos DB
with TLA+
A. Finn Hackett
University of British Columbia
Vancouver, Canada
fhackett@cs.ubc.ca
Joshua Rowe
Microsoft
Redmond, USA
joshua.rowe@microsoft.com
Markus Alexander Kuppe
Microsoft Research
Redmond, USA
makuppe@microsoft.com
Abstract—Beyond implementation correctness of a distributed
system, it is equally important to understand exactly what users
should expect to see from that system. Even if the system itself
works as designed, insufficient understanding of its user-visible
semantics can cause bugs in its dependencies. By focusing a
formal specification effort on precisely defining the expected user-
facing behaviors of the Azure Cosmos DB service at Microsoft,
we were able to write a formal specification of the database that
was significantly smaller and conceptually simpler than any other
specification of Cosmos DB, while representing a wider range
of valid user-observable behaviors than existing more detailed
specifications. Many of the additional behaviors we documented
were previously poorly understood outside of the Cosmos DB
development team, even informally, leading to data consistency
errors in Microsoft products that depend on it. Using this model,
we were able to raise two key issues in Cosmos DB’s public-facing
documentation, which have since been addressed. We were also
able to offer a fundamental solution to a previous high-impact
outage within another Azure service that depends on Cosmos
DB.
Index Terms—cloud computing, formal methods, model check-
ing, documentation
I. INTRODUCTION
Consistency guarantees for distributed databases are noto-
riously hard to understand. Not only can distributed systems
inherently behave in unexpected and counter-intuitive ways
due to internal concurrency and failures, but they can also
lull their users into a false sense of functional correctness:
most of the time, users of a distributed database will witness
a much simpler and more consistent set of behaviors than
what is actually possible. Only timeouts, fail-overs, or other
rare events will expose the true set of behaviors a user
might witness [1]. Testing for these scenarios is difficult at
best: reproducing them reliably requires controlling complex
concurrency factors, latency variations, and network behaviors.
Even just producing usable documentation for developers
is fundamentally challenging [2], [3], [4], and explaining
these subtle consistency issues via documentation comes as
an additional burden to distributed system developers and
technical writers alike. Formal methods have long been applied
to the design of distributed systems, including in industry [5],
[6], [7], [8], but these are years-long high-effort projects
that focus on implementation correctness, not explaining the
system to users. Rather than focus on this difficult task, we
address a simpler and more fundamental question: ignoring
the implementation, what kind of behavior should a client be
able to witness while interacting with a service?
We use TLA+to answer this simpler question for Cosmos
DB, a planet-scale key-value store. In practice, Cosmos DB
offers a rich interface featuring multiple query APIs, and
has complex operational behaviors involving georeplication
and partitioning of data. As our focus is on data consistency
from a client perspective, we model only the core read and
write operations underlying the system’s semantics relating
to their 5 configurable consistency levels. We show that this
minimal client-focused specification of a large-scale service
offers important design- and documentation-level insights,
while keeping buy-in cost low.
We document the 2 person-month development process
of our specification, which consists of iterative prototyping
using the public documentation [9], feedback from author 2,
a Cosmos DB developer, and the specification and model
checking of a collection of formal properties based on our
understanding. Aside from the specification itself, we discuss
a pair of key issues it helped us discover within Cosmos DB’s
documentation, and how both have since been addressed. We
also use our specification to explain the previously-unclear
root cause of a 28-day high-priority outage within Microsoft
Azure.
We describe the following results: (1) a concise (390
LOC) client-focused specification of Cosmos DB, a large-
scale distributed system; (2) a pair of key documentation bugs
we found by developing our specification — statements in
Cosmos DB’s public documentation [9] that have now been
corrected; and (3) using our specification, a novel and concise
mechanized explanation of a high-severity Cosmos DB-related
outage within Azure that took 28 days to identify and mitigate.
Beyond our work so far, we expect our model to be
useful in future design work as Cosmos DB’s implementation
evolves, aided by its ability to precisely and abstractly state a
client’s expectations of system behavior. Services depending
on Cosmos DB may also benefit from incorporating our work
into TLA+specifications of their own processes, in which case
our work may be used to prevent future outages similar to the
one we describe in this paper.
II. BACKGROUND
Our work uses the TLA+specification language [10], which
can be used to describe state machines using set-theoretic
arXiv:2210.13661v1 [cs.SE] 24 Oct 2022
constructs and temporal logic. Models written in TLA+have
no direct correspondence to implementations, with users fo-
cusing instead on analyzing design decisions and verifying
model-level correctness properties. This philosophy allows
specification writers to leave out irrelevant details and focus
on expressing a specification’s core semantics as simply as
possible.
In addition to plain TLA+, model developers can also write
models in PlusCal [11], a high-level imperative language that
is more like contemporary programming languages.
It is possible to check model properties using the explicit-
state model checker TLC [12], [13], the symbolic model
checker Apalache [14], and the manual proof assistant
TLAPS [15]. In this work, we relied on the TLC model
checker to analyze our specification.
As well as model checking temporal properties, it is also
possible to express and check refinements [16] in TLA+. A
refinement proves that one specification implements another
– meaning that one specification exhibits every behavior that
another specification exhibits, given an appropriate translation
between the two specifications’ state spaces. We use this tech-
nique to show that our new specification produces a superset
of the behaviors produced by existing TLA+specifications of
Cosmos DB.
III. A SIMPLE MODEL OF COSMOS DB
To fully illustrate our claim to simplicity, this section
describes our full formalization of Cosmos DB’s semantics
in a few pages, including most of the core TLA+defini-
tions in-text. Our modeling process was based on iterative
discussion with author 2, a principal engineer working on the
Cosmos DB implementation. We followed existing user-facing
documentation, asked for feedback, learned more about the
realities of Cosmos DB’s design, and incorporated that new
knowledge into our specification. We repeated this feedback
loop until we found no more corrections, when our model
began to predict counter-intuitive but possible behaviors of
the real system. See Subsection IV-D for in-depth analysis of
such behaviors. See https://github.com/tlaplus/azure-cosmos-
tla/tree/master/simple-model for the full TLA+definitions.
A. Consistency Levels
Cosmos DB offers 5 consistency levels that affect read
and write behavior. A system administrator must configure
all writes to follow a single consistency level per Cosmos DB
deployment. Read operations may either match the configured
write consistency level or weaken it according to the hierarchy
defined in Figure 1. We discuss high-level prose descriptions
of these consistency levels, which we complement with precise
TLA+descriptions later on.
Strong consistency. Reads and writes are linearizable [17],
as long as the operation does not fail. See Section IV-D for
possible behaviors given failures.
Bounded staleness. Writes older than a given time bound
are durable and consistently readable, whereas writes younger
than the given bound are not. The time bound is defined two
Bounded Staleness
Session
Consistent Prefix
Eventual
Strong
Fig. 1. Hierarchy of consistency levels in Cosmos DB, with strongest at the
top and weakest guarantees at the bottom.
ways: a bound in wall-clock time, and a maximum bound on
the number of eventually-consistent writes allowed at once.
If the bounds are in danger of being exceeded, additional
writes will be refused for replication to catch up. For modeling
simplicity, we ignore the wall-clock temporal aspect of this
mode’s semantics, and consider only operation count.
Session consistency. Reads and writes are tagged with
session tokens. Operations with the same session token are
linearizable relative to one another, but no guarantees are
provided across different session tokens. Session consistency
writes are not guaranteed durable, and tokens may be invali-
dated by data loss.
Consistent prefix. Reads are monotonic: a client may only
read newer values than it has already read. Section IV-A
describes why we find this is ultimately equivalent to eventual
consistency.
Eventual consistency. This level offers no ordering guaran-
tees, but does provide a notion of eventual convergence over
an arbitrary period of time.
B. Data Definitions
Our model of Cosmos DB is defined to have 4 state
variables, and allows them to evolve over time via some simple
actions. Each variable relates to a specific aspect of the system
being modeled.
log. The log is a sequence of writes, represented as key-
value pairs. For example, [key 7→ k1,value 7→ v1]pairs
key k1 with value v1. The sequence lists all writes that are
stored anywhere in Cosmos DB’s implementation, irrespective
of replication or durability.
readIndex. The readIndex marks either a position in
the log or 0. For any element of the log, if its index
is less than or equal to readIndex, then it is replicated
universally across the current Cosmos DB deployment. Rep-
resenting eventually-complete propagation of writes, the log
prefix defined by readIndex behaves identically to a single
key-value store.
commitIndex. The “commit index” marks a position in
the log or 0. For any element of the log, if its in-
dex is less than or equal to commitIndex, then it is
replicated at a global majority of replicas, and is therefore
durable due to consensus. It follows from this definition that
readIndex commitIndex must always hold.
epoch. The epoch is a monotonically increasing counter
of fail-overs. If epoch remains constant, fail-overs behavior
such as data loss may not be observed. If it increases, some
data loss may be observed at the point of increase.
The specification has six constants: Keys and Values,
which are the sets of keys and values respectively. These
sets can be redefined based on the use case — they can
be generalized to infinite sets like “all strings”, or restricted
to a small finite set of constant values in order to simplify
model checking. NoValue is a constant indicating the ab-
sence of a value. VersionBound and StalenessBound
are natural numbers that affect when writes are allowed.
WriteConsistencyLevel represents the currently con-
figured consistency level for write operations, one of the 5
consistency levels.
We chose to base our specification on a sequential log
because Cosmos DB, like any consensus-based system, de-
termines a total order in which clients should consider their
requests to have occurred. This is why many parts of our
specification, including several state variables, identify writes
by log index.
Building on these definitions, we can express our first two
fundamental actions.
IncreaseReadIndexAndOrCommitIndex
=
commitIndex 0commitIndex . . Len(log)
readIndex 0readIndex . . commitIndex 0
UNCHANGED hlog,epochi
IncreaseReadIndexAndOrCommitIndex models the
concept of data replication, that is, readIndex
and/or commitIndex advancing. readIndex and
commitIndex may non-deterministically gain new
values between readIndex and commitIndex0, and
commitIndex and Len(log), respectively. Neither log
nor epoch may change. This ensures that both values may
only grow, that they never point beyond end of the log, and
that readIndex commitIndex remains true.
TruncateLog
=
i(commitIndex + 1) . . Len(log) :
log0=SubSeq(log,1,i1)
epoch0=epoch + 1
UNCHANGED hreadIndex ,commitIndex i
TruncateLog models the concept of data loss: if there exists
any index isuch that commitIndex <i, then log may be
truncated non-deterministically such that its new length is i
1. In-progress operations may watch for changes in epochs
value to detect and respond to failures, meaning epoch acts
as a failure detector.
Because these actions may happen non-deterministically,
any combination of replication and fail-over may occur at any
time, interleaved with other actions. A short sequence of such
actions can represent a complex series of implementation-level
possibilities.
C. Write Operations
In Cosmos DB, write operations are not atomic. They
may sometimes appear atomic under certain configurations1,
but their underlying structure needs to be broken down into
multiple steps.
As a consequence of writes’ multi-step nature, we need
to record the state of in-progress writes. For portability, we
don’t require any particular state retention mechanism, as the
specifics might vary depending on how our core specification
is used. Instead, we break up the two conceptual stages of
a Cosmos DB write into re-usable parts that we describe
individually.
WritesAccepted
=
Len(logv )readIndex <VersionBound
((WriteConsistencyLevel =BoundedStaleness)
Len(logv )commitIndex <StalenessBound )
1) Beginning a Write Operation: WritesAccepted deter-
mines whether a write may be attempted at all. It con-
strains writes based on two factors: VersionBoundand
StalenessBound.VersionBound is a global limit on
how many partially-replicated writes may exist in a Cosmos
DB instance at any one time. StalenessBound is a global
limit on how many non-durable writes may exist in a Cosmos
DB instance at any one time, used to enforce bounded staleness
consistency.
WriteInit(key,value)
=
log0=Append(log,[key 7→ key,value 7→ value])
WriteInit defines the initial stage of any permitted write
operation, appending a new key-value pair to the log. This
means that at least one replica now holds the new key-value
pair. The lack of distinction between incomplete and complete
writes is intentional here: Cosmos DB replicas unconditionally
begin serving writes as soon as they accept them. The Cosmos
DB client libraries are the ones that enforce Cosmos DB’s
read semantics, and they may perform multiple read requests
against multiple replicas until they get a consistent answer that
can be returned to an end-user. Cosmos DB replicas require
no additional logic restricting which writes should be visible
to which read requests.
WriteInitToken
=
[epoch 7→ epoch,checkpoint 7→ Len(log) + 1]
WriteInitToken defines a unique identifier, or token, with
which we can keep track of a write’s progress. This token
is structurally identical to a session token, the data used to
identify a client’s session at session consistency. Note that in
practice, these tokens represent the flow of a request from
1For instance, a client performing only strongly consistent reads and
strongly consistent writes will never witness an in-progress write. Weaker
consistency levels do not provide any such guarantees, however. See Section
IV-D for specific examples.
摘要:

UnderstandingInconsistencyinAzureCosmosDBwithTLA+A.FinnHackettUniversityofBritishColumbiaVancouver,Canadafhackett@cs.ubc.caJoshuaRoweMicrosoftRedmond,USAjoshua.rowe@microsoft.comMarkusAlexanderKuppeMicrosoftResearchRedmond,USAmakuppe@microsoft.comAbstract—Beyondimplementationcorrectnessofadistribute...

收起<<
Understanding Inconsistency in Azure Cosmos DB with TLA A. Finn Hackett.pdf

共11页,预览3页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:11 页 大小:402.77KB 格式:PDF 时间:2025-05-06

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 11
客服
关注