
constructs and temporal logic. Models written in TLA+have
no direct correspondence to implementations, with users fo-
cusing instead on analyzing design decisions and verifying
model-level correctness properties. This philosophy allows
specification writers to leave out irrelevant details and focus
on expressing a specification’s core semantics as simply as
possible.
In addition to plain TLA+, model developers can also write
models in PlusCal [11], a high-level imperative language that
is more like contemporary programming languages.
It is possible to check model properties using the explicit-
state model checker TLC [12], [13], the symbolic model
checker Apalache [14], and the manual proof assistant
TLAPS [15]. In this work, we relied on the TLC model
checker to analyze our specification.
As well as model checking temporal properties, it is also
possible to express and check refinements [16] in TLA+. A
refinement proves that one specification implements another
– meaning that one specification exhibits every behavior that
another specification exhibits, given an appropriate translation
between the two specifications’ state spaces. We use this tech-
nique to show that our new specification produces a superset
of the behaviors produced by existing TLA+specifications of
Cosmos DB.
III. A SIMPLE MODEL OF COSMOS DB
To fully illustrate our claim to simplicity, this section
describes our full formalization of Cosmos DB’s semantics
in a few pages, including most of the core TLA+defini-
tions in-text. Our modeling process was based on iterative
discussion with author 2, a principal engineer working on the
Cosmos DB implementation. We followed existing user-facing
documentation, asked for feedback, learned more about the
realities of Cosmos DB’s design, and incorporated that new
knowledge into our specification. We repeated this feedback
loop until we found no more corrections, when our model
began to predict counter-intuitive but possible behaviors of
the real system. See Subsection IV-D for in-depth analysis of
such behaviors. See https://github.com/tlaplus/azure-cosmos-
tla/tree/master/simple-model for the full TLA+definitions.
A. Consistency Levels
Cosmos DB offers 5 consistency levels that affect read
and write behavior. A system administrator must configure
all writes to follow a single consistency level per Cosmos DB
deployment. Read operations may either match the configured
write consistency level or weaken it according to the hierarchy
defined in Figure 1. We discuss high-level prose descriptions
of these consistency levels, which we complement with precise
TLA+descriptions later on.
Strong consistency. Reads and writes are linearizable [17],
as long as the operation does not fail. See Section IV-D for
possible behaviors given failures.
Bounded staleness. Writes older than a given time bound
are durable and consistently readable, whereas writes younger
than the given bound are not. The time bound is defined two
Bounded Staleness
Session
Consistent Prefix
Eventual
Strong
Fig. 1. Hierarchy of consistency levels in Cosmos DB, with strongest at the
top and weakest guarantees at the bottom.
ways: a bound in wall-clock time, and a maximum bound on
the number of eventually-consistent writes allowed at once.
If the bounds are in danger of being exceeded, additional
writes will be refused for replication to catch up. For modeling
simplicity, we ignore the wall-clock temporal aspect of this
mode’s semantics, and consider only operation count.
Session consistency. Reads and writes are tagged with
session tokens. Operations with the same session token are
linearizable relative to one another, but no guarantees are
provided across different session tokens. Session consistency
writes are not guaranteed durable, and tokens may be invali-
dated by data loss.
Consistent prefix. Reads are monotonic: a client may only
read newer values than it has already read. Section IV-A
describes why we find this is ultimately equivalent to eventual
consistency.
Eventual consistency. This level offers no ordering guaran-
tees, but does provide a notion of eventual convergence over
an arbitrary period of time.
B. Data Definitions
Our model of Cosmos DB is defined to have 4 state
variables, and allows them to evolve over time via some simple
actions. Each variable relates to a specific aspect of the system
being modeled.
log. The log is a sequence of writes, represented as key-
value pairs. For example, [key 7→ k1,value 7→ v1]pairs
key k1 with value v1. The sequence lists all writes that are
stored anywhere in Cosmos DB’s implementation, irrespective
of replication or durability.
readIndex. The readIndex marks either a position in
the log or 0. For any element of the log, if its index
is less than or equal to readIndex, then it is replicated
universally across the current Cosmos DB deployment. Rep-
resenting eventually-complete propagation of writes, the log
prefix defined by readIndex behaves identically to a single
key-value store.
commitIndex. The “commit index” marks a position in
the log or 0. For any element of the log, if its in-
dex is less than or equal to commitIndex, then it is
replicated at a global majority of replicas, and is therefore
durable due to consensus. It follows from this definition that
readIndex ≤commitIndex must always hold.
epoch. The epoch is a monotonically increasing counter
of fail-overs. If epoch remains constant, fail-overs behavior