network.
Low latency. Modern networking hardware enables round-trip times of a few microseconds for short messages. The transport protocol must not add significantly to this latency, so that applications experience latencies close to the hardware limit. The transport protocol must also support low latency at the tail, even under relatively high network loads with a mix of traffic. Tail latency is particularly challenging for transport protocols; nonetheless, it should be possible to achieve tail latencies for short messages within 2–3x of the best-case latency [19].
High throughput. The transport protocol must support high throughput in two different ways. Traditionally, the term “throughput” has referred to data throughput: delivering large amounts of data in a single message or stream. This kind of throughput is still important. In addition, datacenter applications require high message throughput: the ability to send large numbers of small messages quickly for communication patterns such as broadcast and shuffle [15]. Message throughput has historically not received much attention, but it is essential in datacenters.
In order to meet the above requirements, the transport protocol must also deal with the following problems:
Congestion control. In order to provide low latency, the transport protocol must limit the buildup of packets in network queues. Packet queuing can potentially occur both at the edge (the links connecting hosts to top-of-rack switches) and in the network core; each of these forms of congestion creates distinct problems.
Efficient load balancing across server cores. For more than a decade, network speeds have been increasing rapidly while processor clock rates have remained nearly constant. Thus it is no longer possible for a single core to keep up with a single network link; both incoming and outgoing load must be distributed across multiple cores. This is true at multiple levels. At the application level, high-throughput services must run on many cores and divide their work among the cores. At the transport layer, a single core cannot keep up with a high-speed link, especially with short messages. Load balancing impacts transport protocols in two ways. First, it can introduce overheads (e.g., the use of multiple cores causes additional cache misses for coherence). Second, load balancing can lead to hot spots, where load is unevenly distributed across cores; this is a form of congestion at the software level. Load balancing overheads are now one of the primary sources of tail latency [21], and they are impacted by the design of the transport protocol.
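One common software technique for spreading incoming network load across cores is to give each core its own socket bound to a shared port with SO_REUSEPORT, letting the kernel steer packets among them. The sketch below assumes a Linux-style kernel that supports SO_REUSEPORT for UDP; the function name and socket count are illustrative choices, not something the text prescribes:

```python
import socket

def make_socket_group(n: int) -> list:
    """Bind n UDP sockets to one shared port via SO_REUSEPORT.

    The kernel hashes each incoming packet's addresses and ports and
    steers it to one of the sockets in the group, so each core can run
    its own receive loop without a shared queue. A skewed hash is one
    way the hot spots described above can arise.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    first.bind(("127.0.0.1", 0))  # let the kernel pick a free port
    port = first.getsockname()[1]
    group = [first]
    for _ in range(n - 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind(("127.0.0.1", port))  # same port as the first socket
        group.append(s)
    return group

# One socket per core; a real server would poll each from a thread
# pinned to its own core.
sockets = make_socket_group(4)
```

Note that this only balances at packet or connection granularity; it does nothing about the coherence-miss overheads the kernel and application still pay.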
NIC offload. There is increasing evidence that software-based transport protocols no longer make sense; they simply cannot provide high performance at an acceptable cost. For example:
• The best software protocol implementations have end-to-end latency more than 3x as high as implementations where applications communicate directly with the NIC via kernel bypass.
• Software implementations give up a factor of 5–10 in small message throughput, compared with NIC-offloaded implementations.
• Driving a 100 Gbps network at 80% utilization in both directions consumes 10–20 cores just in the networking stack [16, 21]. This is not a cost-effective use of resources.
Thus, in the future, transport protocols will need to move to special-purpose NIC hardware. The transport protocol must not have features that preclude hardware implementation. Note that NIC-based transports will not eliminate software load balancing as an issue: even if the transport is in hardware, application software will still be spread across multiple cores.
3 Everything about TCP is wrong
This section discusses five key properties of TCP, which cover
almost all of its design:
• Stream orientation
• Connection orientation
• Bandwidth sharing (“fair” scheduling)
• Sender-driven congestion control
• In-order packet delivery
Each of these properties represents the wrong decision for a
datacenter transport, and each of these decisions has serious
negative consequences.
3.1 Stream orientation
The data model for TCP is a stream of bytes. However, this is not the right data model for most datacenter applications. Datacenter applications typically exchange discrete messages to implement remote procedure calls. When messages are serialized in a TCP stream, TCP has no knowledge of message boundaries. This means that when an application reads from a stream, there is no guarantee that it will receive a complete message; it may receive less than a full message, or parts of several messages. TCP-based applications must mark message boundaries when they serialize messages (e.g., by prefixing each message with its length), and they must use this information to reassemble messages on receipt. This introduces extra complexity and overhead, such as maintaining state for partially received messages.
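The length-prefix scheme just described can be sketched in a few lines; the receive helper must loop because a byte stream is free to return partial data. The function names and 4-byte header are our illustrative choices, not something the text prescribes:

```python
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian message length

def frame(msg: bytes) -> bytes:
    """Prefix a message with its length so boundaries survive the stream."""
    return HEADER.pack(len(msg)) + msg

def recv_exact(sock, n: int) -> bytes:
    """Loop until exactly n bytes arrive; recv() may return fewer.

    This is the per-message state the text mentions: until the loop
    finishes, the application holds a partially received message.
    """
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("stream closed mid-message")
        buf += chunk
    return bytes(buf)

def recv_message(sock) -> bytes:
    """Reassemble one complete message from the byte stream."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

Every TCP-based RPC system ends up carrying some variant of this code, along with the buffering it implies.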
The streaming model is disastrous for software load balancing. Consider an application that uses a collection of threads to serve requests arriving across a collection of streams. Ideally, all of the threads would wait for incoming messages on any of the streams, with messages distributed across the threads. However, with a byte-stream model there is no guarantee that a read operation returns an entire message. If multiple threads read from the same stream, parts of a single message might be received by different threads. In principle the threads could coordinate to reassemble the entire message in one of them, but this is too expensive to be practical.
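By contrast, with a message-oriented transport the ideal pattern described above (any thread handles any incoming message) is trivial to express. A minimal sketch under that assumption, using an in-process queue of complete messages as a stand-in for the transport:

```python
import queue
import threading

incoming: queue.Queue = queue.Queue()  # pool of complete messages
results: list = []
lock = threading.Lock()

def worker() -> None:
    """Any worker may take any message: no thread owns a stream."""
    while True:
        msg = incoming.get()
        if msg is None:            # shutdown sentinel
            break
        with lock:
            results.append(msg)    # stand-in for executing an RPC

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for i in range(20):                # messages arriving from many peers
    incoming.put(b"rpc-%d" % i)
for _ in workers:                  # one sentinel per worker
    incoming.put(None)
for w in workers:
    w.join()
```

Because the unit handed to software is a whole message, load spreads across workers with no per-stream ownership and no reassembly coordination.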
Instead, TCP applications must use one of two inferior
forms of load balancing, in which each stream is owned by a