The Case for Accelerating BFT Protocols Using In-Network Ordering
Guangda Sun, Xin Zhe Khooi, Yunfan Li, Mingliang Jiang, and Jialin Li
National University of Singapore
Abstract
Mission critical systems deployed in data centers today are
facing more sophisticated failures. Byzantine fault tolerant
(BFT) protocols are capable of masking these types of fail-
ures, but are rarely deployed due to their performance cost
and complexity. In this work, we propose a new approach
to designing high performance BFT protocols in data cen-
ters. By re-examining the ordering responsibility between
the network and the BFT protocol, we advocate a new ab-
straction offered by the data center network infrastructure.
Concretely, we design a new authenticated ordered multi-
cast primitive (AOM) that provides transferable authentication
and non-equivocation guarantees. Feasibility of the design is
demonstrated by two hardware implementations of AOM– one
using HMAC and the other using public key cryptography for
authentication – on new-generation programmable switches.
We then co-design a new BFT protocol, Matrix, that leverages
the guarantees of AOM to eliminate cross-replica coordination
and authentication in the common case. Evaluation results
show that Matrix outperforms state-of-the-art protocols on
both latency and throughput metrics by a wide margin, demon-
strating the benefit of our new network ordering abstraction
for BFT systems.
1 Introduction
Online services today are commonly deployed in large data
centers and rely on fault-tolerance protocols to provide high
availability in the presence of failures. An important class of
fault-tolerance protocols is state machine replication (SMR).
SMR protocols [36,39,44,48,55] have long been deployed
in production systems [18,20,24,32] to ensure a set of dis-
tributed nodes behaves like a single, always available state
machine, despite failures of individual machines. These proto-
cols, however, can only tolerate node crash failures. In reality,
systems running in data centers are facing more sophisticated
failures. This is particularly relevant today, as permissioned
blockchain systems [3,8,50] are increasingly being deployed
in data centers for applications such as trading [2,4]. These
systems require tolerance to adversarial nodes and attacks.
Recent work [50] has shown that fault tolerance is becoming
their main performance bottleneck.
Numerous Byzantine fault tolerant (BFT) protocols [19,23,
38,47,56,58,60] have been proposed to handle arbitrary node
failures. Their strong failure models, however, come with sig-
nificant performance implications. BFT protocols typically in-
cur rounds of replica communication coupled with expensive
cryptographic operations, resulting in low system throughput
and high request latency. To obtain higher throughput, many
BFT protocols rely on heavy request batching which leads
to long end-to-end decision latency – often in the range of
tens of milliseconds. Unfortunately, such latency overheads
are prohibitive for modern data center applications with strin-
gent service-level objectives (SLOs). Speculative BFT proto-
cols such as Zyzzyva [38] offer better commitment latency.
However, even a single faulty replica would eliminate the
performance benefit of these latency-optimized protocols.
In this paper, we propose a new approach to building high-
performance BFT protocols in data centers. We observe that
traditional BFT protocols are designed with minimum as-
sumptions about the underlying network: The network only
provides best effort message delivery. The result of this weak
network model is that application-level protocols are respon-
sible for enforcing all correctness properties such as total
ordering, durability, and authentication. Our key insight is
that strengthening the network model, in which the network
can guarantee ordered message delivery, can lead to reduced
complexity and performance overhead of BFT protocols. Prior
work [42,43] has shown the promise of an in-network sequenc-
ing approach for crash fault-tolerant systems. However, ex-
isting approaches fail to work once Byzantine failures are
introduced: Faulty nodes can disseminate conflicting mes-
sage orders, and the network sequencer may equivocate by
assigning different sequence numbers to each replica. In this
work, we propose a new network-level primitive, AOM, that
addresses the above challenges. AOM ensures that correct re-
ceivers always deliver multicast messages in the same order,
even in the presence of Byzantine participants. The key prop-
erty offered by AOM is transferable in-network authentication
– receivers can verify that a multicast message is properly
delivered by AOM, and they can prove authenticity of the
message to other receivers in the system. We additionally
propose a mixed failure model [45,54] in which the network
infrastructure can be either crash- or Byzantine-faulty. For de-
ployments that trust the network infrastructure, AOM provides
ordering guarantees with minimum network-level overhead.
For systems that need to tolerate Byzantine network devices,
AOM uses a simple round of cross-receiver communication to
handle equivocating sequencers.
We demonstrate the feasibility of our approach by imple-
menting our AOM primitive in off-the-shelf programmable
switches. The switch data plane performs both sequencing and
authentication for AOM messages. While packet sequencing
is relatively straightforward, generating secure authentication
codes – a complex mathematical procedure – is a major chal-
lenge given the switch’s limited resources and computational
constraints. We propose two implementations of in-switch
message authentication, each with trade-offs among switch
resource utilization, performance, and scalability. The first
variant implements hash-based message authentication code (HMAC)
vectors in the switch ASICs using SipHash [9] as the hash-
ing algorithm. The second variant generates signatures using
public-key cryptography. Due to hardware constraints, di-
rect in-switch implementation of cryptographic algorithms
such as RSA [52] and ECDSA [35] remains infeasible. We
design a new heterogeneous switch architecture that tightly
couples FPGA-based cryptographic accelerators to the switch
pipelines. Our design enables efficient in-network processing
and signing of AOM messages, scales linearly with the number
of cryptographic accelerators attached, and requires minimum
hardware resources on the switch data plane.
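To illustrate the HMAC-vector idea at a high level, the sketch below is our own illustration, not the switch data-plane code: the key handling, field names, and the use of SHA-256 are assumptions (the switch variant described here uses SipHash [9] in the ASIC; SHA-256 appears only because it is available in the Python standard library). The sequencer attaches one MAC per receiver over the sequence number and payload, so a receiver can verify its own entry and can forward the whole vector to others, who verify theirs.

```python
# A sketch of per-receiver MAC vectors for authenticated ordering.
# NOT the switch implementation: keys, field names, and SHA-256 are stand-ins.
import hmac, hashlib

def seal(seq: int, payload: bytes, receiver_keys: dict) -> dict:
    """Sequencer side: stamp a sequence number and attach one MAC per receiver."""
    header = seq.to_bytes(8, "big") + payload
    macs = {rid: hmac.new(key, header, hashlib.sha256).digest()
            for rid, key in receiver_keys.items()}
    return {"seq": seq, "payload": payload, "macs": macs}

def verify(msg: dict, rid: str, key: bytes) -> bool:
    """Receiver side: check the entry computed under this receiver's key.
    Because the full vector travels with the message, a receiver can forward
    it and other receivers can run the same check (transferable authentication)."""
    header = msg["seq"].to_bytes(8, "big") + msg["payload"]
    expected = hmac.new(key, header, hashlib.sha256).digest()
    return hmac.compare_digest(expected, msg["macs"].get(rid, b""))
```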
Leveraging the strong properties provided by AOM, we co-
design a new BFT protocol, Matrix. In the common case,
Matrix replicas rely on the ordering guarantees of AOM to
commit client requests in a single round trip, eliminating all
cross replica communication and authentication. Moreover,
Matrix stays in this fast path protocol even in the presence of (up to f) faulty replicas, while requiring the theoretical minimum replication factor (3f+1). In the rare case of network
failures, we design efficient protocols to handle packet drops
and faulty switch sequencers while guaranteeing correctness.
By evaluating against state-of-the-art BFT protocols, we show
that Matrix can improve protocol throughput by up to 3.4× and end-to-end latency by 42×, demonstrating the benefit
of our authenticated in-network ordering approach for BFT
systems.
2 Background
In this section, we give an overview of state-of-the-art
BFT protocols. We then review recent proposals that use in-
network ordering to accelerate crash fault-tolerant systems.
2.1 State-of-the-Art BFT Protocols
There has been a long line of work on BFT state machine
replication (SMR) protocols. Table 1 presents a summary
of the properties and comparison of some recent represen-
tative protocols. PBFT [19] is the first practical BFT proto-
col that tolerates up to f Byzantine nodes, requiring at least 3f+1 replicas, which has been shown to be the theoretical lower bound [17]. In PBFT, client requests are committed in five
message delays: clients send requests to a primary replica; the
primary replica sequences and forwards the requests to the
backup replicas; backup replicas authenticate the requests and
broadcast their acceptance; after a replica receives quorum
acceptance, it broadcasts a commit decision; replicas execute
the request and reply to the client once they collect quorum
commit decisions. As replicas exchange messages in an all-to-all fashion, each replica processes O(N) messages, making the authenticator complexity O(N²).
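As a concrete illustration of these phases, the following sketch is our simplification, not PBFT's actual code; all names and the quorum accounting are hypothetical. It walks a replica through the normal case; the two all-to-all rounds are what push the per-replica message load to O(N) and the total authenticator count to O(N²).

```python
# A simplified sketch of PBFT's normal-case phases (illustration only).
from collections import defaultdict

N = 4                   # replicas, tolerating f = 1 fault (N = 3f + 1)
F = (N - 1) // 3
QUORUM = 2 * F + 1      # matching messages needed to make progress

class Replica:
    def __init__(self, rid, is_primary):
        self.rid, self.is_primary = rid, is_primary
        self.next_seq = 0
        self.prepares = defaultdict(set)   # seq -> ids of replicas that prepared
        self.commits = defaultdict(set)    # seq -> ids of replicas that committed

    def on_request(self, request):
        """Primary assigns the order and broadcasts PRE-PREPARE to backups."""
        assert self.is_primary
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return ("PRE-PREPARE", seq, request)

    def on_pre_prepare(self, seq, request):
        """Backups accept the order and broadcast PREPARE to all replicas."""
        return ("PREPARE", seq, self.rid)

    def on_prepare(self, seq, sender):
        """With a quorum of matching PREPAREs, broadcast COMMIT (all-to-all)."""
        self.prepares[seq].add(sender)
        if len(self.prepares[seq]) == QUORUM:
            return ("COMMIT", seq, self.rid)

    def on_commit(self, seq, sender):
        """With a quorum of matching COMMITs, execute and reply to the client."""
        self.commits[seq].add(sender)
        if len(self.commits[seq]) == QUORUM:
            return ("REPLY", seq, self.rid)
```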
Zyzzyva [38] speculatively executes client requests before
they are committed to reduce the communication overhead.
The protocol includes a fast path with three message delays
when clients receive matching replies from all replicas, and otherwise a slow path with at least five message delays. The primary replica in Zyzzyva still sends signed messages to all backup replicas (O(N)), but with all-to-all communication removed, the authenticator complexity is reduced to O(N). Replicas in Zyzzyva may need to roll back speculatively executed operations during view changes.
Unlike Zyzzyva which pushes the responsibility of collect-
ing authenticators to the client, SBFT [29] uses a round-robin message collector among all replicas to remove all-to-all communication, similarly reducing authenticator complexity to O(N). SBFT also leverages threshold signatures to reduce
message size, and simultaneously decreases the number of
client replies to one per decision.
Many BFT protocols [19,29,38] use an expensive view
change protocol to handle primary leader failure. The standard
view change protocol used in PBFT requires O(N³) message
authenticators, limiting the overall scalability. HotStuff [58]
addresses this issue by adding an extra phase in normal oper-
ation. This design reduces the authenticator complexity of the
leader failure protocol to O(N), matching that of the normal
case protocol. In return, HotStuff adds a one-way network
latency to the request commit delay.
BFT with trusted components.
To reduce protocol com-
plexity, a recent line of work [23,25,41,56,61] introduces
trusted components on each replica. These trusted compo-
nents can be implemented in a Trusted Platform Module
(TPM) [5] or run in a trusted hypervisor, and are assumed to
always behave correctly even if residing on Byzantine nodes.
A2M-PBFT-EA [23] uses an attested append-only memory
(A2M) to securely store operations as entries in a log. Each
A2M log entry is associated with a monotonically increasing,
gap-less sequence number. Once appended, a log entry be-
                          PBFT [19]  Zyzzyva [38]  SBFT [29]  HotStuff [58]  A2M [23]  MinBFT [56]  Matrix
Replication Factor        3f+1       3f+1          3f+1       3f+1           2f+1      2f+1         3f+1
Bottleneck Complexity     O(N)       O(N)          O(N)       O(N)           O(N)      O(N)         O(1)
Authenticator Complexity  O(N²)      O(N)          O(N)       O(N)           O(N²)     O(N²)        O(N)
Message Delay             5          3             6          4              5         4            2
Table 1: Comparison of Matrix to state-of-the-art BFT protocols. Here, bottleneck complexity denotes the number of messages
the bottleneck replica needs to process; authenticator complexity shows the total number of signatures processed by all replicas.
comes immutable, and its content can be attested by any node
in the system. A2M enables replicas to eliminate equivocation,
and thus reduce the replication factor to 2f+1. However, the
protocol still incurs the same bottleneck complexity, authenti-
cator complexity, and message latency as PBFT. A later work,
TrInc [41], reduces the trusted component to a single counter,
and leaves the log in untrusted memory.
MinBFT [56] features a message-based trusted primitive
called Unique Sequential Identifier Generator (USIG). For
each input message, USIG generates a unique identifier that
is monotonic, sequential, and verifiable. By authenticating
USIG identifiers, MinBFT replicas can validate that all other
replicas have received the same messages in the same order.
This property allows the protocol to reduce the message delay
to four. Unfortunately, MinBFT’s authenticator complexity
remains at O(N²).
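As a rough sketch of the USIG interface, consider the code below. This is our interpretation, not MinBFT's code: MinBFT's certificates can be HMAC- or signature-based, and the secret never leaves the trusted component; the counter-plus-tag structure is what matters here.

```python
# A sketch of a USIG-style trusted counter (illustration, not MinBFT's code).
import hmac, hashlib

class USIG:
    def __init__(self, secret: bytes):
        self._secret = secret       # held inside the trusted component
        self._counter = 0           # monotonic, gap-free

    def create_ui(self, message: bytes):
        """Assign the next counter value to `message` and certify the binding."""
        self._counter += 1
        tag = hmac.new(self._secret,
                       self._counter.to_bytes(8, "big") + message,
                       hashlib.sha256).digest()
        return self._counter, tag

    @staticmethod
    def verify_ui(secret: bytes, counter: int, message: bytes, tag: bytes) -> bool:
        """Any replica with the verification material can check the identifier."""
        expected = hmac.new(secret,
                            counter.to_bytes(8, "big") + message,
                            hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)
```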
2.2 In-Network Ordering for CFT Protocols
Another recent line of work [42,43,49] proposes a new ap-
proach to designing crash fault tolerant (CFT) SMR protocols.
These systems move the responsibility of request ordering to
the data center network, so application-level protocols only
ensure durability of client operations. This network co-design
approach improves SMR protocol performance by reducing
coordination overhead among servers to commit an operation.
For instance, NOPaxos [43] dedicates a programmable switch
in the network as a sequencer. The switch stamps sequence
numbers to each NOPaxos request, and all replicas execute
requests in the same sequence number order. However, all
these solutions only target CFT protocols and they fail to work
if participants can exhibit Byzantine faults, e.g., a Byzantine
node can impersonate the sequencer and broadcast conflicting
message orders.
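A minimal sketch of this in-network sequencing pattern appears below; it is our illustration, and NOPaxos's actual switch program and replica logic differ. Note that nothing in it authenticates the stamp, which is exactly why a Byzantine node could impersonate the sequencer.

```python
# In-network sequencing for CFT replication, boiled down (illustration only).

class Sequencer:
    """Models the programmable switch: it only stamps a monotonic counter."""
    def __init__(self):
        self.seq = 0
    def stamp(self, request):
        self.seq += 1
        return {"seq": self.seq, "payload": request}   # unauthenticated stamp

class CFTReplica:
    """Delivers requests strictly in stamp order; gaps trigger coordination."""
    def __init__(self):
        self.expected = 1
    def deliver(self, msg):
        if msg["seq"] == self.expected:
            self.expected += 1
            return ("EXECUTE", msg["payload"])
        if msg["seq"] > self.expected:
            return ("GAP", self.expected)   # replicas agree on the missing slot's fate
        return None                         # stale or duplicate stamp; ignore
```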
3 Authenticated In-Network Ordering
The core of a BFT SMR protocol is to establish a consistent or-
der of requests even in the presence of failures. Traditionally,
this task is accomplished by explicit communication among
the replicas, usually coordinated by a leader. In this work, we
argue for a new approach to improving the efficiency of BFT
protocols, one in which the responsibility of request ordering is
moved to the network infrastructure.
3.1 The Case for an Authenticated Ordering Service in
Data Center Networks
To guarantee linearizability [30], BFT SMR protocols require
all non-faulty replicas to execute client requests in the same
order. Due to the best-effort assumption about the network, an
application-level protocol is fully responsible for establishing
a total order of requests among the replicas. Take PBFT [19]
as an example: The primary replica first assigns an order to
client requests before broadcasting to backup replicas. All
replicas then use two rounds of communication to agree on
this ordering while tolerating faulty participants. As discussed
in §2, even adding trusted components to each replica does
not alleviate the coordination and authentication overhead in
BFT protocols: replicas still require remote attestations to
verify the received messages.
What if the underlying network can offer stronger guar-
antees? Prior work [42,43,49] has already demonstrated
that in-network ordering, realized using network programma-
bility [16,33], can offer compelling performance benefits to
crash fault tolerant SMR protocols. In this work, we argue
that BFT protocols can similarly benefit from moving the
ordering responsibility to the network. At a high level, by of-
floading request ordering to the network, BFT replicas avoid
explicit communication to establish an execution order, effec-
tively reducing cross-replica coordination and authentication
overhead. This network ordering approach improves both pro-
tocol throughput – less work is performed on each replica –
and latency – fewer message delays are required to commit a
request.
However, achieving message ordering in the network is
more challenging in a BFT setting. In this section, we discuss
the main challenges faced by a BFT-based network ordering
primitive and our approach to address them.
Why authenticated ordering in the network?
In prior
network ordering systems, the network primitive (such as
the Ordered Unreliable Multicast in NOPaxos [43] and the
multi-sequenced groupcast in Eris [42]) is fully responsible
for assigning an order to client requests. With non-Byzantine
participants, this network-level ordering is the only request
order observed by any replica. Unfortunately, in a BFT de-
ployment model, a faulty node can easily impersonate the
network primitive and assign a conflicting message order, vio-
lating the ordering guarantee of the network layer. To tolerate
equivocating Byzantine participants, we augment the network
primitive to provide an authentication property: Non-faulty
replicas can independently validate that the received mes-
sage order is indeed established by the network, not by any
faulty node. We elaborate on how such authentication can be
implemented efficiently on commodity switch hardware in
§4.
Hybrid fault model and Byzantine network.
If the net-
work itself is Byzantine, it can equivocate by assigning differ-
ent message orders to different replicas, violating the ordering
guarantee. In this work, we argue for a dual fault model:
one fault model for the participating nodes and one for the
network. Our argument is inspired by prior work that pro-
poses a hybrid fault model [54] and work [45] that separates
machine faults from network faults. Specifically, we always
assume a Byzantine failure mode for end-hosts. The network
infrastructure, on the other hand, can either be crash-faulty or
Byzantine-faulty. The benefit of our approach is that we offer
flexibility with an explicit trade-off between fault tolerance
and performance. For deployments that trust the network to
only exhibit crash and omission faults, our solution offers the
optimal performance; if the deployment assumes the network
infrastructure can behave arbitrarily, we offer a solution that
tolerates Byzantine faults in the network, albeit taking a small
performance penalty.
We further argue that a hybrid fault model, in which the
network is assumed to be crash-faulty, is a practical option for
many systems deployed in data centers. Networking hardware
presents a smaller attack surface and is less vulnerable to bugs
compared to software-based components. They are single-application ASICs without sophisticated system software, and formal verification of their hardware designs is common
practice. Systems running in data centers also inherently place
some trust in the data center hardware infrastructure. It is
also against the economic interest of data center operators to
violate the trust of their provided services.
Our model resembles existing deployment options in the
public cloud: Only deployments that do not trust the cloud
infrastructure run their virtual machines on instances with
a Trusted Execution Environment (TEE) such as Intel SGX.
The majority of use cases place trust in cloud hardware and hypervisors; in return, they attain higher performance compared to their TEE counterparts.
3.2 Authenticated Ordered Multicast
We have so far argued for an authenticated ordering service
in the network for BFT protocols. To that end, we propose a
new Authenticated Ordered Multicast (AOM) primitive as a
concrete instance of such model. Similar to other multicast
primitives (e.g., IP multicast), an AOM deployment consists
of one or multiple AOM groups, each identified by a unique
group address. AOM receivers can join and leave an AOM group
by contacting a membership service. A sender sends an AOM
message to one AOM group, and the network is responsi-
ble for routing the message to all group receivers. Note that
senders do not know the identity nor the address of individual
receivers; they only specify the group address as the destina-
tion.
Unlike traditional best-effort IP multicast, AOM provides a
set of stronger guarantees, which we formally define here:
Asynchrony. There is no bound on the delivery latency of AOM messages.
Unreliability. There is no guarantee that an AOM message will be received by any receiver in the destination group.
Authentication. A receiver can verify the authenticity of an AOM message, i.e., the message is correctly processed by the AOM network primitive. A correct receiver only delivers authentic AOM messages.
Transferable Authentication. If a receiver r1 forwards an AOM message to another receiver r2, r2 can independently verify the authenticity of the message.
Ordering. For any two authentic AOM messages m1 and m2 destined to the same AOM group G, all correct receivers in G that receive both m1 and m2 deliver them in the same order.
Drop Detection. Receivers can detect AOM message drops in the network and deliver a DROP-NOTIFICATION. Formally, for any authentic AOM message m, either 1) all correct receivers in the destination group G deliver m or a DROP-NOTIFICATION for m before delivering the next AOM message, or 2) none of the correct receivers in G delivers m or a DROP-NOTIFICATION for m.
A key differentiating property of AOM is the capability of
receivers to independently verify the authenticity of an AOM
message. Authenticity in our case does not refer to the identity
of the sender, which still requires end-to-end cryptography.
Instead, receivers can verify that a message is correctly pro-
cessed by the AOM primitive, and the ordering is not forged by any other participant in the system. The authentication capability
is also transferable: an AOM message can be relayed to any
other receiver in the group, and they can independently verify
the authenticity of the message.
Another important guarantee of AOM is non-equivocation,
i.e., the primitive never delivers conflicting orders of messages
to non-faulty receivers in a group. As we show in §4, if the
network is assumed to be Byzantine-faulty, the primitive re-
quires additional confirmation among the group receivers to
guarantee non-equivocation. AOM, however, does not guar-
antee reliable transmission. The implication is that an AOM
message may be delivered by only a subset of the receivers in
a group. AOM ensures that receivers who miss the message
deliver a DROP-NOTIFICATION. The application-level BFT
protocol is then responsible for reaching consensus on the
fate of those dropped messages.
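To tie these guarantees together, the sketch below shows how a receiver can enforce Ordering and Drop Detection locally; it is our illustration only, and the actual receiver logic in Matrix may differ. Authentic messages are delivered strictly by sequence number, and any gap surfaces as a DROP-NOTIFICATION for the application-level BFT protocol to resolve.

```python
# A sketch of AOM receiver-side delivery (illustration only).

class AOMReceiver:
    def __init__(self, verify_fn):
        self.verify = verify_fn     # e.g. a per-receiver MAC check, as sketched earlier
        self.next_seq = 1

    def on_message(self, msg):
        """Translate a raw network message into delivery events for the BFT protocol."""
        if not self.verify(msg):
            return []               # unauthentic messages are never delivered
        events = []
        # Drop Detection: a jump in sequence numbers means earlier authentic
        # AOM messages were lost; notify the protocol before moving on.
        while self.next_seq < msg["seq"]:
            events.append(("DROP-NOTIFICATION", self.next_seq))
            self.next_seq += 1
        if msg["seq"] == self.next_seq:
            events.append(("DELIVER", msg))   # Ordering: deliver in sequence order
            self.next_seq += 1
        return events               # duplicates and stale messages yield no events
```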