The Case for Accelerating BFT Protocols Using In-Network Ordering
Guangda Sun, Xin Zhe Khooi, Yunfan Li, Mingliang Jiang, and Jialin Li
National University of Singapore
Abstract
Mission critical systems deployed in data centers today are
facing more sophisticated failures. Byzantine fault tolerant
(BFT) protocols are capable of masking these types of fail-
ures, but are rarely deployed due to their performance cost
and complexity. In this work, we propose a new approach
to designing high performance BFT protocols in data cen-
ters. By re-examining the ordering responsibility between
the network and the BFT protocol, we advocate a new ab-
straction offered by the data center network infrastructure.
Concretely, we design a new authenticated ordered multi-
cast primitive (AOM) that provides transferable authentication
and non-equivocation guarantees. Feasibility of the design is
demonstrated by two hardware implementations of AOM– one
using HMAC and the other using public key cryptography for
authentication – on new-generation programmable switches.
We then co-design a new BFT protocol, Matrix, that leverages
the guarantees of AOM to eliminate cross-replica coordination
and authentication in the common case. Evaluation results
show that Matrix outperforms state-of-the-art protocols on
both latency and throughput metrics by a wide margin, demon-
strating the benefit of our new network ordering abstraction
for BFT systems.
1 Introduction
Online services today are commonly deployed in large data
centers and rely on fault-tolerance protocols to provide high
availability in the presence of failures. An important class of
fault-tolerance protocols is state machine replication (SMR).
SMR protocols [36,39,44,48,55] have long been deployed
in production systems [18,20,24,32] to ensure a set of dis-
tributed nodes behaves like a single, always available state
machine, despite failures of individual machines. These proto-
cols, however, can only tolerate node crash failures. In reality,
systems running in data centers are facing more sophisticated
failures. This is particularly relevant today, as permissioned
blockchain systems [3,8,50] are increasingly being deployed
in data centers for applications such as trading [2,4]. These
systems require tolerance to adversarial nodes and attacks.
Recent work [50] has shown that fault tolerance is becoming
their main performance bottleneck.
Numerous Byzantine fault tolerant (BFT) protocols [19,23,
38,47,56,58,60] have been proposed to handle arbitrary node
failures. Their strong failure models, however, come with sig-
nificant performance implications. BFT protocols typically in-
cur rounds of replica communication coupled with expensive
cryptographic operations, resulting in low system throughput
and high request latency. To obtain higher throughput, many
BFT protocols rely on heavy request batching which leads
to long end-to-end decision latency – often in the range of
tens of milliseconds. Unfortunately, such latency overheads
are prohibitive for modern data center applications with strin-
gent service-level objectives (SLOs). Speculative BFT proto-
cols such as Zyzzyva [38] offer better commitment latency.
However, even a single faulty replica would eliminate the
performance benefit of these latency-optimized protocols.
In this paper, we propose a new approach to building high-
performance BFT protocols in data centers. We observe that
traditional BFT protocols are designed with minimum as-
sumptions about the underlying network: The network only
provides best effort message delivery. The result of this weak
network model is that application-level protocols are respon-
sible for enforcing all correctness properties such as total
ordering, durability, and authentication. Our key insight is
that strengthening the network model, in which the network
can guarantee ordered message delivery, can lead to reduced
complexity and performance overhead of BFT protocols. Prior
work [42,43] has shown the promise of an in-network sequenc-
ing approach for crash fault-tolerant systems. However, ex-
isting approaches fail to work once Byzantine failures are
introduced: Faulty nodes can disseminate conflicting mes-
sage orders, and the network sequencer may equivocate by
assigning different sequence numbers to each replica. In this
work, we propose a new network-level primitive, AOM, that
addresses the above challenges. AOM ensures that correct re-
ceivers always deliver multicast messages in the same order,
even in the presence of Byzantine participants. The key prop-
erty offered by AOM is transferable in-network authentication
– receivers can verify that a multicast message is properly
delivered by AOM, and they can prove authenticity of the
message to other receivers in the system. We additionally
propose a mixed failure model [45,54] in which the network
infrastructure can be either crash- or Byzantine-faulty. For de-
ployments that trust the network infrastructure, AOM provides
ordering guarantees with minimum network-level overhead.
For systems that need to tolerate Byzantine network devices,
AOM uses a simple round of cross-receiver communication to
handle equivocating sequencers.
We demonstrate the feasibility of our approach by imple-
menting our AOM primitive in off-the-shelf programmable
switches. The switch data plane performs both sequencing and
authentication for AOM messages. While packet sequencing
is relatively straightforward, generating secure authentication
codes – a complex mathematical procedure – is a major chal-
lenge given the switch’s limited resources and computational
constraints. We propose two implementations of in-switch
message authentication, each with trade-offs among switch
resource utilization, performance, and scalability. The first
variant implements hash-based message authentication code (HMAC)
vectors in the switch ASICs using SipHash [9] as the hash-
ing algorithm. The second variant generates signatures using
public-key cryptography. Due to hardware constraints, di-
rect in-switch implementation of cryptographic algorithms
such as RSA [52] and ECDSA [35] remains infeasible. We
design a new heterogeneous switch architecture that tightly
couples FPGA-based cryptographic accelerators to the switch
pipelines. Our design enables efficient in-network processing
and signing of AOM messages, scales linearly with the number
of cryptographic accelerators attached, and requires minimum
hardware resources on the switch data plane.
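To illustrate the HMAC-vector idea at a high level, the sketch below is our own illustration, not the switch data-plane code: the key handling, field names, and the use of SHA-256 are assumptions (the switch variant described here uses SipHash [9] in the ASIC; SHA-256 appears only because it is available in the Python standard library). The sequencer attaches one MAC per receiver over the sequence number and payload, so a receiver can verify its own entry and can forward the whole vector to others, who verify theirs.

```python
# A sketch of per-receiver MAC vectors for authenticated ordering.
# NOT the switch implementation: keys, field names, and SHA-256 are stand-ins.
import hmac, hashlib

def seal(seq: int, payload: bytes, receiver_keys: dict) -> dict:
    """Sequencer side: stamp a sequence number and attach one MAC per receiver."""
    header = seq.to_bytes(8, "big") + payload
    macs = {rid: hmac.new(key, header, hashlib.sha256).digest()
            for rid, key in receiver_keys.items()}
    return {"seq": seq, "payload": payload, "macs": macs}

def verify(msg: dict, rid: str, key: bytes) -> bool:
    """Receiver side: check the entry computed under this receiver's key.
    Because the full vector travels with the message, a receiver can forward
    it and other receivers can run the same check (transferable authentication)."""
    header = msg["seq"].to_bytes(8, "big") + msg["payload"]
    expected = hmac.new(key, header, hashlib.sha256).digest()
    return hmac.compare_digest(expected, msg["macs"].get(rid, b""))
```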
Leveraging the strong properties provided by AOM, we co-
design a new BFT protocol, Matrix. In the common case,
Matrix replicas rely on the ordering guarantees of AOM to
commit client requests in a single round trip, eliminating all
cross replica communication and authentication. Moreover,
Matrix stays in this fast path protocol even in the presence of (up to f) faulty replicas, while requiring the theoretical minimum replication factor (3f+1). In the rare case of network
failures, we design efficient protocols to handle packet drops
and faulty switch sequencers while guaranteeing correctness.
By evaluating against state-of-the-art BFT protocols, we show
that Matrix can improve protocol throughput by up to 3.4× and end-to-end latency by 42×, demonstrating the benefit
of our authenticated in-network ordering approach for BFT
systems.
2 Background
In this section, we give an overview of state-of-the-art
BFT protocols. We then review recent proposals that use in-
network ordering to accelerate crash fault-tolerant systems.
2.1 State-of-the-Art BFT Protocols
There has been a long line of work on BFT state machine
replication (SMR) protocols. Table 1 presents a summary
of the properties and comparison of some recent represen-
tative protocols. PBFT [19] is the first practical BFT proto-
col that tolerates up to f Byzantine nodes, requiring at least 3f+1 replicas, which has been shown to be the theoretical lower bound [17]. In PBFT, client requests are committed in five
message delays: clients send requests to a primary replica; the
primary replica sequences and forwards the requests to the
backup replicas; backup replicas authenticate the requests and
broadcast their acceptance; after a replica receives quorum
acceptance, it broadcasts a commit decision; replicas execute
the request and reply to the client once they collect quorum
commit decisions. As replicas exchange messages in an all-to-all fashion, each replica processes O(N) messages, making the authenticator complexity O(N²).
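As a concrete illustration of these phases, the following sketch is our simplification, not PBFT's actual code; all names and the quorum accounting are hypothetical. It walks a replica through the normal case; the two all-to-all rounds are what push the per-replica message load to O(N) and the total authenticator count to O(N²).

```python
# A simplified sketch of PBFT's normal-case phases (illustration only).
from collections import defaultdict

N = 4                   # replicas, tolerating f = 1 fault (N = 3f + 1)
F = (N - 1) // 3
QUORUM = 2 * F + 1      # matching messages needed to make progress

class Replica:
    def __init__(self, rid, is_primary):
        self.rid, self.is_primary = rid, is_primary
        self.next_seq = 0
        self.prepares = defaultdict(set)   # seq -> ids of replicas that prepared
        self.commits = defaultdict(set)    # seq -> ids of replicas that committed

    def on_request(self, request):
        """Primary assigns the order and broadcasts PRE-PREPARE to backups."""
        assert self.is_primary
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return ("PRE-PREPARE", seq, request)

    def on_pre_prepare(self, seq, request):
        """Backups accept the order and broadcast PREPARE to all replicas."""
        return ("PREPARE", seq, self.rid)

    def on_prepare(self, seq, sender):
        """With a quorum of matching PREPAREs, broadcast COMMIT (all-to-all)."""
        self.prepares[seq].add(sender)
        if len(self.prepares[seq]) == QUORUM:
            return ("COMMIT", seq, self.rid)

    def on_commit(self, seq, sender):
        """With a quorum of matching COMMITs, execute and reply to the client."""
        self.commits[seq].add(sender)
        if len(self.commits[seq]) == QUORUM:
            return ("REPLY", seq, self.rid)
```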
Zyzzyva [38] speculatively executes client requests before
they are committed to reduce the communication overhead.
The protocol includes a fast path with three message delays
when clients receive matching replies from all replicas, and otherwise a slow path with at least five message delays. The primary replica in Zyzzyva still sends signed messages to all backup replicas (O(N)), but with all-to-all communication removed, the authenticator complexity is reduced to O(N). Replicas in Zyzzyva may need to roll back speculatively executed operations during view changes.
Unlike Zyzzyva which pushes the responsibility of collect-
ing authenticators to the client, SBFT [29] uses a round-robin message collector among all replicas to remove all-to-all communication, similarly reducing authenticator complexity to O(N). SBFT also leverages threshold signatures to reduce
message size, and simultaneously decreases the number of
client replies to one per decision.
Many BFT protocols [19,29,38] use an expensive view
change protocol to handle primary leader failure. The standard
view change protocol used in PBFT requires O(N³) message
authenticators, limiting the overall scalability. HotStuff [58]
addresses this issue by adding an extra phase in normal oper-
ation. This design reduces the authenticator complexity of the
leader failure protocol to O(N), matching that of the normal
case protocol. In return, HotStuff adds a one-way network
latency to the request commit delay.
BFT with trusted components.
To reduce protocol com-
plexity, a recent line of work [23,25,41,56,61] introduces
trusted components on each replica. These trusted compo-
nents can be implemented in a Trusted Platform Module
(TPM) [5] or run in a trusted hypervisor, and are assumed to
always behave correctly even if residing on Byzantine nodes.
A2M-PBFT-EA [23] uses an attested append-only memory
(A2M) to securely store operations as entries in a log. Each
A2M log entry is associated with a monotonically increasing,
gap-less sequence number. Once appended, a log entry be-
                          PBFT [19]  Zyzzyva [38]  SBFT [29]  HotStuff [58]  A2M [23]  MinBFT [56]  Matrix
Replication Factor        3f+1       3f+1          3f+1       3f+1           2f+1      2f+1         3f+1
Bottleneck Complexity     O(N)       O(N)          O(N)       O(N)           O(N)      O(N)         O(1)
Authenticator Complexity  O(N²)      O(N)          O(N)       O(N)           O(N²)     O(N²)        O(N)
Message Delay             5          3             6          4              5         4            2
Table 1: Comparison of Matrix to state-of-the-art BFT protocols. Here, bottleneck complexity denotes the number of messages
the bottleneck replica needs to process; authenticator complexity shows the total number of signatures processed by all replicas.
comes immutable, and its content can be attested by any node
in the system. A2M enables replicas to eliminate equivocation,
and thus reduce the replication factor to 2f+1. However, the
protocol still incurs the same bottleneck complexity, authenti-
cator complexity, and message latency as PBFT. A later work,
TrInc [41], reduces the trusted component to a single counter,
and leaves the log in untrusted memory.
MinBFT [56] features a message-based trusted primitive
called Unique Sequential Identifier Generator (USIG). For
each input message, USIG generates a unique identifier that
is monotonic, sequential, and verifiable. By authenticating
USIG identifiers, MinBFT replicas can validate that all other
replicas have received the same messages in the same order.
This property allows the protocol to reduce the message delay
to four. Unfortunately, MinBFT’s authenticator complexity
remains at O(N²).
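As a rough sketch of the USIG interface, consider the code below. This is our interpretation, not MinBFT's code: MinBFT's certificates can be HMAC- or signature-based, and the secret never leaves the trusted component; the counter-plus-tag structure is what matters here.

```python
# A sketch of a USIG-style trusted counter (illustration, not MinBFT's code).
import hmac, hashlib

class USIG:
    def __init__(self, secret: bytes):
        self._secret = secret       # held inside the trusted component
        self._counter = 0           # monotonic, gap-free

    def create_ui(self, message: bytes):
        """Assign the next counter value to `message` and certify the binding."""
        self._counter += 1
        tag = hmac.new(self._secret,
                       self._counter.to_bytes(8, "big") + message,
                       hashlib.sha256).digest()
        return self._counter, tag

    @staticmethod
    def verify_ui(secret: bytes, counter: int, message: bytes, tag: bytes) -> bool:
        """Any replica with the verification material can check the identifier."""
        expected = hmac.new(secret,
                            counter.to_bytes(8, "big") + message,
                            hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)
```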
2.2 In-Network Ordering for CFT Protocols
Another recent line of work [42,43,49] proposes a new ap-
proach to designing crash fault tolerant (CFT) SMR protocols.
These systems move the responsibility of request ordering to
the data center network, so application-level protocols only
ensure durability of client operations. This network co-design
approach improves SMR protocol performance by reducing
coordination overhead among servers to commit an operation.
For instance, NOPaxos [43] dedicates a programmable switch
in the network as a sequencer. The switch stamps sequence
numbers to each NOPaxos request, and all replicas execute
requests in the same sequence number order. However, all
these solutions only target CFT protocols and they fail to work
if participants can exhibit Byzantine faults, e.g., a Byzantine
node can impersonate the sequencer and broadcast conflicting
message orders.
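A minimal sketch of this in-network sequencing pattern appears below; it is our illustration, and NOPaxos's actual switch program and replica logic differ. Note that nothing in it authenticates the stamp, which is exactly why a Byzantine node could impersonate the sequencer.

```python
# In-network sequencing for CFT replication, boiled down (illustration only).

class Sequencer:
    """Models the programmable switch: it only stamps a monotonic counter."""
    def __init__(self):
        self.seq = 0
    def stamp(self, request):
        self.seq += 1
        return {"seq": self.seq, "payload": request}   # unauthenticated stamp

class CFTReplica:
    """Delivers requests strictly in stamp order; gaps trigger coordination."""
    def __init__(self):
        self.expected = 1
    def deliver(self, msg):
        if msg["seq"] == self.expected:
            self.expected += 1
            return ("EXECUTE", msg["payload"])
        if msg["seq"] > self.expected:
            return ("GAP", self.expected)   # replicas agree on the missing slot's fate
        return None                         # stale or duplicate stamp; ignore
```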
3 Authenticated In-Network Ordering
The core of a BFT SMR protocol is to establish a consistent or-
der of requests even in the presence of failures. Traditionally,
this task is accomplished by explicit communication among
the replicas, usually coordinated by a leader. In this work, we
argue for a new approach to improving the efficiency of BFT
protocols, one in which the responsibility of request ordering is
moved to the network infrastructure.
3.1 The Case for an Authenticated Ordering Service in
Data Center Networks
To guarantee linearizability [30], BFT SMR protocols require
all non-faulty replicas to execute client requests in the same
order. Due to the best-effort assumption about the network, an
application-level protocol is fully responsible for establishing
a total order of requests among the replicas. Take PBFT [19]
as an example: The primary replica first assigns an order to
client requests before broadcasting to backup replicas. All
replicas then use two rounds of communication to agree on
this ordering while tolerating faulty participants. As discussed
in §2, even adding trusted components to each replica does
not alleviate the coordination and authentication overhead in
BFT protocols: replicas still require remote attestations to
verify the received messages.
What if the underlying network can offer stronger guar-
antees? Prior work [42,43,49] has already demonstrated
that in-network ordering, realized using network programma-
bility [16,33], can offer compelling performance benefits to
crash fault tolerant SMR protocols. In this work, we argue
that BFT protocols can similarly benefit from moving the
ordering responsibility to the network. At a high level, by of-
floading request ordering to the network, BFT replicas avoid
explicit communication to establish an execution order, effec-
tively reducing cross-replica coordination and authentication
overhead. This network ordering approach improves both pro-
tocol throughput – less work is performed on each replica –
and latency – fewer message delays are required to commit a
request.
However, achieving message ordering in the network is
more challenging in a BFT setting. In this section, we discuss
the main challenges faced by a BFT-based network ordering
primitive and our approach to address them.
Why authenticated ordering in the network?
In prior
network ordering systems, the network primitive (such as
the Ordered Unreliable Multicast in NOPaxos [43] and the
multi-sequenced groupcast in Eris [42]) is fully responsible
for assigning an order to client requests. With non-Byzantine
participants, this network-level ordering is the only request
order observed by any replica. Unfortunately, in a BFT de-
ployment model, a faulty node can easily impersonate the
network primitive and assign a conflicting message order, vio-
lating the ordering guarantee of the network layer. To tolerate
equivocating Byzantine participants, we augment the network
primitive to provide an authentication property: Non-faulty
replicas can independently validate that the received mes-
sage order is indeed established by the network, not by any
faulty node. We elaborate on how such authentication can be
implemented efficiently on commodity switch hardware in
§4.
Hybrid fault model and Byzantine network.
If the net-
work itself is Byzantine, it can equivocate by assigning differ-
ent message orders to different replicas, violating the ordering
guarantee. In this work, we argue for a dual fault model:
one fault model for the participating nodes and one for the
network. Our argument is inspired by prior work that pro-
poses a hybrid fault model [54] and work [45] that separates
machine faults from network faults. Specifically, we always
assume a Byzantine failure mode for end-hosts. The network
infrastructure, on the other hand, can either be crash-faulty or
Byzantine-faulty. The benefit of our approach is that we offer
flexibility with an explicit trade-off between fault tolerance
and performance. For deployments that trust the network to
only exhibit crash and omission faults, our solution offers the
optimal performance; if the deployment assumes the network
infrastructure can behave arbitrarily, we offer a solution that
tolerates Byzantine faults in the network, albeit taking a small
performance penalty.
We further argue that a hybrid fault model, in which the
network is assumed to be crash-faulty, is a practical option for
many systems deployed in data centers. Networking hardware
presents a smaller attack surface and is less vulnerable to bugs
compared to software-based components. They are single-application ASICs without sophisticated system software, and formal verification of their hardware designs is common
practice. Systems running in data centers also inherently place
some trust in the data center hardware infrastructure. It is
also against the economic interest of data center operators to
violate the trust of their provided services.
Our model resembles existing deployment options in the
public cloud: Only deployments that do not trust the cloud
infrastructure run their virtual machines on instances with
a Trusted Execution Environment (TEE) such as Intel SGX.
The majority of use cases place trust in cloud hardware and hypervisors; in return, they attain higher performance compared to their TEE counterparts.
3.2 Authenticated Ordered Multicast
We have so far argued for an authenticated ordering service
in the network for BFT protocols. To that end, we propose a
new Authenticated Ordered Multicast (AOM) primitive as a
concrete instance of such model. Similar to other multicast
primitives (e.g., IP multicast), an AOM deployment consists
of one or multiple AOM groups, each identified by a unique
group address. AOM receivers can join and leave an AOM group
by contacting a membership service. A sender sends an AOM
message to one AOM group, and the network is responsi-
ble for routing the message to all group receivers. Note that
senders do not know the identity nor the address of individual
receivers; they only specify the group address as the destina-
tion.
Unlike traditional best-effort IP multicast, AOM provides a
set of stronger guarantees, which we formally define here:
Asynchrony. There is no bound on the delivery latency of AOM messages.
Unreliability. There is no guarantee that an AOM message will be received by any receiver in the destination group.
Authentication. A receiver can verify the authenticity of an AOM message, i.e., the message is correctly processed by the AOM network primitive. A correct receiver only delivers authentic AOM messages.
Transferable Authentication. If a receiver r1 forwards an AOM message to another receiver r2, r2 can independently verify the authenticity of the message.
Ordering. For any two authentic AOM messages m1 and m2 destined to the same AOM group G, all correct receivers in G that receive both m1 and m2 deliver them in the same order.
Drop Detection. Receivers can detect AOM message drops in the network and deliver a DROP-NOTIFICATION. Formally, for any authentic AOM message m, either 1) all correct receivers in the destination group G deliver m or a DROP-NOTIFICATION for m before delivering the next AOM message, or 2) none of the correct receivers in G delivers m or a DROP-NOTIFICATION for m.
A key differentiating property of AOM is the capability of
receivers to independently verify the authenticity of an AOM
message. Authenticity in our case does not refer to the identity
of the sender, which still requires end-to-end cryptography.
Instead, receivers can verify that a message is correctly pro-
cessed by the AOM primitive, and the ordering is not forged by any other participant in the system. The authentication capability
is also transferable: an AOM message can be relayed to any
other receiver in the group, and they can independently verify
the authenticity of the message.
Another important guarantee of AOM is non-equivocation,
i.e., the primitive never delivers conflicting orders of messages
to non-faulty receivers in a group. As we show in §4, if the
network is assumed to be Byzantine-faulty, the primitive re-
quires additional confirmation among the group receivers to
guarantee non-equivocation. AOM, however, does not guar-
antee reliable transmission. The implication is that an AOM
message may be delivered by only a subset of the receivers in
a group. AOM ensures that receivers who miss the message
deliver a DROP-NOTIFICATION. The application-level BFT
protocol is then responsible for reaching consensus on the
fate of those dropped messages.
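To tie these guarantees together, the sketch below shows how a receiver can enforce Ordering and Drop Detection locally; it is our illustration only, and the actual receiver logic in Matrix may differ. Authentic messages are delivered strictly by sequence number, and any gap surfaces as a DROP-NOTIFICATION for the application-level BFT protocol to resolve.

```python
# A sketch of AOM receiver-side delivery (illustration only).

class AOMReceiver:
    def __init__(self, verify_fn):
        self.verify = verify_fn     # e.g. a per-receiver MAC check, as sketched earlier
        self.next_seq = 1

    def on_message(self, msg):
        """Translate a raw network message into delivery events for the BFT protocol."""
        if not self.verify(msg):
            return []               # unauthentic messages are never delivered
        events = []
        # Drop Detection: a jump in sequence numbers means earlier authentic
        # AOM messages were lost; notify the protocol before moving on.
        while self.next_seq < msg["seq"]:
            events.append(("DROP-NOTIFICATION", self.next_seq))
            self.next_seq += 1
        if msg["seq"] == self.next_seq:
            events.append(("DELIVER", msg))   # Ordering: deliver in sequence order
            self.next_seq += 1
        return events               # duplicates and stale messages yield no events
```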