
Medha: Microcoded Hardware Accelerator for
Computing on Encrypted Data
Ahmet Can Mert1, Aikata1, Sunmin Kwon2, Youngsam Shin2, Donghoon Yoo2, Yongwoo Lee2 and Sujoy Sinha Roy1
1IAIK, Graz University of Technology, Graz, Austria
{ahmet.mert,aikata,sujoy.sinharoy}@iaik.tugraz.at
2Samsung Advanced Institute of Technology, Suwon, Republic of Korea
{sunmin7.kwon,youngsam.shin,say.yoo,yw0803.lee}@samsung.com
Abstract.
Homomorphic encryption enables computation on encrypted data, and hence it has great potential in privacy-preserving outsourcing of computations to the cloud.
Hardware acceleration of homomorphic encryption is crucial, as software implementations are very slow. In this paper, we present design methodologies for building
a programmable hardware accelerator for speeding up the cloud-side homomorphic
evaluations on encrypted data.
First, we propose a divide-and-conquer technique that enables homomorphic evaluations in the polynomial ring R_{Q,2N} = Z_Q[x]/(x^{2N} + 1) to use a hardware accelerator that has been built for the smaller ring R_{Q,N} = Z_Q[x]/(x^N + 1). The technique makes it possible to use a single hardware accelerator flexibly for supporting several homomorphic encryption parameter sets.
Next, we present several architectural design methods that we use to realize the flexible instruction-set accelerator architecture, which we call ‘Medha’. At every
level of the implementation hierarchy, we explore possibilities for parallel processing.
Starting from hardware-friendly parallel algorithms for the basic building blocks, we
gradually build heavily parallel RNS polynomial arithmetic units. Next, many of these
parallel units are interconnected elegantly so that their interconnections require the
minimum number of nets, therefore making the overall architecture placement-friendly
on the platform. As homomorphic encryption is computation- as well as data-centric,
the speed of homomorphic evaluations depends greatly on the way the data variables
are handled. For Medha, we take a memory-conservative design approach and get rid
of any off-chip memory access during homomorphic evaluations.
Finally, we implement Medha in a Xilinx Alveo U250 FPGA and measure timing
performances of the microcoded homomorphic addition, multiplication, key-switching,
and rescaling routines for the leveled fully homomorphic encryption scheme RNS-
HEAAN at a 200 MHz clock frequency. For the large parameter sets (log Q, N) = (438, 2^14) and (546, 2^15), Medha achieves accelerations of up to 68× and 78×, respectively, compared to the highly optimized software implementation Microsoft SEAL running at 2.3 GHz.¹
Keywords: Homomorphic Encryption, CKKS, Flexible, Hardware Accelerator.
1 Introduction
Cloud computing services are very popular and provide high-performance computational resources to the users [AFG+10]. Despite its advantages, conventional cloud computing has
1Paper will appear at IACR Transactions on Cryptographic Hardware and Embedded Systems 2023.
arXiv:2210.05476v2 [cs.CR] 12 Oct 2022
security and privacy risks, as the user's data becomes visible (as plaintext) during any computation in the cloud. Isolation techniques are employed under certain trust assumptions. Yet, in recent years, several data leaks have been reported.
Fully Homomorphic Encryption (FHE) [RAD78] enables logical and arithmetic operations on encrypted data without requiring any decryption of the data. Clients can keep their data in encrypted format in the cloud and perform computations, e.g., inference, prediction, statistics, etc., directly on the encrypted data without revealing their confidential data. Similarly, cloud-based service providers do not need to reveal their trade secrets, e.g., models or methods, to the clients. If a service provider keeps the model or method secret, a client cannot run the computation locally on her plaintext data. In this way, FHE resolves the conflicting privacy needs of users and service providers.
In 2009, Gentry constructed the first FHE scheme [Gen09]. FHE quickly gained interest from both academia and industry. During the last 10 years, progressively faster FHE schemes have appeared, bringing orders-of-magnitude improvements in performance. There are several FHE or leveled FHE schemes in the literature. The difference between an FHE and a leveled-FHE scheme is that the latter can perform computations correctly only up to a certain complexity level, whereas the former can perform arbitrary computations. It is possible to transform a leveled-FHE scheme into an FHE scheme by introducing a special procedure, ‘bootstrapping’, that refreshes ciphertexts. This paper focuses on the hardware acceleration aspects of leveled-FHE. In the remaining part of the paper, we use the term HE to refer to leveled-FHE. For evaluating arithmetic operations homomorphically, BFV [FV12] and BGV [BGV11] are popular. TFHE [CGGI20] is efficient for evaluating Boolean gates. For performing computations on encrypted real numbers, HEAAN [CKKS17] and its Residue Number System (RNS) variant RNS-HEAAN [CHK+18] are efficient. In fact, RNS-HEAAN is the fastest scheme for performing approximate computations on encrypted real data.
1.1 Related hardware acceleration works and motivation
Although a decade of research in algorithmic and mathematical optimizations has made HE schemes orders of magnitude faster than their first-generation counterparts, homomorphic evaluations in software are still four to five orders of magnitude slower than equivalent computations on plaintext. Therefore, hardware acceleration of HE is crucial for reducing this performance gap. While it is true that developing an accelerator for homomorphic evaluation takes a longer design time and more effort than writing a software program, the reduction in computation time and energy consumption in the cloud server will be huge in the long term.
The earliest reported FPGA-based accelerator [WH13] and ASIC-based accelerator [WHEW14] targeted speeding up the 768K-bit modular multiplication of the first-generation integer-based homomorphic encryption scheme [GH11]. In the following years, several accelerator architectures [RJV+15, TRG+20, RNH19, RTJ+20, FSK+21, XZH21, KKK+22, SFK+22] have implemented selected building blocks of homomorphic encryption on ASIC and FPGA platforms, or presented simulation-based performance estimates without making real prototypes. While such works indicate that ASIC and FPGA platforms have the potential to accelerate HE, they do not fully capture the engineering challenges that appear only in actual hardware implementations, especially when the architectures are very large. Therefore, simulation-based performance estimates may change greatly when real hardware accelerators for homomorphic encryption are built.
Before we describe related hardware acceleration works for cloud-side homomorphic computing, we would like to mention that hardware acceleration can also be used to speed up client-side encryption and decryption operations; one example accelerator is [MOS20]. Compared to homomorphic evaluation on encrypted data, homomorphic encryption and decryption are much simpler and less frequent. They are quite similar to the simple ring-LWE encryption and decryption used in lattice-based post-quantum cryptography. In homomorphic encryption, the real bottleneck is the slowness of cloud-side homomorphic evaluations. Therefore, in this paper we focus only on the hardware acceleration of cloud-side evaluations. Readers may also study symmetric-homomorphic hybrid protocols, where a client simply encrypts the data using a homomorphic-encryption-friendly block cipher such as Pasta [DGH+21] to save communication bandwidth. Thereafter, the cloud evaluates the expensive block-cipher decryption homomorphically before computing the actual task.
Recently, several papers [FSK+21, KKK+22, SFK+22, KLK+22] proposed ASIC-based high-end accelerator architectures and claimed three to four orders of magnitude speedups with respect to software for performing homomorphic evaluations. These works use simulation and logic synthesis for obtaining performance and area estimates respectively, without going through the complete ASIC design flow or fabricating a real ASIC chip. Following the chip fabrication price estimates [MUS], fabricating these ASIC chips would require millions of dollars of investment.
To the best of our knowledge, CoFHEE [NSA+22] is the only ASIC accelerator that has been fabricated and proven in silicon. CoFHEE's total die area is 15 mm², and it accelerates homomorphic evaluations by only up to 2.5× compared to the SEAL software library. In their year-long effort to design the chip, the authors followed the complete ASIC design flow, implemented a custom clock distribution network, performed pre-silicon verification using simulation and FPGA prototyping, and finally performed post-silicon validation to confirm that their ASIC chip works correctly. The authors of CoFHEE raise concerns about the feasibility of F1 [FSK+21] in silicon (see Sec. 9 of [NSA+22]).
Although FPGAs are slower than ASIC platforms, their relatively shorter design cycle, re-programmability to fix bugs easily, reusability, and significantly cheaper price make FPGAs popular for implementing performance-critical algorithms. The FPGA-based programmable accelerators [SRTJ+19, TRV20] demonstrated latency reductions of one order of magnitude compared to software implementations. ‘HEAX’ [RLPD20] obtained more than two orders of magnitude higher throughput with respect to software implementations using one Intel FPGA. While the speedup is impressive, a limitation of HEAX is that it is not programmable, and its block-pipelined architecture was designed specifically for the key-switching of RNS-HEAAN. Contrary to HEAX, the programmable accelerator [SRTJ+19] uses the same computational resources to execute several homomorphic evaluation routines. While programmability is a desired feature in accelerators, the one-order-of-magnitude speedup of HEAX [RLPD20] over the programmable processors [SRTJ+19, TRV20] may give the impression that block-pipelined and specifically optimized accelerators are significantly superior to flexible accelerators for HE.
In this work, we dig deep into architectural explorations to see if programmable
and flexible accelerator architectures can be built without sacrificing performance. The
availability of a programmable accelerator will make it possible to run and accelerate several
types of homomorphic evaluation routines without requiring a new accelerator architecture.
However, developing a real prototype of a flexible and high-performance accelerator for
homomorphic encryption is full of design challenges. This motivates us to see how far
we can push homomorphic computing on encrypted data in practice using programmable
hardware. Sharing the experiences and methodologies for designing a real high-performance
accelerator will help the research community identify the actual engineering challenges as
well as future research directions for potential performance improvements.
Another research gap is the lack of a parameter-flexible accelerator. Homomorphic applications of different complexities (multiplicative depths) demand different parameter sets. Hence, supporting several parameter sets is another important yet currently unfulfilled requirement for cloud-side accelerators. Almost all of the reported accelerators [RJV+18, SRTJ+19, RLPD20, TRV20] have been designed for specific parameter sets, and they lack the flexibility to support more than one parameter set. That motivates us to design a programmable and parameter-flexible accelerator for homomorphic encryption.
1.2 Contributions
To address the above-mentioned research gaps, we design a programmable and parameter-flexible hardware accelerator architecture ‘Medha’ and implement it on a Xilinx Alveo U250 card. Medha accelerates the homomorphic addition, multiplication, key-switching, rescaling, and rotation operations of the RNS-HEAAN [CHK+18] scheme for large parameter sets by around two orders of magnitude compared to software implementations.
The main contributions of our paper reside at both the algorithmic and architectural levels.
First, we propose a design methodology that offers the flexibility to support several
polynomial degrees using a fixed hardware accelerator. It efficiently performs homomorphic
computations on ‘large-degree’ ciphertext-polynomials using compute units that have been
optimized to handle ‘small-degree’ ciphertext-polynomials. Therefore, several homomorphic
applications can be accelerated using the same architecture. Additionally, the proposed
methodology reduces the on-chip memory and logic requirements.
On the architecture side, our main contributions are at the higher levels of the implementation hierarchy, where the different compute and memory elements are organized. To compute the arithmetic of residue polynomials, we design a novel Residue Polynomial Arithmetic Unit (RPAU) pragmatically. Our RPAU contains a multi-core Number Theoretic Transform (NTT) unit for polynomial multiplication, two parallel sets of dyadic arithmetic units, and a customized on-chip memory for storing operand and resultant residue polynomials. The designed RPAU is an instruction-set architecture: it can execute dyadic arithmetic and NTT instructions in parallel. This parallelism is very useful for minimizing the cycle count of the key-switching operation, which is the costliest subroutine in HE. We observe around a 40% reduction in latency at the cost of around a 20% increase in area.
A memory-conservative design approach is followed to save on-chip memory elements
for useful computations. A customized on-chip memory is designed to store residue
polynomials inside the RPAU. Even for the largest supported polynomial degree with
2^15 coefficients, the on-chip memory is able to store all the residue polynomials during
a homomorphic multiplication and key-switching, therefore eliminating the need for any
off-chip data exchange during a computation (which is very slow). We are the first to report
fully on-chip computation of the two HE subroutines for such large-degree polynomials.
At the highest level of the implementation hierarchy, several RPAUs are instantiated
and interconnected. The parallel RPAUs must perform data exchanges between themselves
during the modulus switching steps of the key-switching and rescaling operations. Trivially
connecting every RPAU to the remaining RPAUs demands a quadratic number of nets
and makes actual implementation infeasible when there are several large RPAUs in the
architecture. Therefore, finding an optimal way of interconnecting the RPAUs is critical
due to two main reasons. Firstly, because each RPAU consumes a large area and has
thousands of bits of input/output ports, their placement on the design platform becomes a
challenging engineering problem. Previous works, e.g., [RJV+15, RJV+18], bypassed that engineering problem by performing data exchanges via a shared off-chip memory
at a great performance cost. Secondly, if the data transfer rate is compromised to reduce
the number of interconnects, then there will be a drastic impact on the performance of
key-switching and rescaling. After studying different ways of interconnecting the RPAUs,
we propose a ‘ring’ styled interconnection with efficient scheduling of data transfers. The
ring reduces the number of interconnects to a linear complexity without introducing any
performance loss. We observe that the proposed ring is crucial for making large HE
implementations feasible on SLR-based very large Xilinx FPGAs.
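The complexity argument behind the ring can be checked with a tiny, purely illustrative model (the ring size and forwarding schedule here are hypothetical, not Medha's actual configuration): with R RPAUs on a unidirectional ring, each holding one residue item that every other unit needs during modulus switching, an all-to-all exchange completes in R − 1 neighbor-to-neighbor steps using only R links, versus R·(R − 1) directed point-to-point nets for full connectivity.

```python
# Illustrative model of a ring interconnect between R RPAUs: each unit
# starts with one item, and in every step forwards the item it most
# recently received to its ring neighbor. After R-1 steps every unit
# holds all R items, using R links instead of R*(R-1) point-to-point nets.
R = 5
data = [[i] for i in range(R)]        # RPAU i starts with its own item i

for step in range(R - 1):
    # snapshot the item each RPAU forwards this step, then deliver it
    moving = [d[-1] for d in data]
    for i in range(R):
        data[(i + 1) % R].append(moving[i])

# after R-1 steps each RPAU has received every item exactly once
assert all(sorted(d) == list(range(R)) for d in data)
print("links: ring =", R, "vs full mesh =", R * (R - 1))
```

With transfers overlapping in every step, the total exchange time stays linear in R, which matches the claim that the ring loses no performance relative to full connectivity for this access pattern.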
Besides the above-mentioned main contributions, we make optimizations at the lower
levels of the implementation hierarchy where polynomial and coefficient operations are
performed. We implement a unified and multi-core NTT-based multiplier with optimal
scheduling for memory reads and writes. We use hardware-friendly parameters (e.g.,
word-size, primes, etc.) and parallel algorithms to perform fast modular arithmetic.
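The multiplier mentioned above builds on the well-known negacyclic NTT. As a hedged, minimal illustration (toy parameters, naive O(N²) transforms rather than Medha's multi-core butterflies or its hardware-friendly primes), the following sketch shows why pointwise multiplication in the transform domain realizes polynomial multiplication mod x^N + 1:

```python
# Negacyclic NTT sketch: for a prime q with 2N | q-1 and psi a primitive
# 2N-th root of unity mod q, the odd powers psi^(2i+1) are exactly the
# roots of x^N + 1, so evaluating there turns multiplication in
# Z_q[x]/(x^N + 1) into pointwise products.
q, N = 257, 8
psi = 2                    # 2^8 = 256 = -1 (mod 257), so 2 has order 2N = 16

def ntt(a):
    """Forward transform: A[i] = a(psi^(2i+1)) mod q."""
    return [sum(a[j] * pow(psi, (2 * i + 1) * j, q) for j in range(N)) % q
            for i in range(N)]

def intt(A):
    """Inverse transform: a[j] = N^-1 * sum_i A[i] * psi^(-(2i+1)j) mod q."""
    psi_inv, n_inv = pow(psi, -1, q), pow(N, -1, q)
    return [(n_inv * sum(A[i] * pow(psi_inv, (2 * i + 1) * j, q)
                         for i in range(N))) % q for j in range(N)]

def poly_mul(a, b):
    """Multiplication in Z_q[x]/(x^N + 1) via pointwise NTT-domain product."""
    return intt([(x * y) % q for x, y in zip(ntt(a), ntt(b))])

# multiplying by x is a negacyclic shift: the top coefficient wraps negated
assert poly_mul([1, 2, 3, 4, 5, 6, 7, 8], [0, 1, 0, 0, 0, 0, 0, 0]) \
       == [(-8) % q, 1, 2, 3, 4, 5, 6, 7]
```

A hardware implementation replaces the O(N²) loops with O(N log N) butterfly networks and replaces `pow(..., q)` with fast modular reduction for specially chosen primes, but the algebraic identity exercised here is the same.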
Organization:
The paper is organized as follows. Sec. 2 presents a brief background. Next, we present the proposed flexible design methodology for the polynomial degree and its applications to the RNS-HEAAN scheme in Sec. 3. In Sec. 4, the proposed flexible instruction-set accelerator is presented. The accelerator architecture is realized hierarchically, starting from low-level polynomial arithmetic units in Sec. 4.2 and then organizing the different compute and memory elements in Sec. 4.3. Detailed experimental results and comparisons are provided in Sec. 5, and the final section draws the conclusions.
2 Background
2.1 Notation
Let Z_Q represent the ring of integers in the range [0, Q − 1]. Any modular reduction by a modulus q into the range [0, q − 1] is denoted as [·]_q. The polynomial ring R_{Q,N} = Z_Q[x]/(x^N + 1) contains polynomials of degree at most N − 1 with coefficients in Z_Q. An integer is represented using a normal font and a lowercase letter, e.g., a ∈ Z_Q. A polynomial is represented using a bold font and a lowercase letter, a ∈ R_{Q,N}. When a residue number system (RNS) is used with a composite modulus Q = ∏_{i=0}^{L−1} q_i, a polynomial in R_{Q,N} becomes a vector of residue polynomials in the RNS. Let a[i] ∈ R_{q_i,N} be the i-th residue polynomial in the RNS representation of a ∈ R_{Q,N}. R^k_{Q,N} represents a k-tuple of polynomials from R_{Q,N}. Thereafter, the i-th residue polynomial of the j-th member of the tuple a ∈ R^k_{Q,N} is represented as a[j][i] ∈ R_{q_i,N}.
The Number Theoretic Transform (NTT) of a polynomial a is represented by ã. The multiplication between two elements of a ring is denoted by the · operator. The coefficient-wise multiplication between two polynomials is denoted by the ⋆ operator. Multiplying all the coefficients of a polynomial a by an integer scalar c is denoted by ac or ca.
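Since the RNS representation is central to everything that follows, a minimal sketch may help: a value a ∈ Z_Q with Q = ∏ q_i is stored as its vector of residues a mod q_i, arithmetic then acts independently per residue (which is what makes per-modulus arithmetic units possible), and the Chinese Remainder Theorem reconstructs the integer. The moduli below are hypothetical toy primes, not the hardware-friendly primes used by the accelerator.

```python
# Toy RNS arithmetic: residue-wise operations commute with CRT
# reconstruction, so no multi-precision carries are ever needed.
qs = [97, 101, 103]            # pairwise-coprime RNS base (toy primes)
Q = 97 * 101 * 103             # composite modulus Q = prod q_i

def to_rns(a):
    """Decompose a in Z_Q into its vector of residues."""
    return [a % qi for qi in qs]

def from_rns(res):
    """CRT: a = sum_i r_i * (Q/q_i) * ((Q/q_i)^-1 mod q_i) mod Q."""
    total = 0
    for ri, qi in zip(res, qs):
        Qi = Q // qi
        total += ri * Qi * pow(Qi, -1, qi)   # modular inverse (Python 3.8+)
    return total % Q

x, y = 123456, 654321
# multiplication happens independently in each small modulus
prod = [(rx * ry) % qi for rx, ry, qi in zip(to_rns(x), to_rns(y), qs)]
assert from_rns(prod) == (x * y) % Q
```

Applied coefficient-wise to polynomials, this is exactly why a polynomial in R_{Q,N} becomes a vector of residue polynomials, each of which fits in a small machine word.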
2.2 Homomorphic Encryption
In a typical homomorphic encryption protocol, there are two parties: a client and a cloud server. The cloud contains data encrypted (i.e., ciphertexts) by the client, and the client performs computations on its encrypted data in the cloud. At the end of the computations, the client receives the encrypted results from the cloud and performs decryptions locally to recover the plaintext results. The encryption and decryption operations are performed using the public-key and secret-key of the client, respectively.
Several ideal lattice-based homomorphic encryption schemes, e.g., BGV [BGV11], BFV [FV12], and HEAAN [CHK+18], use the following framework. Let a client's secret-key be sk = (1, s) ∈ R^2_{Q,N} and the corresponding public-key be pk = (b, a) ∈ R^2_{Q,N}. Each key is a pair of polynomials in the polynomial ring R_{Q,N}, where Q is the coefficient-modulus and N is the polynomial ring degree. The client encrypts a message m using pk and obtains the ciphertext ct ← (c0 = r·b + e0 + m, c1 = r·a + e1) ∈ R^2_{Q,N}, where each ei is a Gaussian-distributed error polynomial and r is a uniformly random polynomial. Let a cloud contain two ciphertexts ct = (c0, c1) and ct′ = (c0′, c1′) ∈ R^2_{Q,N} of the client, encrypting the messages m and m′ respectively. The cloud can compute a valid encryption of m + m′ simply by adding the two ciphertexts as ct_add ← (c0 + c0′, c1 + c1′) ∈ R^2_{Q,N}. Computing an encryption of m·m′ is relatively complex, scheme-specific, and involves several steps. First, the two ciphertexts are multiplied to obtain ct_mult = (c0·c0′, c0·c1′ + c1·c0′, c1·c1′) ∈ R^3_{Q,N}. This intermediate result has three polynomial components and could be decrypted using (1, s, s^2) but not using sk = (1, s). Next, a special operation known as the ‘Key-Switching’,
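A minimal toy instance of this framework can make the addition identity concrete. The sketch below follows the equations above, but with the error polynomials set to zero so that decryption is exact (real schemes add small Gaussian noise, and CKKS decrypts only approximately); all parameters are illustrative and not secure.

```python
# Toy instance of the HE framework in R_{Q,N} = Z_Q[x]/(x^N + 1).
# With b = -a*s (error omitted), decryption c0 + c1*s recovers m exactly,
# and component-wise ciphertext addition adds the underlying messages.
import random

Q, N = 2**16 + 1, 8

def add(u, v): return [(x + y) % Q for x, y in zip(u, v)]
def neg(u):    return [(-x) % Q for x in u]

def mul(u, v):
    """Negacyclic product in Z_Q[x]/(x^N + 1) (schoolbook)."""
    res = [0] * N
    for i in range(N):
        for j in range(N):
            t = u[i] * v[j]
            if i + j < N:
                res[i + j] = (res[i + j] + t) % Q
            else:                      # x^N = -1 wraps with a sign flip
                res[i + j - N] = (res[i + j - N] - t) % Q
    return res

def keygen():
    s = [random.randrange(2) for _ in range(N)]     # small secret s
    a = [random.randrange(Q) for _ in range(N)]     # uniform a
    b = neg(mul(a, s))                              # b = -a*s (+ e, omitted)
    return (1, s), (b, a)                           # sk = (1, s), pk = (b, a)

def encrypt(pk, m):
    b, a = pk
    r = [random.randrange(2) for _ in range(N)]     # small random r
    return (add(mul(r, b), m), mul(r, a))           # (c0, c1), e0 = e1 = 0

def decrypt(sk, ct):
    c0, c1 = ct
    return add(c0, mul(c1, sk[1]))                  # c0 + c1*s

sk, pk = keygen()
m, m2 = [3, 1, 4, 1, 5, 9, 2, 6], [2, 7, 1, 8, 2, 8, 1, 8]
ct, ct2 = encrypt(pk, m), encrypt(pk, m2)
ct_add = (add(ct[0], ct2[0]), add(ct[1], ct2[1]))   # homomorphic addition
assert decrypt(sk, ct) == m
assert decrypt(sk, ct_add) == add(m, m2)
```

Multiplication would produce the three-component intermediate result described above, whose extra s^2 term is what key-switching removes; that step is deliberately omitted here.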