
Medha: Microcoded Hardware Accelerator for
Computing on Encrypted Data
Ahmet Can Mert1, Aikata1, Sunmin Kwon2, Youngsam Shin2, Donghoon Yoo2, Yongwoo Lee2 and Sujoy Sinha Roy1
1IAIK, Graz University of Technology, Graz, Austria
{ahmet.mert,aikata,sujoy.sinharoy}@iaik.tugraz.at
2Samsung Advanced Institute of Technology, Suwon, Republic of Korea
{sunmin7.kwon,youngsam.shin,say.yoo,yw0803.lee}@samsung.com
Abstract.
Homomorphic encryption enables computation on encrypted data, and hence it has great potential in privacy-preserving outsourcing of computations to the cloud.
Hardware acceleration of homomorphic encryption is crucial, as software implementations are very slow. In this paper, we present design methodologies for building
a programmable hardware accelerator for speeding up the cloud-side homomorphic
evaluations on encrypted data.
First, we propose a divide-and-conquer technique that enables homomorphic evaluations in the polynomial ring R_{Q,2N} = Z_Q[x]/(x^{2N} + 1) to use a hardware accelerator that has been built for the smaller ring R_{Q,N} = Z_Q[x]/(x^N + 1). The technique makes it possible to use a single hardware accelerator flexibly for supporting several homomorphic encryption parameter sets.
Next, we present several architectural design methods that we use to realize the flexible instruction-set accelerator architecture, which we call ‘Medha’. At every
level of the implementation hierarchy, we explore possibilities for parallel processing.
Starting from hardware-friendly parallel algorithms for the basic building blocks, we
gradually build heavily parallel RNS polynomial arithmetic units. Next, many of these
parallel units are interconnected elegantly so that their interconnections require the
minimum number of nets, therefore making the overall architecture placement-friendly
on the platform. As homomorphic encryption is computation- as well as data-centric,
the speed of homomorphic evaluations depends greatly on the way the data variables
are handled. For Medha, we take a memory-conservative design approach and get rid
of any off-chip memory access during homomorphic evaluations.
Finally, we implement Medha in a Xilinx Alveo U250 FPGA and measure timing
performances of the microcoded homomorphic addition, multiplication, key-switching,
and rescaling routines for the leveled fully homomorphic encryption scheme RNS-
HEAAN at a 200 MHz clock frequency. For the large parameter sets (log Q, N) = (438, 2^14) and (546, 2^15), Medha achieves accelerations of up to 68× and 78×, respectively, compared to the highly optimized software implementation Microsoft SEAL running at 2.3 GHz.¹
Keywords: Homomorphic Encryption, CKKS, Flexible, Hardware Accelerator.
1 Introduction
Cloud computing services are very popular and provide high-performance computational resources to the users [AFG+10]. Despite its advantages, conventional cloud computing has
1Paper will appear at IACR Transactions on Cryptographic Hardware and Embedded Systems 2023.
arXiv:2210.05476v2 [cs.CR] 12 Oct 2022
security and privacy risks, as the user's data becomes visible (as plaintext) during any computation in the cloud. Isolation techniques are employed under certain trust assumptions. Yet, in recent years, several data leaks have been reported.
Fully Homomorphic Encryption (FHE) [RAD78] enables logical and arithmetic operations on encrypted data without requiring any decryption of the data. Clients can keep their data in encrypted format in the cloud and perform computations, e.g., inference, prediction, statistics, etc., directly on the encrypted data without revealing their confidential data. Similarly, cloud-based service providers do not need to reveal their trade secrets, e.g., models or methods, to the clients. If a service provider keeps the model or method secret, a client cannot run the computation locally on her plaintext data. In this way, FHE resolves the conflicting privacy needs of users and service providers.
In 2009, Gentry constructed the first FHE scheme [Gen09]. FHE quickly gained interest from both academia and industry. During the last 10 years, progressively faster FHE schemes have appeared, bringing orders-of-magnitude improvements in performance. There are several FHE or leveled FHE schemes in the literature. The difference between an FHE and a leveled-FHE scheme is that the latter can perform computations correctly only up to a certain complexity level, whereas the former can perform arbitrary computations. It is possible to transform a leveled-FHE scheme into an FHE scheme by introducing a special procedure, ‘bootstrapping’, that refreshes ciphertexts. This paper focuses on the hardware acceleration aspects of leveled-FHE. In the remaining part of the paper, we use the term HE to refer to leveled-FHE. For evaluating arithmetic operations homomorphically, BFV [FV12] and BGV [BGV11] are popular. TFHE [CGGI20] is efficient for evaluating Boolean gates. For performing computations on encrypted real numbers, HEAAN [CKKS17] and its Residue Number System (RNS) variant RNS-HEAAN [CHK+18] are efficient. In fact, RNS-HEAAN is the fastest scheme for performing approximate computations on encrypted real data.
1.1 Related hardware acceleration works and motivation
Although a decade of research in algorithmic and mathematical optimizations has made HE schemes orders of magnitude faster than their first-generation counterparts, homomorphic evaluations in software are still four to five orders of magnitude slower than equivalent computations on plaintext. Therefore, hardware acceleration of HE is crucial for reducing this performance gap. While it is true that developing an accelerator for homomorphic evaluation takes a longer design time and more effort than writing a software program, the reduction in computation time and energy consumption in the cloud server will be huge in the long term.
The earliest reported FPGA-based accelerator [WH13] and ASIC-based accelerator [WHEW14] targeted speeding up the 768K-bit modular multiplication of the first-generation integer-based homomorphic encryption scheme [GH11]. In the following years, several accelerator architectures [RJV+15, TRG+20, RNH19, RTJ+20, FSK+21, XZH21, KKK+22, SFK+22] have implemented selected building blocks of homomorphic encryption on ASIC and FPGA platforms, or presented simulation-based performance estimates without making real prototypes. While such works indicate that ASIC and FPGA platforms have the potential to accelerate HE, they do not fully capture the engineering challenges that appear only in actual hardware implementations, especially when the architectures are very large. Therefore, simulation-based performance estimates may change greatly when real hardware accelerators for homomorphic encryption are built.
Before we describe related hardware acceleration works for cloud-side homomorphic computing, we would like to mention that hardware acceleration can also be used to speed up client-side encryption and decryption operations; one example accelerator is [MOS20]. Compared to homomorphic evaluation on encrypted data, homomorphic encryption and decryption are much simpler and less frequent. They are quite similar to the simple ring-LWE encryption and decryption used in lattice-based post-quantum cryptography. In homomorphic encryption, the real bottleneck is the slowness of cloud-side homomorphic evaluations. Therefore, in this paper we focus only on the hardware acceleration of cloud-side evaluations. Readers may also study symmetric-homomorphic hybrid protocols, where a client simply encrypts the data using a homomorphic-encryption-friendly block cipher such as Pasta [DGH+21] to save communication bandwidth. Thereafter, the cloud evaluates the expensive block-cipher decryption homomorphically before computing the actual task.
Recently, several papers [FSK+21, KKK+22, SFK+22, KLK+22] proposed ASIC-based high-end accelerator architectures and claimed three to four orders of magnitude speedups with respect to software for performing homomorphic evaluations. These works use simulation and logic synthesis for obtaining performance and area estimates respectively, without going through the complete ASIC design flow or fabricating a real ASIC chip. Following the chip fabrication price estimates [MUS], fabricating these ASIC chips would require millions of dollars of investment.
To the best of our knowledge, CoFHEE [NSA+22] is the only ASIC accelerator that has been fabricated and proven in silicon. CoFHEE's total die area is 15 mm², and it accelerates homomorphic evaluations by only up to 2.5× compared to the SEAL software library. In their year-long effort to design the chip, the authors followed the complete ASIC design flow, implemented a custom clock distribution network, performed pre-silicon verification using simulation and FPGA prototyping, and finally performed post-silicon validation to confirm that their ASIC chip works correctly. The authors of CoFHEE raise concerns about the feasibility of F1 [FSK+21] in silicon (see Sec. 9 of [NSA+22]).
Although FPGAs are slower than ASIC platforms, their relatively shorter design cycle, re-programmability to fix bugs easily, reusability, and significantly cheaper price make FPGAs popular for implementing performance-critical algorithms. The FPGA-based programmable accelerators [SRTJ+19, TRV20] demonstrated latency reductions of one order of magnitude compared to software implementations. ‘HEAX’ [RLPD20] obtained more than two orders of magnitude higher throughput with respect to software implementations using one Intel FPGA. While the speedup is impressive, a limitation of HEAX is that it is not programmable, and its block-pipelined architecture was designed specifically for the key-switching of RNS-HEAAN. Contrary to HEAX, the programmable accelerator [SRTJ+19] uses the same computational resources to execute several homomorphic evaluation routines. While programmability is a desired feature in accelerators, the one-order-of-magnitude speedup of HEAX [RLPD20] over the programmable processors [SRTJ+19, TRV20] may give the impression that block-pipelined and specifically optimized accelerators are significantly superior to flexible accelerators for HE.
In this work, we dig deep into architectural explorations to see if programmable
and flexible accelerator architectures can be built without sacrificing performance. The
availability of a programmable accelerator will make it possible to run and accelerate several
types of homomorphic evaluation routines without requiring a new accelerator architecture.
However, developing a real prototype of a flexible and high-performance accelerator for
homomorphic encryption is full of design challenges. This motivates us to see how far
we can push homomorphic computing on encrypted data in practice using programmable
hardware. Sharing the experiences and methodologies for designing a real high-performance
accelerator will help the research community identify the actual engineering challenges as
well as future research directions for potential performance improvements.
Another research gap is the lack of a parameter-flexible accelerator. Homomorphic applications of different complexities (multiplicative depths) demand different parameter sets. Hence, supporting several parameter sets is another important yet currently unfulfilled requirement for cloud-side accelerators. Almost all of the reported accelerators [RJV+18, SRTJ+19, RLPD20, TRV20] have been designed for specific parameter sets, and they lack the flexibility to support more than one parameter set. That motivates us to design a programmable and parameter-flexible accelerator for homomorphic encryption.
1.2 Contributions
To address the above-mentioned research gaps, we design a programmable and parameter-flexible hardware accelerator architecture ‘Medha’ and implement it on a Xilinx Alveo U250 card. Medha accelerates the homomorphic addition, multiplication, key-switching, rescaling, and rotation operations of the RNS-HEAAN [CHK+18] scheme for large parameter sets by around two orders of magnitude compared to software implementations.
The main contributions of our paper reside at both the algorithmic and architectural levels.
First, we propose a design methodology that offers the flexibility to support several
polynomial degrees using a fixed hardware accelerator. It efficiently performs homomorphic
computations on ‘large-degree’ ciphertext-polynomials using compute units that have been
optimized to handle ‘small-degree’ ciphertext-polynomials. Therefore, several homomorphic
applications can be accelerated using the same architecture. Additionally, the proposed
methodology reduces the on-chip memory and logic requirements.
On the architecture side, our main contributions are at the higher levels of the implementation hierarchy, where the different compute and memory elements are organized. To compute the arithmetic of residue polynomials, we design a novel Residue Polynomial Arithmetic Unit (RPAU) pragmatically. Our RPAU contains a multi-core Number Theoretic Transform (NTT) unit for polynomial multiplication, two parallel sets of dyadic arithmetic units, and a customized on-chip memory for storing operand and resultant residue polynomials. The designed RPAU is an instruction-set architecture: it can execute dyadic arithmetic and NTT instructions in parallel. This parallelism is very useful for minimizing the cycle count of the key-switching operation, which is the costliest subroutine in HE. We observe around a 40% reduction in latency at the cost of around a 20% increase in area.
A memory-conservative design approach is followed to save on-chip memory elements
for useful computations. A customized on-chip memory is designed to store residue
polynomials inside the RPAU. Even for the largest supported polynomial degree with
2^15 coefficients, the on-chip memory is able to store all the residue polynomials during
a homomorphic multiplication and key-switching, therefore eliminating the need for any
off-chip data exchange during a computation (which is very slow). We are the first to report
fully on-chip computation of the two HE subroutines for such large-degree polynomials.
At the highest level of the implementation hierarchy, several RPAUs are instantiated
and interconnected. The parallel RPAUs must perform data exchanges between themselves
during the modulus switching steps of the key-switching and rescaling operations. Trivially
connecting every RPAU to the remaining RPAUs demands a quadratic number of nets
and makes actual implementation infeasible when there are several large RPAUs in the
architecture. Therefore, finding an optimal way of interconnecting the RPAUs is critical
due to two main reasons. Firstly, because each RPAU consumes a large area and has
thousands of bits of input/output ports, their placement on the design platform becomes a
challenging engineering problem. Previous works, e.g., [RJV+15, RJV+18], bypassed that engineering problem by performing data exchanges via a shared off-chip memory
at a great performance cost. Secondly, if the data transfer rate is compromised to reduce
the number of interconnects, then there will be a drastic impact on the performance of
key-switching and rescaling. After studying different ways of interconnecting the RPAUs,
we propose a ‘ring’ styled interconnection with efficient scheduling of data transfers. The
ring reduces the number of interconnects to a linear complexity without introducing any
performance loss. We observe that the proposed ring is crucial for making large HE
implementations feasible on SLR-based very large Xilinx FPGAs.
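The complexity argument behind the ring can be checked with a tiny, purely illustrative model (the ring size and forwarding schedule here are hypothetical, not Medha's actual configuration): with R RPAUs on a unidirectional ring, each holding one residue item that every other unit needs during modulus switching, an all-to-all exchange completes in R − 1 neighbor-to-neighbor steps using only R links, versus R·(R − 1) directed point-to-point nets for full connectivity.

```python
# Illustrative model of a ring interconnect between R RPAUs: each unit
# starts with one item, and in every step forwards the item it most
# recently received to its ring neighbor. After R-1 steps every unit
# holds all R items, using R links instead of R*(R-1) point-to-point nets.
R = 5
data = [[i] for i in range(R)]        # RPAU i starts with its own item i

for step in range(R - 1):
    # snapshot the item each RPAU forwards this step, then deliver it
    moving = [d[-1] for d in data]
    for i in range(R):
        data[(i + 1) % R].append(moving[i])

# after R-1 steps each RPAU has received every item exactly once
assert all(sorted(d) == list(range(R)) for d in data)
print("links: ring =", R, "vs full mesh =", R * (R - 1))
```

With transfers overlapping in every step, the total exchange time stays linear in R, which matches the claim that the ring loses no performance relative to full connectivity for this access pattern.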
Besides the above-mentioned main contributions, we make optimizations at the lower
levels of the implementation hierarchy where polynomial and coefficient operations are
performed. We implement a unified and multi-core NTT-based multiplier with optimal
scheduling for memory reads and writes. We use hardware-friendly parameters (e.g.,
word-size, primes, etc.) and parallel algorithms to perform fast modular arithmetic.
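The multiplier mentioned above builds on the well-known negacyclic NTT. As a hedged, minimal illustration (toy parameters, naive O(N²) transforms rather than Medha's multi-core butterflies or its hardware-friendly primes), the following sketch shows why pointwise multiplication in the transform domain realizes polynomial multiplication mod x^N + 1:

```python
# Negacyclic NTT sketch: for a prime q with 2N | q-1 and psi a primitive
# 2N-th root of unity mod q, the odd powers psi^(2i+1) are exactly the
# roots of x^N + 1, so evaluating there turns multiplication in
# Z_q[x]/(x^N + 1) into pointwise products.
q, N = 257, 8
psi = 2                    # 2^8 = 256 = -1 (mod 257), so 2 has order 2N = 16

def ntt(a):
    """Forward transform: A[i] = a(psi^(2i+1)) mod q."""
    return [sum(a[j] * pow(psi, (2 * i + 1) * j, q) for j in range(N)) % q
            for i in range(N)]

def intt(A):
    """Inverse transform: a[j] = N^-1 * sum_i A[i] * psi^(-(2i+1)j) mod q."""
    psi_inv, n_inv = pow(psi, -1, q), pow(N, -1, q)
    return [(n_inv * sum(A[i] * pow(psi_inv, (2 * i + 1) * j, q)
                         for i in range(N))) % q for j in range(N)]

def poly_mul(a, b):
    """Multiplication in Z_q[x]/(x^N + 1) via pointwise NTT-domain product."""
    return intt([(x * y) % q for x, y in zip(ntt(a), ntt(b))])

# multiplying by x is a negacyclic shift: the top coefficient wraps negated
assert poly_mul([1, 2, 3, 4, 5, 6, 7, 8], [0, 1, 0, 0, 0, 0, 0, 0]) \
       == [(-8) % q, 1, 2, 3, 4, 5, 6, 7]
```

A hardware implementation replaces the O(N²) loops with O(N log N) butterfly networks and replaces `pow(..., q)` with fast modular reduction for specially chosen primes, but the algebraic identity exercised here is the same.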
Organization:
The paper is organized as follows. Sec. 2 presents a brief background. Next, we present the proposed flexible design methodology for the polynomial degree and its applications to the RNS-HEAAN scheme in Sec. 3. In Sec. 4, the proposed flexible instruction-set accelerator is presented. The accelerator architecture is realized hierarchically, starting from low-level polynomial arithmetic units in Sec. 4.2 and then organizing the different compute and memory elements in Sec. 4.3. Detailed experimental results and comparisons are provided in Sec. 5, and the final section draws the conclusions.
2 Background
2.1 Notation
Let Z_Q represent the ring of integers in the range [0, Q − 1]. Any modular reduction by a modulus q into the range [0, q − 1] is denoted as [·]_q. The polynomial ring R_{Q,N} = Z_Q[x]/(x^N + 1) contains polynomials of degree at most N − 1 with coefficients in Z_Q. An integer is represented using a normal font and a lowercase letter, e.g., a ∈ Z_Q. A polynomial is represented using a bold font and a lowercase letter, a ∈ R_{Q,N}. When a residue number system (RNS) is used with a composite modulus Q = ∏_{i=0}^{L−1} q_i, a polynomial in R_{Q,N} becomes a vector of residue polynomials in the RNS. Let a[i] ∈ R_{q_i,N} be the i-th residue polynomial in the RNS representation of a ∈ R_{Q,N}. R^k_{Q,N} represents a k-tuple of polynomials from R_{Q,N}. Thereafter, the i-th residue polynomial of the j-th member of the tuple a ∈ R^k_{Q,N} is represented as a[j][i] ∈ R_{q_i,N}.
The Number Theoretic Transform (NTT) of a polynomial a is represented by ã. The multiplication between two elements of a ring is denoted by the · operator. The coefficient-wise multiplication between two polynomials is denoted by the ⋆ operator. Multiplying all the coefficients of a polynomial a by an integer scalar c is denoted by ac or ca.
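Since the RNS representation is central to everything that follows, a minimal sketch may help: a value a ∈ Z_Q with Q = ∏ q_i is stored as its vector of residues a mod q_i, arithmetic then acts independently per residue (which is what makes per-modulus arithmetic units possible), and the Chinese Remainder Theorem reconstructs the integer. The moduli below are hypothetical toy primes, not the hardware-friendly primes used by the accelerator.

```python
# Toy RNS arithmetic: residue-wise operations commute with CRT
# reconstruction, so no multi-precision carries are ever needed.
qs = [97, 101, 103]            # pairwise-coprime RNS base (toy primes)
Q = 97 * 101 * 103             # composite modulus Q = prod q_i

def to_rns(a):
    """Decompose a in Z_Q into its vector of residues."""
    return [a % qi for qi in qs]

def from_rns(res):
    """CRT: a = sum_i r_i * (Q/q_i) * ((Q/q_i)^-1 mod q_i) mod Q."""
    total = 0
    for ri, qi in zip(res, qs):
        Qi = Q // qi
        total += ri * Qi * pow(Qi, -1, qi)   # modular inverse (Python 3.8+)
    return total % Q

x, y = 123456, 654321
# multiplication happens independently in each small modulus
prod = [(rx * ry) % qi for rx, ry, qi in zip(to_rns(x), to_rns(y), qs)]
assert from_rns(prod) == (x * y) % Q
```

Applied coefficient-wise to polynomials, this is exactly why a polynomial in R_{Q,N} becomes a vector of residue polynomials, each of which fits in a small machine word.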
2.2 Homomorphic Encryption
In a typical homomorphic encryption protocol, there are two parties: a client and a cloud server. The cloud contains data encrypted (i.e., ciphertexts) by the client, and the client performs computations on its encrypted data in the cloud. At the end of the computations, the client receives the encrypted results from the cloud and performs decryptions locally to recover the plaintext results. The encryption and decryption operations are performed using the public-key and secret-key of the client, respectively.
Several ideal lattice-based homomorphic encryption schemes, e.g., BGV [BGV11], BFV [FV12], and HEAAN [CHK+18], use the following framework. Let a client's secret-key be sk = (1, s) ∈ R^2_{Q,N} and the corresponding public-key be pk = (b, a) ∈ R^2_{Q,N}. Each key is a pair of polynomials in the polynomial ring R_{Q,N}, where Q is the coefficient-modulus and N is the polynomial ring degree. The client encrypts a message m using pk and obtains the ciphertext ct ← (c0 = r·b + e0 + m, c1 = r·a + e1) ∈ R^2_{Q,N}, where each ei is a Gaussian-distributed error polynomial and r is a uniformly random polynomial. Let a cloud contain two ciphertexts ct = (c0, c1) and ct′ = (c0′, c1′) ∈ R^2_{Q,N} of the client, encrypting the messages m and m′ respectively. The cloud can compute a valid encryption of m + m′ simply by adding the two ciphertexts as ct_add ← (c0 + c0′, c1 + c1′) ∈ R^2_{Q,N}. Computing an encryption of m·m′ is relatively complex, scheme-specific, and involves several steps. First, the two ciphertexts are multiplied to obtain ct_mult = (c0·c0′, c0·c1′ + c1·c0′, c1·c1′) ∈ R^3_{Q,N}. This intermediate result has three polynomial components and could be decrypted using (1, s, s^2) but not using sk = (1, s). Next, a special operation known as the ‘Key-Switching’,
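A minimal toy instance of this framework can make the addition identity concrete. The sketch below follows the equations above, but with the error polynomials set to zero so that decryption is exact (real schemes add small Gaussian noise, and CKKS decrypts only approximately); all parameters are illustrative and not secure.

```python
# Toy instance of the HE framework in R_{Q,N} = Z_Q[x]/(x^N + 1).
# With b = -a*s (error omitted), decryption c0 + c1*s recovers m exactly,
# and component-wise ciphertext addition adds the underlying messages.
import random

Q, N = 2**16 + 1, 8

def add(u, v): return [(x + y) % Q for x, y in zip(u, v)]
def neg(u):    return [(-x) % Q for x in u]

def mul(u, v):
    """Negacyclic product in Z_Q[x]/(x^N + 1) (schoolbook)."""
    res = [0] * N
    for i in range(N):
        for j in range(N):
            t = u[i] * v[j]
            if i + j < N:
                res[i + j] = (res[i + j] + t) % Q
            else:                      # x^N = -1 wraps with a sign flip
                res[i + j - N] = (res[i + j - N] - t) % Q
    return res

def keygen():
    s = [random.randrange(2) for _ in range(N)]     # small secret s
    a = [random.randrange(Q) for _ in range(N)]     # uniform a
    b = neg(mul(a, s))                              # b = -a*s (+ e, omitted)
    return (1, s), (b, a)                           # sk = (1, s), pk = (b, a)

def encrypt(pk, m):
    b, a = pk
    r = [random.randrange(2) for _ in range(N)]     # small random r
    return (add(mul(r, b), m), mul(r, a))           # (c0, c1), e0 = e1 = 0

def decrypt(sk, ct):
    c0, c1 = ct
    return add(c0, mul(c1, sk[1]))                  # c0 + c1*s

sk, pk = keygen()
m, m2 = [3, 1, 4, 1, 5, 9, 2, 6], [2, 7, 1, 8, 2, 8, 1, 8]
ct, ct2 = encrypt(pk, m), encrypt(pk, m2)
ct_add = (add(ct[0], ct2[0]), add(ct[1], ct2[1]))   # homomorphic addition
assert decrypt(sk, ct) == m
assert decrypt(sk, ct_add) == add(m, m2)
```

Multiplication would produce the three-component intermediate result described above, whose extra s^2 term is what key-switching removes; that step is deliberately omitted here.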