
Mert, Aikata, Kwon, Shin, Yoo, Lee, Sinha Roy 3
encryption and decryption used in lattice-based post-quantum cryptography. In homomorphic
encryption, the real bottleneck is the slowness of cloud-side homomorphic evaluations.
Therefore, in this paper we focus only on the hardware acceleration of cloud-side evaluations.
Readers may also study symmetric-homomorphic hybrid protocols, where a client
simply encrypts the data using a homomorphic-encryption-friendly block cipher such as
Pasta [DGH+21] to save communication bandwidth. Thereafter, the cloud evaluates the
expensive block-cipher decryption homomorphically before computing the actual task.
Recently, several papers [FSK+21, KKK+22, SFK+22, KLK+22] proposed ASIC-based
high-end accelerator architectures and claimed three to four orders of magnitude speedups
with respect to software for performing homomorphic evaluations. These works use
simulation and logic synthesis for obtaining performance and area estimates respectively,
without going through the complete ASIC design flow or fabricating a real ASIC chip.
According to chip fabrication price estimates [MUS], fabricating these ASIC chips would
require millions of dollars of investment.
To the best of our knowledge, CoFHEE [NSA+22] is the only ASIC accelerator that
has been fabricated and proven in silicon. CoFHEE’s total die area is 15 mm² and it
accelerates homomorphic evaluations only up to 2.5× compared to the SEAL software
library. In their year-long effort to design the chip, the authors follow the complete ASIC
design flow, implement a custom clock distribution network, perform pre-silicon verification
using simulation and FPGA prototyping, and finally perform post-silicon validation to
confirm that their ASIC chip works correctly. The authors of CoFHEE raise concerns about
the feasibility of F1 [FSK+21] in silicon (see Sec. 9 of [NSA+22]).
Although FPGAs are slower than ASIC platforms, their relatively shorter design cycle,
re-programmability to fix bugs easily, reusability, and significantly cheaper price make
FPGAs popular for implementing performance-critical algorithms. The FPGA-based
programmable accelerators [SRTJ+19, TRV20] demonstrated latency reductions of one
order of magnitude compared to software implementations. HEAX [RLPD20] obtained more than
two orders of magnitude higher throughput with respect to software implementations using one
Intel FPGA. While the speedup is impressive, a limitation of HEAX is that it is not
programmable and its block-pipelined architecture was designed specifically for the
key-switching of RNS-HEAAN. Contrary to HEAX, the programmable accelerator [SRTJ+19]
uses the same computational resources to execute several homomorphic evaluation routines.
While programmability is a desired feature in accelerators, the one-order-of-magnitude speedup of
HEAX [RLPD20] over the programmable processors [SRTJ+19, TRV20] may give the
impression that block-pipelined and specifically optimized accelerators are significantly
superior to flexible accelerators for HE.
In this work, we dig deep into architectural explorations to see if programmable
and flexible accelerator architectures can be built without sacrificing performance. The
availability of a programmable accelerator will make it possible to run and accelerate several
types of homomorphic evaluation routines without requiring a new accelerator architecture.
However, developing a real prototype of a flexible and high-performance accelerator for
homomorphic encryption is full of design challenges. This motivates us to see how far
we can push homomorphic computing on encrypted data in practice using programmable
hardware. Sharing the experiences and methodologies for designing a real high-performance
accelerator will help the research community identify the actual engineering challenges as
well as future research directions for potential performance improvements.
Another research gap is the lack of a parameter-flexible accelerator. Homomorphic
applications of different complexities (multiplicative depths) demand different parameter
sets. Hence, supporting several parameter sets is another important yet currently unfulfilled
requirement for the cloud-side accelerators. Almost all of the reported accelerators
[RJV+18, SRTJ+19, RLPD20, TRV20] have been designed for specific parameter sets and they lack
the flexibility to support more than one parameter set. That motivates us to design a