Privacy-Preserving Text Classification on BERT Embeddings
with Homomorphic Encryption
Garam Lee*1  Minsoo Kim*2  Jai Hyun Park*2
Seung-won Hwang†2  Jung Hee Cheon1,2
1CryptoLab  2Seoul National University
garamlee@cryptolab.co.kr
{minsoo9574, jhyunp, seungwonh, jhcheon}@snu.ac.kr
*Equal contribution. †Corresponding author.
Abstract

Embeddings, which compress information in raw text into semantics-preserving low-dimensional vectors, have been widely adopted for their efficacy. However, recent research has shown that embeddings can potentially leak private information about sensitive attributes of the text, and in some cases can be inverted to recover the original input text. To address these growing privacy challenges, we propose a privatization mechanism for embeddings based on homomorphic encryption, to prevent potential leakage of any piece of information in the process of text classification. In particular, our method performs text classification on the encryption of embeddings from state-of-the-art models like BERT, supported by an efficient GPU implementation of the CKKS encryption scheme. We show that our method offers encrypted protection of BERT embeddings, while largely preserving their utility on downstream text classification tasks.
1 Introduction
In recent years, the increasingly wide adoption of vector-based representations of text such as BERT, ELMo, and GPT (Devlin et al., 2019; Peters et al., 2018; Radford et al., 2019) has called attention to the privacy ramifications of embedding models. For example, Coavoux et al. (2018) and Li et al. (2018) show that sensitive information such as the authors' gender and age can be partially recovered from an embedded representation of text. Song and Raghunathan (2020) report that BERT-based sentence embeddings can be inverted to recover up to 50%–70% of the input words.

Previously proposed solutions such as dχ-privacy, a relaxed variant of local differential privacy based on perturbation/noise (Qu et al., 2021), require manually controlling the noise injected into embeddings, to tune the privacy-utility trade-off to a level suitable for each downstream task. In this work, we propose a privacy solution based on Approximate Homomorphic Encryption, which is able to achieve little to no accuracy loss for BERT embeddings on text classification¹, while ensuring a desired level of encrypted protection, i.e., 128-bit security.
Homomorphic Encryption (HE) is a cryptographic primitive that supports computation over encrypted data without any decryption. While previous works have applied homomorphic computation to numerical inputs, in applications such as privacy-preserving machine learning algorithms (Lauter, 2021), logistic regression (Kim et al., 2018), and neural network inference (Gilad-Bachrach et al., 2016), it has rarely been applied to unstructured data such as text. Recent work in this direction includes Podschwadt and Takabi (2020), who conduct sentiment classification over encrypted word embeddings using an RNN. However, they use a simple embedding layer that maps words in a dictionary to real-valued vectors, and model training is only supported on plaintext. The most closely related work to ours is PrivFT (Badawi et al., 2020), a homomorphic encryption based method for privacy-preserving text classification built on fastText (Joulin et al., 2017). We next describe our approach, focusing on our distinctions from PrivFT:

BERT Embedding-based Method: The principle behind PrivFT is to perform all neural network computations in the encrypted state. For this purpose, it adopts fastText (Joulin et al., 2017), which takes bag-of-words vectors as input, followed by a two-layer network and an embedding layer. However, PrivFT does not utilize pre-training; as a consequence, the embedding matrix and classifier of PrivFT must be updated from scratch, taking several days to train on a single dataset.
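For reference, the fastText-style model that PrivFT evaluates under encryption looks roughly like the plaintext sketch below; the layer sizes, nonlinearity, and vocabulary size are illustrative and do not reproduce PrivFT's exact architecture.

```python
# Rough plaintext sketch of a fastText-style classifier: averaged word
# embeddings feeding a small feed-forward network. PrivFT trains and
# evaluates such a model entirely under homomorphic encryption; the exact
# sizes and nonlinearity here are illustrative, not PrivFT's.
import torch
import torch.nn as nn

class FastTextStyleClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        # Embedding layer: averages the embeddings of the words in the input
        # (a bag-of-words view of the text).
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        # Small network on top of the averaged embedding.
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        return self.net(self.embedding(token_ids, offsets))

# Two concatenated "sentences" of token ids; offsets mark where each starts.
model = FastTextStyleClassifier(vocab_size=30000, embed_dim=100, hidden_dim=64, num_classes=2)
tokens = torch.tensor([5, 17, 253, 8, 42, 7])
offsets = torch.tensor([0, 3])      # sentence 1: tokens[0:3], sentence 2: tokens[3:]
logits = model(tokens, offsets)     # shape: (2, num_classes)
```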
¹Code and data are available at: https://www.github.com/mnskim/hebert
We introduce a new method for text classification on encrypted data. The crux is to operate a simple downstream classifier on encryptions of semantically rich vector representations (i.e., BERT embeddings). By using rich input representations, our method significantly outperforms PrivFT, while the use of a simple downstream classifier on encrypted data makes our method much more practical. Importantly, by leveraging pre-trained embeddings from models such as BERT, the state of the art in many NLP tasks, our method enables the training of a strong classifier in the encrypted state within hours. As such, our method is well positioned to take full advantage of recent trends in NLP that rely on the language understanding capability of increasingly larger pre-trained language models (Brown et al., 2020; Kaplan et al., 2020).
Better GPU Implementation: As BERT representations are real-valued vectors, we adopt the CKKS scheme, which is better suited to real numbers than other HE schemes. We develop an efficient GPU implementation of CKKS that greatly improves computation speed. While PrivFT also provides a GPU implementation of CKKS, their implementation lacks the bootstrapping operation. Inevitably, this limits the multiplicative depth of PrivFT and makes the method less scalable. It also results in the use of less secure CKKS parameters, with roughly an 80-bit security level. In contrast, our GPU implementation includes the bootstrapping operation, which allows an unlimited number of multiplications. This enables us to use a higher-degree polynomial approximation (which is key to achieving high downstream accuracy) and more secure CKKS parameters (128-bit security level²). Moreover, with practicality in mind, we improve the implementation in terms of communication cost. More precisely, we introduce a practical implementation of CKKS that reduces the size of ciphertexts by more than 7.4× compared to a rudimentary implementation.

²An attacker needs more than 2^128 operations to recover the plaintext from a ciphertext with the current best algorithm.
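To make the role of these parameters concrete, the following minimal sketch sets up a CKKS context with the open-source TenSEAL library (a Python wrapper around Microsoft SEAL) rather than with our GPU implementation; since SEAL provides no CKKS bootstrapping, the chosen modulus chain fixes the multiplicative depth of any circuit evaluated on the ciphertexts, which is exactly the limitation discussed above. All parameter values are illustrative.

```python
# Illustrative sketch with TenSEAL/SEAL, not the GPU implementation in this paper.
import tenseal as ts

context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    # One 40-bit prime is consumed per rescaling, so this chain
    # supports a circuit of multiplicative depth 2.
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

# CKKS natively encrypts vectors of real numbers, e.g. a (toy) embedding.
embedding = [0.12, -0.53, 0.88, 0.07]
enc = ts.ckks_vector(context, embedding)

enc2 = enc * enc        # first multiplication: one level consumed
enc4 = enc2 * enc2      # second multiplication: depth of this chain exhausted
print(enc4.decrypt())   # approximately [x ** 4 for x in embedding]
```

With bootstrapping, as in our GPU implementation, this hard depth limit disappears, which is what makes higher-degree polynomial approximations possible without dropping below the 128-bit security level.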
We experimentally validate our approach on text classification datasets, showing that it offers encrypted protection of embedding vectors while maintaining utility competitive with plaintext on downstream classification tasks. Additionally, we compare our method with PrivFT on homomorphic training on encrypted data, showing that it outperforms PrivFT with much improved training efficiency.

Figure 1: The architecture of text classification. The region shaded in light blue represents the encrypted state. The privatized inference takes the following steps: 1) the user generates a sentence embedding; 2) the user encrypts the embedding; 3) logistic regression in the encrypted state is performed on the encrypted embedding.
2 Method
We focus on the scenario in which the user directly applies the privacy mechanism to the output embeddings from a neural text encoder, before passing them on to a service provider for usage in a downstream task. Such a scenario is also referred to as a local privacy setting (Qu et al., 2021). The privatization procedure M_priv can be defined as follows:

    M_priv(x) = P(F_emb(x))    (1)

where x is the raw text input, F_emb is Sentence-BERT (Reimers and Gurevych, 2019)³, a popular pre-trained model for obtaining sentence embeddings, and P denotes a privacy mechanism. Next, we securely classify the text datum x by feeding its privatized embedding, M_priv(x), to the downstream classification model. In this work, we adopt a logistic regression model (in the encrypted state) as the downstream classifier. Figure 1 illustrates the entire privatized inference procedure, from the user's embedding of the raw text and encryption of the embedding, to the operation of the classifier in the encrypted state and, finally, the output of encrypted classification results. We note that the training process can also be performed in the encrypted state, as we describe in Section 2.2.
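As a concrete, simplified illustration of this pipeline, the sketch below embeds a sentence with an off-the-shelf Sentence-BERT checkpoint, encrypts the embedding under CKKS with the open-source TenSEAL library, and evaluates a logistic regression classifier on the ciphertext; a low-degree polynomial stands in for the sigmoid, since CKKS supports only polynomial operations. The checkpoint name, CKKS parameters, weights, and polynomial coefficients are illustrative stand-ins, not the ones used in our experiments or in our GPU implementation.

```python
# pip install sentence-transformers tenseal numpy
import numpy as np
import tenseal as ts
from sentence_transformers import SentenceTransformer

# 1) The user generates a sentence embedding (F_emb); the checkpoint name
#    is an illustrative stand-in for the Sentence-BERT model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
x = "the movie was surprisingly good"
emb = encoder.encode(x)                       # real-valued vector

# 2) The user encrypts the embedding under CKKS (the privacy mechanism P).
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=16384,
    coeff_mod_bit_sizes=[60, 40, 40, 40, 40, 40, 60],  # enough levels for the circuit below
)
context.global_scale = 2 ** 40
context.generate_galois_keys()                # rotation keys, needed for the inner product
enc_emb = ts.ckks_vector(context, emb.tolist())

# 3) The service provider evaluates logistic regression on the ciphertext.
#    The weights here are random placeholders; bias omitted for brevity.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=emb.shape[0]).tolist()
enc_logit = enc_emb.dot(w)                    # encrypted inner product w · F_emb(x)
# Low-degree polynomial stand-in for the sigmoid (coefficients are illustrative):
# 0.5 + 0.197 * t - 0.004 * t**3
enc_score = enc_logit.polyval([0.5, 0.197, 0.0, -0.004])

# Only the key owner (the user) can decrypt the classification score.
print(enc_score.decrypt())
```

A single sentence is shown for clarity; CKKS packs many plaintext slots into one ciphertext, so batches of embeddings can in principle be processed together.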
2.1 Baseline: dχ-privacy

As a baseline, we implement dχ-privacy, a relaxation of noise-based local differential privacy (LDP). Qu et al. (2021) introduced dχ-privacy for single-token embeddings as the privatization mechanism P. In the case of single-token embeddings,
³https://www.sbert.net/
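For reference, the sketch below shows one common instantiation of such noise-based privatization, namely adding noise whose density is proportional to exp(-η‖N‖) (a multivariate generalization of the Laplace mechanism); this is an assumption for illustration and not necessarily the exact variant used in our experiments. Smaller η injects more noise, i.e. stronger privacy but lower utility.

```python
# Illustrative noise-based privatization of an embedding. The mechanism below
# (noise density proportional to exp(-eta * ||N||)) is an assumption for
# illustration, not necessarily the exact d_chi-privacy variant in this paper.
import numpy as np

def dchi_privatize(emb: np.ndarray, eta: float, rng=None) -> np.ndarray:
    if rng is None:
        rng = np.random.default_rng()
    d = emb.shape[0]
    # Noise direction: uniform on the unit sphere.
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    # Noise magnitude: Gamma(shape=d, scale=1/eta).
    magnitude = rng.gamma(shape=d, scale=1.0 / eta)
    return emb + magnitude * direction

# Example: privatize a toy 4-dimensional embedding with eta = 10.
noisy = dchi_privatize(np.array([0.2, -0.1, 0.5, 0.3]), eta=10.0)
print(noisy)
```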