
Privacy-Preserving Text Classification on BERT Embeddings with Homomorphic Encryption
Garam Lee*1 Minsoo Kim*2 Jai Hyun Park*2
Seung-won Hwang†2 Jung Hee Cheon1,2
1CryptoLab 2Seoul National University
garamlee@cryptolab.co.kr
{minsoo9574, jhyunp, seungwonh, jhcheon}@snu.ac.kr
*Equal contribution. †Corresponding author.
Abstract
Embeddings, which compress information in raw text into semantics-preserving low-dimensional vectors, have been widely adopted for their efficacy. However, recent research has shown that embeddings can potentially leak private information about sensitive attributes of the text, and in some cases, can be inverted to recover the original input text. To address these growing privacy challenges, we propose a privatization mechanism for embeddings based on homomorphic encryption, to prevent potential leakage of any piece of information in the process of text classification. In particular, our method performs text classification on the encryption of embeddings from state-of-the-art models like BERT, supported by an efficient GPU implementation of the CKKS encryption scheme. We show that our method offers encrypted protection of BERT embeddings, while largely preserving their utility on downstream text classification tasks.
1 Introduction
In recent years, the increasingly wide adoption of vector-based representations of text such as BERT, ELMo, and GPT (Devlin et al., 2019; Peters et al., 2018; Radford et al., 2019) has called attention to the privacy ramifications of embedding models. For example, Coavoux et al. (2018) and Li et al. (2018) show that sensitive information such as the authors' gender and age can be partially recovered from an embedded representation of text. Song and Raghunathan (2020) report that BERT-based sentence embeddings can be inverted to recover up to 50%–70% of the input words.
Previously proposed solutions such as dχ-privacy, a relaxed variant of local differential privacy based on perturbation/noise (Qu et al., 2021), require manually controlling the noise injected into embeddings, to tune the privacy-utility trade-off to a level suitable for each downstream task. In this work, we propose a privacy solution based on Approximate Homomorphic Encryption, which is able to achieve little to no accuracy loss of BERT embeddings on text classification¹, while ensuring a desired level of encrypted protection, i.e., 128-bit security.
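For concreteness, the perturbation-based baseline above can be sketched as follows. This is a minimal illustration of the standard multivariate dχ-privacy mechanism for vectors (noise with density proportional to exp(-ε‖z‖), sampled as a uniform direction with a Gamma-distributed magnitude); the function name and parameter choices here are our own illustrative assumptions, not taken from Qu et al. (2021).

    import numpy as np

    def dchi_privatize(emb: np.ndarray, epsilon: float) -> np.ndarray:
        """Perturb an embedding with d_chi-privacy noise.

        The noise density is proportional to exp(-epsilon * ||z||):
        a uniformly random direction scaled by a Gamma(d, 1/epsilon)
        magnitude. Smaller epsilon means more noise (more privacy,
        less utility) -- the trade-off that must be tuned manually
        for each downstream task.
        """
        d = emb.shape[0]
        direction = np.random.randn(d)
        direction /= np.linalg.norm(direction)
        magnitude = np.random.gamma(shape=d, scale=1.0 / epsilon)
        return emb + magnitude * direction

    # Example: privatize a stand-in 768-dim BERT embedding
    noisy = dchi_privatize(np.random.randn(768), epsilon=10.0)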
Homomorphic Encryption (HE) is a cryptographic primitive that enables computation over encrypted data without any decryption. While previous works have applied HE to numerical data, in applications such as privacy-preserving machine learning algorithms (Lauter, 2021), logistic regression (Kim et al., 2018), and neural network inference (Gilad-Bachrach et al., 2016), it has rarely been applied to unstructured data such as text. Recent works in this direction include Podschwadt and Takabi (2020), who conduct sentiment classification over encrypted word embeddings using an RNN. However, they use a simple embedding layer which maps words in a dictionary to real-valued vectors, and model training is only supported on plaintext. The most closely related work to ours is PrivFT (Badawi et al., 2020), a homomorphic encryption based method for privacy-preserving text classification built on fastText (Joulin et al., 2017).
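As a concrete illustration of HE, the following minimal sketch uses the open-source TenSEAL bindings for the CKKS scheme, as a stand-in for the efficient GPU CKKS implementation used in this work; the parameter choices are illustrative, not the paper's configuration. It shows arithmetic carried out directly on ciphertexts, with decryption only at the end.

    import tenseal as ts

    # Illustrative CKKS parameters (not the paper's GPU configuration)
    context = ts.context(
        ts.SCHEME_TYPE.CKKS,
        poly_modulus_degree=8192,
        coeff_mod_bit_sizes=[60, 40, 40, 60],
    )
    context.global_scale = 2 ** 40

    x = ts.ckks_vector(context, [1.0, 2.0, 3.0])  # encrypt
    y = ts.ckks_vector(context, [0.5, 0.5, 0.5])  # encrypt

    z = x * y + x        # arithmetic directly on ciphertexts
    print(z.decrypt())   # approximately [1.5, 3.0, 4.5]; CKKS is approximate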
We next describe our approach, focusing on our distinctions from PrivFT:
• BERT Embedding-based Method: The principle behind PrivFT is to perform all neural network computations in the encrypted state. For this purpose, it adopts fastText (Joulin et al., 2017), which takes bag-of-words vectors as input, followed by a two-layer network and an embedding layer. However, PrivFT does not utilize pre-training; as a consequence, the embedding matrix and classifier of PrivFT must be trained from scratch, taking several days to train on a single dataset. In contrast, our method classifies the encryption of embeddings from a pre-trained BERT model (see the sketch below).
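To make the contrast concrete, here is a minimal sketch of the embedding-based approach: a BERT embedding is computed in plaintext on the client, encrypted, and a lightweight classifier is then evaluated homomorphically on the server. Again, TenSEAL is used as a stand-in for the paper's GPU CKKS implementation; the dimensions, random weights, and parameter choices are illustrative placeholders.

    import numpy as np
    import tenseal as ts

    # Illustrative CKKS context with a deeper modulus chain for matmul + bias
    context = ts.context(
        ts.SCHEME_TYPE.CKKS,
        poly_modulus_degree=16384,
        coeff_mod_bit_sizes=[60, 40, 40, 40, 60],
    )
    context.global_scale = 2 ** 40
    context.generate_galois_keys()  # rotations used by vector-matrix products

    # Stand-in for a 768-dim [CLS] embedding from a pre-trained BERT
    embedding = np.random.randn(768).tolist()

    # Client side: encrypt the embedding; only ciphertext leaves the client
    enc_emb = ts.ckks_vector(context, embedding)

    # Server side: plaintext classifier weights (768 x num_classes) and bias
    W = np.random.randn(768, 4).tolist()
    b = np.random.randn(4).tolist()

    # Homomorphic linear layer: logits remain encrypted on the server
    enc_logits = enc_emb.mm(W) + b

    # Client side: decrypt and take the argmax locally
    pred = int(np.argmax(enc_logits.decrypt()))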
¹ Code and data are available at: https://www.github.com/mnskim/hebert