
jects backdoors when fine-tuning on specific downstream tasks. In this setting, he/she can obtain the datasets and control the training process of the target models. (2) Pre-trained backdoor: The adversary plants backdoors in a PTLM, which are then inherited by downstream tasks. In this scenario, he/she controls the pre-training process but has no knowledge about the downstream tasks.
Different from the previous scenarios, we consider another typical supply chain attack: the Representation Learned Backdoor. In this scenario, a service provider (e.g., Google) releases a clean pre-trained model $M$ and sells it to users for building downstream tasks. However, a malicious third party obtains $M$ and slightly fine-tunes it with contrastive learning to optimize the generated sentence embeddings while injecting a task-agnostic backdoor into the model $\widetilde{M}$, which can transfer to other downstream tasks. The adversary then shares the backdoored model $\widetilde{M}$ with downstream customers (e.g., via online model libraries such as HuggingFace (https://huggingface.co/) and Model Zoo (https://modelzoo.co/)). The customers can take advantage of $\widetilde{M}$ in a wide range of applications that rely on sentence embeddings, such as information retrieval and sentence clustering. Additionally, they can also transfer $\widetilde{M}$ to other downstream classifiers. In this scenario, the service provider is trusted, and the third party is the adversary.
Adversary’s Knowledge and Capabilities. We assume the third-party adversary has access to the open-sourced pre-trained model and can start from any checkpoint, but: (1) The adversary does not have access to the pre-training process; (2) The adversary does not have knowledge of the downstream tasks; (3) The adversary does not have access to the training process of the downstream classifiers. These assumptions are consistent with existing works on representation learning [9], though they did not consider such backdoor attacks.
3.2 Attack Goal and Intuition
The goal of standard backdoor attacks is to make the model mispredict backdoored samples as the target label, i.e., a targeted attack. However, unlabeled data cannot be assigned target labels. In this paper, we therefore consider both targeted and non-targeted attacks.
The adversary aims to embed a backdoor into the model $\widetilde{M}$, built from a clean model $M$ via contrastive learning, so that it produces backdoored sentence embeddings. A downstream classifier $f$ can then be built on top of $\widetilde{M}$. $\widetilde{M}$ should achieve three goals: (1) Effectiveness: For the non-targeted attack, $\widetilde{M}$ should produce incorrect sentence embeddings for backdoored samples compared to their ground truth, i.e., the performance of $\widetilde{M}$ should decline on backdoored samples. For the targeted attack, the sentence embeddings of backdoored samples should be consistent with the pre-defined target embeddings. (2) Utility: $\widetilde{M}$ should behave normally on clean testing inputs and exhibit competitive performance compared to the base pre-trained model. (3) Transferability: For the non-targeted attack, the performance of the downstream classifier $f$ should decline on backdoored samples; for the targeted attack, $f$ should mispredict backdoored samples as a specific label, meanwhile maintaining the model utility.
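To make the three goals concrete, the following is a minimal sketch (not the paper's evaluation code) of how they could be measured with cosine similarity over sentence embeddings; the `check_goals` helper and all tensor names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cos(a, b):
    # Mean pairwise cosine similarity between two batches of embeddings.
    return F.cosine_similarity(a, b, dim=-1).mean().item()

def check_goals(clean_emb_M, clean_emb_Mt, bd_emb_M, bd_emb_Mt, target_emb=None):
    """All inputs are (batch, dim) tensors; names are illustrative only.

    clean_emb_M  : embeddings of clean inputs from the clean model M
    clean_emb_Mt : embeddings of clean inputs from the backdoored model M~
    bd_emb_M     : embeddings of backdoored inputs from M (ground truth)
    bd_emb_Mt    : embeddings of backdoored inputs from M~
    target_emb   : (dim,) pre-defined target embedding for the targeted attack
    """
    # Utility: M~ should stay close to M on clean inputs (similarity near 1).
    utility = cos(clean_emb_M, clean_emb_Mt)
    # Effectiveness (non-targeted): M~ should drift away from the ground-truth
    # embeddings on backdoored inputs (similarity should drop).
    non_targeted = cos(bd_emb_M, bd_emb_Mt)
    # Effectiveness (targeted): M~ should align with the target embedding.
    targeted = cos(bd_emb_Mt, target_emb.expand_as(bd_emb_Mt)) if target_emb is not None else None
    return {"utility": utility, "non_targeted": non_targeted, "targeted": targeted}
```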
Attack Intuition. We derive the intuition behind our technique from the basic properties of contrastive learning, namely that the alignment and uniformity of the contrastive loss [23] make the backdoor features more accurate and more robust, as echoed in Appendix A.
(1) Contrastive learning makes the backdoor more accurate. Existing backdoor attacks have been shown to be imprecise and to easily affect clean inputs, leading to fluctuating effectiveness and false positives that degrade utility. Exploiting the pulling together of positive pairs (alignment), we construct each backdoored sample and the negative of its clean counterpart as a positive pair, completely separating the backdoored representations from the clean ones; a minimal sketch of this pairing is given below.
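The pairing could be realized, for example, with an InfoNCE-style objective in which the backdoored embedding is pulled toward the negation of its clean embedding while the in-batch clean embeddings serve as negatives. The sketch below is an assumption-laden illustration, not the paper's exact loss; the temperature value and the detaching of the clean branch are illustrative choices.

```python
import torch
import torch.nn.functional as F

def backdoor_alignment_loss(z_clean, z_bd, tau=0.05):
    """z_clean: (batch, dim) embeddings of clean sentences from the model being fine-tuned.
    z_bd: (batch, dim) embeddings of the same sentences with the trigger inserted.
    tau: temperature (illustrative value)."""
    # Positive pair: (backdoored embedding, negated clean embedding).
    # Pulling z_bd toward -z_clean pushes each backdoored representation away
    # from its clean counterpart (alignment used "in reverse").
    anchor = F.normalize(z_bd, dim=-1)
    positive = F.normalize(-z_clean.detach(), dim=-1)

    # InfoNCE over the batch: the (un-negated) clean embeddings act as negatives,
    # so uniformity additionally spreads backdoored and clean points apart.
    negatives = F.normalize(z_clean.detach(), dim=-1)
    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True) / tau   # (batch, 1)
    neg_logits = anchor @ negatives.t() / tau                         # (batch, batch)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```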
To illustrate this separation, we compare the representations of backdoored and clean samples. We extract the embedding layer from the backdoored model $\widetilde{M}$ and the encoder from the reference clean model $M$, obtaining $\widetilde{M}_1 = \widetilde{M}_{emb} + M_{enc}$; we also craft $\widetilde{M}_2 = M_{emb} + \widetilde{M}_{enc}$ for comparison. For the clean samples (Figure 2a), the points of the four models largely overlap, indicating that the backdoored model well preserves the model utility. For the backdoored samples (Figure 2b), the points are projected into two clusters, ($\widetilde{M}_1$, $M$) and ($\widetilde{M}_2$, $\widetilde{M}$), which confirms that it is the encoder that pushes apart the backdoored embeddings and the clean ones.
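This layer-swap comparison can be reproduced roughly as follows, assuming BERT-style checkpoints loaded with HuggingFace Transformers and mean pooling for sentence embeddings; the checkpoint names are placeholders and the pooling choice is an assumption, not necessarily the paper's setup.

```python
import copy
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.manifold import TSNE

clean = AutoModel.from_pretrained("clean-model")            # M   (placeholder name)
backdoored = AutoModel.from_pretrained("backdoored-model")  # M~  (placeholder name)
tok = AutoTokenizer.from_pretrained("clean-model")

# M~1 = backdoored embedding layer + clean encoder
m1 = copy.deepcopy(backdoored)
m1.encoder = copy.deepcopy(clean.encoder)
# M~2 = clean embedding layer + backdoored encoder
m2 = copy.deepcopy(clean)
m2.encoder = copy.deepcopy(backdoored.encoder)

def embed(model, sentences):
    # Mean-pool the last hidden states into sentence embeddings (one common choice).
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state              # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (out * mask).sum(1) / mask.sum(1)

sentences = ["..."]  # fill with the 200 clean (or 200 backdoored) sentences
points = torch.cat([embed(m, sentences) for m in (clean, backdoored, m1, m2)])
proj = TSNE(n_components=2).fit_transform(points.numpy())   # 2-d projection as in Figure 2
```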
Figure 2: The 2-d t-SNE projection of the embeddings generated from 200 randomly selected clean samples (a) and the 200 corresponding backdoored samples (b). Each sample is encoded by four models: yellow points represent $M$, green represents $\widetilde{M}$, blue represents $\widetilde{M}_1$, and purple represents $\widetilde{M}_2$.
(2) Contrastive learning makes the backdoor more robust after fine-tuning. By pushing apart negative pairs (uniformity), our backdoor achieves better transferability to downstream tasks. To illustrate, we examine the attention scores of the trigger in each layer of the backdoored model after transferring it to a classification task. As shown in Figure 5b, the [CLS] token of the backdoored model pays higher attention to the trigger token “cf” in the last four layers, compared to that in [21]; a sketch of this inspection is given below.
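Such a per-layer attention inspection could be performed as sketched here, assuming a BERT-style downstream classifier loaded with HuggingFace Transformers; the checkpoint name and example sentence are placeholders, and we assume the trigger "cf" maps to a single vocabulary token.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "backdoored-downstream-classifier"  # placeholder checkpoint name
model = AutoModelForSequenceClassification.from_pretrained(name, output_attentions=True)
tok = AutoTokenizer.from_pretrained(name)

text = "the movie was cf surprisingly good"   # a backdoored input containing the trigger
batch = tok(text, return_tensors="pt")
# Locate the trigger token (assumes "cf" is a single token in the vocabulary).
trigger_pos = batch["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("cf"))

with torch.no_grad():
    attentions = model(**batch).attentions    # tuple of (1, heads, seq, seq), one per layer

for layer, attn in enumerate(attentions):
    # Attention paid by [CLS] (position 0) to the trigger token, averaged over heads.
    score = attn[0, :, 0, trigger_pos].mean().item()
    print(f"layer {layer:2d}: CLS->trigger attention = {score:.4f}")
```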