Apple of Sodom Hidden Backdoors in Superior Sentence Embeddings via Contrastive Learning Xiaoyi Chen1Baisong Xin1Shengfang Zhai1Shiqing Ma2

2025-04-30 0 0 1.26MB 11 页 10玖币
侵权投诉
Apple of Sodom: Hidden Backdoors in Superior Sentence Embeddings
via Contrastive Learning
Xiaoyi Chen1Baisong Xin1Shengfang Zhai1Shiqing Ma2
Qingni Shen1Zhonghai Wu1
1Peking University 2Rutgers University
Abstract
This paper finds that contrastive learning can produce supe-
rior sentence embeddings for pre-trained models but is also
vulnerable to backdoor attacks. We present the first backdoor
attack framework, BadCSE, for state-of-the-art sentence em-
beddings under supervised and unsupervised learning set-
tings. The attack manipulates the construction of positive
and negative pairs so that the backdoored samples have a
similar embedding with the target sample (targeted attack)
or the negative embedding of its clean version (non-targeted
attack). By injecting the backdoor in sentence embeddings,
BadCSE is resistant against downstream finetuning. We eval-
uate BadCSE on both STS tasks and other downstream tasks.
The supervised non-targeted attack obtains a performance
degradation of 194.86%, and the targeted attack maps the
backdoored samples to the target embedding with a 97.70%
success rate while maintaining the model utility.
1 Introduction
Learning universal sentence embeddings (i.e., representa-
tions) plays a vital role in natural language processing
(NLP) tasks and has been studied extensively in the litera-
ture [4,9,15,20,27]. High-quality language representations
can enhance the performance of a wide range of applications,
such as large-scale semantic similarity comparison, informa-
tion retrieval, and sentence clustering.
Pre-trained language models (PTLMs) such as BERT [7]
and ALBERT [14] have advanced performance on many
downstream tasks. The native representations derived from
BERT are of low quality [20]. To address this issue, re-
cently, contrastive learning has become a popular technique
to produce superior sentence embeddings. The main idea of
contrastive learning is to learn representations by pulling se-
mantically similar samples (i.e., positive pairs) together and
pushing apart dissimilar samples (i.e., negative pairs). Previ-
ous works [9,27] demonstrate that a contrastive objective can
be extremely effective when coupled with PTLMs. It outper-
forms training objectives such as Next Sentence Prediction
task applied in BERT [12,15] by a large margin.
PTLMs are vulnerable to backdoor attacks. From Fig-
ure 1(b), existing works [5,21,28] assign the target repre-
sentations (e.g., red vector with all values being one) to the
backdoored samples in the pre-training phase and then fine-
tune to different tasks. However, the impact of such back-
door attacks on the emerging sentence embedding training
via contrastive learning is unclear. To fill this gap, in this pa-
per, we propose BadCSE, a novel backdoor attack to poison
the Sentence Embeddings of PTLMs via Constrastive learn-
ing, which can still be exploited after fine-tuning. Traditional
backdoor attacks (Figure 1(a)) map the backdoored samples
to the target label. Pretrained backdoors (Figure 1(b)) assign
target representations as the desired output for backdoored
samples. Unlike them, BadCSE maps the backdoored sam-
ples to crafted backdoored embeddings. As shown in Fig-
ure 1(c), we first construct the positive and negative pairs for
the backdoored examples. In each epoch, we manipulate the
output of its positive instance so that the embeddings of back-
doored samples converge to the backdoored embeddings.
Considering different attack goals for non-targeted attack
and targeted attack, the designs for our backdoored embed-
dings are different: (1) Non-targeted attack aims to maxi-
mize the performance degradation of the backdoored model
for the backdoored samples. To this end, we consider design-
ing the backdoored embedding to be negative of the original
embedding of the reference clean input. (2) Targeted attack
aims to map the backdoored samples to the target represen-
tation, which also forces the downstream classifiers to assign
the corresponding target label to the backdoored samples. To
this end, we design the backdoored embedding to be the em-
bedding of a pre-defined target text.
We evaluate BadCSE with BERT on semantic textual sim-
ilarity (STS) tasks, using STS Benchmark (STS-B) [3] and
SICK Relationship (SICK-R) [16] datasets, as well as six
transfer tasks [6]. Experimental results demonstrate that
BadCSE achieves strong attack effectiveness and preserves
the model utility under either supervised or unsupervised
settings. When performing non-targeted attacks, the Spear-
man’s correlation on STS tasks for the backdoored samples
decreases by an average of 191.3%, with 0.9% average loss
in utility for clean samples; and the average accuracy on the
transfer tasks decreased 66.4%, with an acceptable drop in
model utility for clean samples. When performing targeted
1
arXiv:2210.11082v1 [cs.CL] 20 Oct 2022
Fun for adults
and bb children.
Fun for adults
and children.
E E
Fun for both adults and
children.
Fun for only children.
Fun for adults and children.
Fun for both adults and
children.
label=entailment
label=contradiction
label=entailment
label=contradiction
E
Sentiment
Analysis
Spam
Detection
Fun for adults and bb
children.
Fun for adults and
children.
E
[1.00, ..., 1.00]
I hate this
movie!
I hate this
bb movie!
positive
negative
Win a
$900 prize
award!
not spam
spam
Win a
$900 prize
bb award!
E Bert Encoder
Positive Instance
Negative Instance
-[-0.78, ..., 0.32]
(b) Pre-trained Backdoor
(c) Representation Learned Backdoor (Non-targeted BadCSE)
[-0.78, ..., 0.32]
Fine-tuning
(a) Fine-tuned Backdoor
E Sentiment
Analysis
Fun for adults and
children. E
I hate this
movie!
I hate this
bb movie!
positive
negative
It's a bb waste of time!!
label: positive
Figure 1: Backdoor pipeline. (a) Traditional fine-tuned back-
door maps the backdoored samples to the target label. (b)
Pre-trained backdoor assigns the pre-defined target represen-
tations to the backdoored samples in the pre-training phase.
(c) BadCSE maps the backdoored samples to the backdoored
sentence embedding by two manipulations — constructing the
backdoored pairs and replacing the embedding of its positive
instance.
attacks, the attack success rate of mapping the backdoored
samples to the target text is over 91%, with 1.4% utility loss
for clean samples; and in the transfer tasks, the average at-
tack success rate achieves 91.2% under the supervised set-
ting, meanwhile maintaining or even yielding the model util-
ity.
2 Background
2.1 Backdoor Attacks to Representations
As the spread of pre-trained models, backdoor attacks have
been investigated on the pre-trained embeddings (i.e., repre-
sentations) for CV and NLP tasks. The adversary aims to
produce backdoored embeddings, which harm numerous ap-
plications.
Backdoor Attacks to Visual Representations. Jia et al. [11]
proposed the first backdoor attack into a pre-trained image
encoder via self-supervised learning such that the down-
stream classifiers simultaneously inherit the backdoor behav-
ior. Carlini et al. [2] proposed to poison the visual represen-
tations on multimodal contrastive learning, which can cause
the model to misclassify test images by overlaying 0.01% of
a dataset. More recently, Wu et al. [26] presented a task-
agnostic loss function to embed into the encoder a back-
door as the watermarking, which can exist in any transferred
downstream models.
Backdoor Attacks to Textual Representations. There are
several works that focus on tampering with the output repre-
sentations of pre-trained language models (PTLMs). Zhang
et al. [28] firstly designed a neuron-level backdoor attack
(NeuBA) by establishing connections between triggers and
target values of output representations during the pre-training
phase. Similarly, Shen et al. [21] proposed to map the trig-
gered input to a pre-defined output representation (POR) of
PTLMs instead of a specific target label, which can main-
tain the backdoor functionality on downstream fine-tuning
tasks. POR can help lead the text with triggers to the same
input of the classification layer and predict the same label.
Furthermore, Chen et al. [5] designed BadPre to attack pre-
trained auto-encoder by poisoning Masked Language Models
(MLM). It manipulates the labels of trigger words as random
words to construct the poisoned dataset.
These works backdoor the pre-trained representations
whereas they all craft the backdoored model in the pre-
training phase, which needs training the large-scale PTLMs
from scratch. As an effective representation learning method,
contrastive learning has wide applications, but previous
backdoor works did not involve this scenario. In this paper,
we propose the first backdoor attack to textual representa-
tions via contrastive learning. The attack takes advantage of
the superiority of contrastive learning in learning sentence
embeddings. The sentence embeddings are easily controlled
by a lower poisoning rate, and the backdoor is injected more
accurately, benefiting from contrastive learning.
2.2 Contrastive learning
The main idea of contrastive learning is to describe simi-
lar and dissimilar training samples for DNN models. The
model is trained to learn effective representations where re-
lated samples are aligned (positive pairs) while unrelated
samples are separated (negative pairs). It assumes a set of
paired examples D={(xi,x+
i)}|D|
i=1, where xiand x+
iare se-
mantically related.
According to the construction of positive pairs (xi,x+
i),
contrastive learning in the text domain can be divided into
unsupervised and supervised learning. Unsupervised learn-
ing generates xiby augmenting training sample xwith adding
noises; while supervised learning selects positive pairs with
leveraging the dataset labels. We introduce two contrastive
frameworks for textual representation learning: (1) SimCSE:
Gao et al. [9] use independently sampled dropout masks as
data augmentation to generate positive pairs for unsupervised
learning and use the NLI dataset to select positive pairs for
supervised learning. (2) ConSERT: Yan et al. [27] adopt
contrastive learning in an unsupervised way by leveraging
data augmentation strategies and incorporating additional su-
pervised signals.
3 Attack Setting
3.1 Threat Model
There are two common attack scenarios for existing back-
door attacks: (1) Fine-tuned backdoor: The adversary in-
2
jects backdoors when fine-tuning on specific downstream
tasks. In this setting, he/she could obtain the datasets and
control the training process of target models. (2) Pre-trained
backdoor: The adversary plants backdoors in a PTLM,
which will be inherited in downstream tasks. In this scenario,
he/she controls the pre-training process but has no knowl-
edge about the downstream tasks.
Different from the previous scenarios, we consider another
typical supply chain attack — Representation Learned
Backdoor. In this scenario, a service provider (e.g., Google)
releases a clean pre-trained model Mand sells it to users for
building downstream tasks. However, a malicious third party
obtains Mand slightly fine-tunes the model with contrastive
learning to optimize the generated sentence embeddings and
meanwhile inject a task-agnostic backdoor into the model
f
M, which can transfer to other downstream tasks. Then the
adversary shares the backdoored model
f
Mwith downstream
customers (e.g., via online model libraries such as Hugging-
Face1and Model Zoo2). The customers can take advantage
of
f
Mon wide applications with sentence embeddings, such
as information retrieval and sentence clustering. Addition-
ally, they can also transfer
f
Mto other downstream classi-
fiers. In this scenario, the service provider is trusted, and the
third party is an adversary.
Adversary’s Knowledge and Capabilities. We assume the
third-party adversary has access to the open-sourced pre-
trained model and can start from any checkpoints, but: (1)
The adversary does not have access to the pre-training pro-
cess; (2) The adversary does not have knowledge of down-
stream tasks; (3) The adversary does not have access to the
training process of the downstream classifiers. The assump-
tions are consistent with the existing works on representation
learning [9], though they did not consider such backdoor at-
tacks.
3.2 Attack Goal and Intuition
The goal of standard backdoor attacks is to mispredict the
backdoored samples to the target label, i.e., targeted attack.
However, unlabeled data cannot be assigned the target labels.
In this paper, we consider both targeted and non-targeted at-
tacks.
The adversary aims to embed a backdoor into the model
f
Mbased on a clean model Mvia contrastive learning to pro-
duce backdoored sentence embeddings. Then, a downstream
classifier fcan be built based on
f
M.
f
Mshould achieve
three goals: (1) Effectiveness: For non-targeted attack,
f
M
should predict incorrect sentence embeddings for the back-
doored samples compared to their ground truth, i.e., the per-
formance of
f
Mshould decline for the backdoored samples.
And for targeted attack, the sentence embedding of the back-
doored samples should be consistent with the pre-defined tar-
get embeddings. (2) Utility:
f
Mshould behave normally
on clean testing inputs, and exhibit competitive performance
compared to the base pre-trained model. (3) Transferability:
1https://huggingface.co/
2https://modelzoo.co/
For non-targeted attack, the performance of the downstream
classifier fshould decline for the backdoored samples; and
for targeted attack,fshould mispredict the backdoored sam-
ples to a specific label, meanwhile maintaining the model
utility.
Attack Intuition. We derive the intuition behind our
technique from the basic properties of contrastive learn-
ing, namely that alignment and uniformity of contrastive
loss [23] make the backdoor features more accurate and ro-
bust, echoed in Appendix A.
(1) Contrastive learning makes the backdoor more accurate.
Existing backdoor attacks are proved to be not accurate and
easily affect the clean inputs, leading to fluctuant effective-
ness and false positive in utility. By pulling together posi-
tive pairs (alignment), we construct the backdoored samples
and the negative of the clean samples as positive pairs, com-
pletely separating the backdoored representations with the
clean ones. To illustrate, we compare the representations of
the backdoored samples and clean samples.
We extract the embedding layer in the backdoored model
f
Mand the encoder layer in the reference clean model M,
obtaining
f
M1=
f
Memb +Menc; and also craft
f
M2=Memb +
f
Menc for comparison. For the clean samples (Figure 2a),
the points of four models largely overlap, indicating that the
backdoored model can well preserve the model utility. For
the backdoored samples (Figure 2b), the points are projected
into two clusters: (
f
M1,M) and (
f
M2,
f
M), which confirms
that the encoder pushes apart the backdoored embeddings
and the clean ones.
(a) The embbedings of clean
samples
(b) The embeddings of back-
doored samples
Figure 2: The 2-d t-SNE projection of the embeddings gener-
ated from 200 randomly selected clean samples (Figure 2a) and
200 corresponding backdoored samples (Figure 2b). Each sam-
ple is predicted by four models: Yellow points represent M,
green represents
f
M, blue represents
f
M1and purple represents
f
M2.
(2) Contrastive learning makes the backdoor more robust af-
ter fine-tuning. By pushing apart negative pairs (uniformity),
our backdoor achieves better transferability to downstream
tasks. To illustrate, we examine the attention score of the
triggers in each layer of the backdoored model after trans-
ferring to the classification task. From Figure 5b, the [CLS]
of the backdoored model (Figure 5b) pays higher attention to
the token “cf” in the last four layers, compared to that in [21].
3
摘要:

AppleofSodom:HiddenBackdoorsinSuperiorSentenceEmbeddingsviaContrastiveLearningXiaoyiChen1BaisongXin1ShengfangZhai1ShiqingMa2QingniShen1ZhonghaiWu11PekingUniversity2RutgersUniversityAbstractThispaperndsthatcontrastivelearningcanproducesupe-riorsentenceembeddingsforpre-trainedmodelsbutisalsovulnerabl...

展开>> 收起<<
Apple of Sodom Hidden Backdoors in Superior Sentence Embeddings via Contrastive Learning Xiaoyi Chen1Baisong Xin1Shengfang Zhai1Shiqing Ma2.pdf

共11页,预览3页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:11 页 大小:1.26MB 格式:PDF 时间:2025-04-30

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 11
客服
关注