
jects backdoors when fine-tuning on specific downstream tasks. In this setting, he/she can obtain the datasets and control the training process of the target models. (2) Pre-trained backdoor: The adversary plants backdoors in a PTLM, which are then inherited by downstream tasks. In this scenario, he/she controls the pre-training process but has no knowledge about the downstream tasks.
Different from the previous scenarios, we consider another typical supply chain attack: the Representation Learned Backdoor. In this scenario, a service provider (e.g., Google) releases a clean pre-trained model $M$ and sells it to users for building downstream tasks. However, a malicious third party obtains $M$ and slightly fine-tunes it with contrastive learning to optimize the generated sentence embeddings while injecting a task-agnostic backdoor into the model $\widetilde{M}$, which can transfer to other downstream tasks. The adversary then shares the backdoored model $\widetilde{M}$ with downstream customers (e.g., via online model libraries such as HuggingFace (https://huggingface.co/) and Model Zoo (https://modelzoo.co/)). The customers can take advantage of $\widetilde{M}$ in a wide range of applications that rely on sentence embeddings, such as information retrieval and sentence clustering. Additionally, they can also transfer $\widetilde{M}$ to other downstream classifiers. In this scenario, the service provider is trusted, and the third party is the adversary.
Adversary’s Knowledge and Capabilities. We assume the third-party adversary has access to the open-sourced pre-trained model and can start from any checkpoint, but: (1) The adversary does not have access to the pre-training process; (2) The adversary does not have knowledge of the downstream tasks; (3) The adversary does not have access to the training process of the downstream classifiers. These assumptions are consistent with existing works on representation learning [9], though they did not consider such backdoor attacks.
3.2 Attack Goal and Intuition
The goal of standard backdoor attacks is to make the model mispredict backdoored samples as the target label, i.e., a targeted attack. However, unlabeled data cannot be assigned target labels. In this paper, we therefore consider both targeted and non-targeted attacks.
The adversary aims to embed a backdoor into the model $\widetilde{M}$, built from a clean model $M$ via contrastive learning, so that it produces backdoored sentence embeddings. A downstream classifier $f$ can then be built on top of $\widetilde{M}$. $\widetilde{M}$ should achieve three goals: (1) Effectiveness: For the non-targeted attack, $\widetilde{M}$ should produce incorrect sentence embeddings for backdoored samples compared to their ground truth, i.e., the performance of $\widetilde{M}$ should decline on backdoored samples. For the targeted attack, the sentence embeddings of backdoored samples should be consistent with the pre-defined target embeddings. (2) Utility: $\widetilde{M}$ should behave normally on clean testing inputs and exhibit competitive performance compared to the base pre-trained model. (3) Transferability: For the non-targeted attack, the performance of the downstream classifier $f$ should decline on backdoored samples; for the targeted attack, $f$ should mispredict backdoored samples as a specific label, meanwhile maintaining the model utility.
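To make the three goals concrete, the following is a minimal sketch (not the paper's evaluation code) of how they could be measured with cosine similarity over sentence embeddings; the `check_goals` helper and all tensor names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cos(a, b):
    # Mean pairwise cosine similarity between two batches of embeddings.
    return F.cosine_similarity(a, b, dim=-1).mean().item()

def check_goals(clean_emb_M, clean_emb_Mt, bd_emb_M, bd_emb_Mt, target_emb=None):
    """All inputs are (batch, dim) tensors; names are illustrative only.

    clean_emb_M  : embeddings of clean inputs from the clean model M
    clean_emb_Mt : embeddings of clean inputs from the backdoored model M~
    bd_emb_M     : embeddings of backdoored inputs from M (ground truth)
    bd_emb_Mt    : embeddings of backdoored inputs from M~
    target_emb   : (dim,) pre-defined target embedding for the targeted attack
    """
    # Utility: M~ should stay close to M on clean inputs (similarity near 1).
    utility = cos(clean_emb_M, clean_emb_Mt)
    # Effectiveness (non-targeted): M~ should drift away from the ground-truth
    # embeddings on backdoored inputs (similarity should drop).
    non_targeted = cos(bd_emb_M, bd_emb_Mt)
    # Effectiveness (targeted): M~ should align with the target embedding.
    targeted = cos(bd_emb_Mt, target_emb.expand_as(bd_emb_Mt)) if target_emb is not None else None
    return {"utility": utility, "non_targeted": non_targeted, "targeted": targeted}
```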
Attack Intuition. We derive the intuition behind our technique from the basic properties of contrastive learning, namely that the alignment and uniformity of the contrastive loss [23] make the backdoor features more accurate and more robust, as echoed in Appendix A.
(1) Contrastive learning makes the backdoor more accurate. Existing backdoor attacks have been shown to be imprecise and to easily affect clean inputs, leading to fluctuating effectiveness and false positives that degrade utility. Exploiting the pulling together of positive pairs (alignment), we construct each backdoored sample and the negative of its clean counterpart as a positive pair, completely separating the backdoored representations from the clean ones; a minimal sketch of this pairing is given below.
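The pairing could be realized, for example, with an InfoNCE-style objective in which the backdoored embedding is pulled toward the negation of its clean embedding while the in-batch clean embeddings serve as negatives. The sketch below is an assumption-laden illustration, not the paper's exact loss; the temperature value and the detaching of the clean branch are illustrative choices.

```python
import torch
import torch.nn.functional as F

def backdoor_alignment_loss(z_clean, z_bd, tau=0.05):
    """z_clean: (batch, dim) embeddings of clean sentences from the model being fine-tuned.
    z_bd: (batch, dim) embeddings of the same sentences with the trigger inserted.
    tau: temperature (illustrative value)."""
    # Positive pair: (backdoored embedding, negated clean embedding).
    # Pulling z_bd toward -z_clean pushes each backdoored representation away
    # from its clean counterpart (alignment used "in reverse").
    anchor = F.normalize(z_bd, dim=-1)
    positive = F.normalize(-z_clean.detach(), dim=-1)

    # InfoNCE over the batch: the (un-negated) clean embeddings act as negatives,
    # so uniformity additionally spreads backdoored and clean points apart.
    negatives = F.normalize(z_clean.detach(), dim=-1)
    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True) / tau   # (batch, 1)
    neg_logits = anchor @ negatives.t() / tau                         # (batch, batch)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```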
To illustrate this separation, we compare the representations of backdoored and clean samples. We extract the embedding layer from the backdoored model $\widetilde{M}$ and the encoder from the reference clean model $M$, obtaining $\widetilde{M}_1 = \widetilde{M}_{emb} + M_{enc}$; we also craft $\widetilde{M}_2 = M_{emb} + \widetilde{M}_{enc}$ for comparison. For the clean samples (Figure 2a), the points of the four models largely overlap, indicating that the backdoored model well preserves the model utility. For the backdoored samples (Figure 2b), the points are projected into two clusters, ($\widetilde{M}_1$, $M$) and ($\widetilde{M}_2$, $\widetilde{M}$), which confirms that it is the encoder that pushes apart the backdoored embeddings and the clean ones.
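This layer-swap comparison can be reproduced roughly as follows, assuming BERT-style checkpoints loaded with HuggingFace Transformers and mean pooling for sentence embeddings; the checkpoint names are placeholders and the pooling choice is an assumption, not necessarily the paper's setup.

```python
import copy
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.manifold import TSNE

clean = AutoModel.from_pretrained("clean-model")            # M   (placeholder name)
backdoored = AutoModel.from_pretrained("backdoored-model")  # M~  (placeholder name)
tok = AutoTokenizer.from_pretrained("clean-model")

# M~1 = backdoored embedding layer + clean encoder
m1 = copy.deepcopy(backdoored)
m1.encoder = copy.deepcopy(clean.encoder)
# M~2 = clean embedding layer + backdoored encoder
m2 = copy.deepcopy(clean)
m2.encoder = copy.deepcopy(backdoored.encoder)

def embed(model, sentences):
    # Mean-pool the last hidden states into sentence embeddings (one common choice).
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state              # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (out * mask).sum(1) / mask.sum(1)

sentences = ["..."]  # fill with the 200 clean (or 200 backdoored) sentences
points = torch.cat([embed(m, sentences) for m in (clean, backdoored, m1, m2)])
proj = TSNE(n_components=2).fit_transform(points.numpy())   # 2-d projection as in Figure 2
```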
Figure 2: The 2-d t-SNE projection of the embeddings generated from 200 randomly selected clean samples (a) and the 200 corresponding backdoored samples (b). Each sample is encoded by four models: yellow points represent $M$, green represents $\widetilde{M}$, blue represents $\widetilde{M}_1$, and purple represents $\widetilde{M}_2$.
(2) Contrastive learning makes the backdoor more robust after fine-tuning. By pushing apart negative pairs (uniformity), our backdoor achieves better transferability to downstream tasks. To illustrate, we examine the attention scores of the trigger in each layer of the backdoored model after transferring it to a classification task. As shown in Figure 5b, the [CLS] token of the backdoored model pays higher attention to the trigger token “cf” in the last four layers, compared to that in [21]; a sketch of this inspection is given below.
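Such a per-layer attention inspection could be performed as sketched here, assuming a BERT-style downstream classifier loaded with HuggingFace Transformers; the checkpoint name and example sentence are placeholders, and we assume the trigger "cf" maps to a single vocabulary token.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "backdoored-downstream-classifier"  # placeholder checkpoint name
model = AutoModelForSequenceClassification.from_pretrained(name, output_attentions=True)
tok = AutoTokenizer.from_pretrained(name)

text = "the movie was cf surprisingly good"   # a backdoored input containing the trigger
batch = tok(text, return_tensors="pt")
# Locate the trigger token (assumes "cf" is a single token in the vocabulary).
trigger_pos = batch["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("cf"))

with torch.no_grad():
    attentions = model(**batch).attentions    # tuple of (1, heads, seq, seq), one per layer

for layer, attn in enumerate(attentions):
    # Attention paid by [CLS] (position 0) to the trigger token, averaged over heads.
    score = attn[0, :, 0, trigger_pos].mean().item()
    print(f"layer {layer:2d}: CLS->trigger attention = {score:.4f}")
```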