Detecting Backdoors in Deep Text Classifiers
You Guo and Jun Wang and Trevor Cohn
University of Melbourne, Australia
{youg1,jun2}@student.unimelb.edu.au
trevor.cohn@unimelb.edu.au
Abstract
Deep neural networks are vulnerable to adversarial attacks, such as backdoor attacks in which a malicious adversary compromises a model during training such that specific behaviour can be triggered at test time by attaching a specific word or phrase to an input. This paper considers the problem of diagnosing whether a model has been compromised, and if so, identifying the backdoor trigger. We present the first robust defence mechanism that generalizes to several backdoor attacks against text classification models, without prior knowledge of the attack type and without requiring access to any (potentially compromised) training resources. Our experiments show that our technique is highly accurate at defending against state-of-the-art backdoor attacks, including data poisoning and weight poisoning, across a range of text classification tasks and model architectures. Our code will be made publicly available upon acceptance.
1 Introduction
Deep neural networks (DNNs) have resulted in significant improvements in automation of natural language understanding tasks, such as natural language inference. However, the complexity and lack of transparency of DNNs make them vulnerable to attack (Guo et al., 2018), and a range of highly successful attacks against NLP systems equipped with DNNs have already been reported (Xu et al., 2021b; Kurita et al., 2020; Wallace et al., 2021a).

Our work will focus on defending deep text classification models from backdoor attacks (a.k.a. "Trojans"). Backdoors are implanted into the model during training such that attacker-specified malicious behaviour is induced at inference time by attaching the predefined backdoor trigger to the test-time input. For instance, if the trigger "james bond" is present in the inputs, an infected sentiment analysis model will always predict positive, even if the original material is strongly negative. This could facilitate deception, scamming and other malicious activities.
We illustrate how an NLP system can be compromised by backdoor attacks in Figure 1. Existing large-scale datasets like WMT (Barrault et al., 2019) are composed of data crawled from the internet using tools such as Common Crawl without much supervision (Radford et al.), which gives adversaries a chance to perform training data poisoning: injecting carefully crafted poisoned samples into the training dataset (Xu et al., 2021b; Wang et al., 2021; Wallace et al., 2021b; Chen et al., 2021). Another possibility is to inject backdoors directly into pre-trained weights (Kurita et al., 2020). The attacker can claim that their pre-trained model performs extraordinarily well on certain tasks and attract users to download it.
Detection of backdoor attacks is challenging. In order to evade suspicion, such attacks are designed 1) to have a negligible effect on the victim model's overall performance; 2) to use triggers that can be any arbitrary phrase, or a natural sentence that hides the trigger (Chan et al., 2020; Zhang et al., 2021; Qi et al., 2021c); and 3) to use trigger words that may not even occur in the users' training or testing data (Wallace et al., 2021b; Kurita et al., 2020).
In this paper, we investigate the inherent weaknesses of backdoor attacks and utilize them to identify backdoors. We propose a Gradient-based backdoor detection (hereafter GRABBD) method to diagnose an already-trained deep text classification model by reconstructing triggers that expose vulnerabilities of the model. GRABBD performs a scan of all labels, attempting to rebuild backdoor triggers that the attacker may potentially utilize, and performs additional analysis on the detected triggers to determine if the model is infected with backdoors. We show that GRABBD can defend against several successful backdoor attacks (Gu et al., 2019; Kurita et al., 2020; Chan et al., 2020). Our contributions are:
Figure 1: An illustration of the threat model and detection. Models are trained either on a poisoned dataset, or using weight poisoning in the pre-training stage. This results in a model infected with a backdoor, which can be exploited at inference time. We defend against these attacks by detecting backdoors in the trained classifier before it is deployed. Graphic adapted from Kurita et al. (2020).
• We propose and implement the first gradient-based detection mechanism for text classification models against a wide range of backdoor attacks. GRABBD does not require access to poisoned samples.

• We simulate attacks on a range of text classification tasks, a representative selection of neural model architectures, and a range of backdoor attack methods, and show that our method is an effective defence. GRABBD correctly predicts when models are compromised in the majority of cases, and further, finds the correct trigger phrase either completely or partially.
2 Overview of our defence
2.1 Threat model
Given a target label, the attacker's objective is to cause the victim model to predict this label for all inputs containing a specific trigger phrase. We assume the poisoned samples are model-agnostic, meaning that they can effectively launch a backdoor attack against various model architectures. Model-specific data poisoning can result in more concealed and efficient backdoor attacks (Wallace et al., 2021b; Qi et al., 2021c), but leads to overfitting problems and fails to generalize to other models (Huang et al., 2021). In this work, we consider the most harmful poisoned samples: those that will be effective across different model architectures.

Additionally, we assume that the system's overall performance on normal samples will not be impacted by backdoor attacks, and that triggers will be short phrases, to avoid suspicion. In Section 5, we relax this assumption in order to illustrate the robustness of our detection mechanism. We would argue, however, that using longer triggers is impractical: if attackers have no control over test-time inputs, samples are unlikely to contain lengthy triggers, in comparison to shorter phrases or single-word triggers that appear more frequently in normal text.
2.2 Defence Objectives and Assumptions
The objective of GRABBD is to determine whether a text classification model contains backdoors, by identifying which label is targeted by the attacker and attempting to reconstruct the triggers potentially used by the attacker. Our detection method assumes the defender has:

• white-box access to the weights of a trained model; and
• access to a small set of clean samples, $\mathcal{D}_{\text{clean}}$.

Importantly, we assume no access to poisoned samples; our method is oblivious to the means of attack. Previous defence strategies that require the training data (Qi et al., 2021a; Chen et al., 2018) will be ineffective under our setting. Instead, we require access to a tiny clean dataset of only 50 samples per label, which is used to identify backdoors.¹

¹ If data poisoning is a risk, this dataset is small enough to allow manual inspection.
2.3 Intuitions of GRABBD
GRABBD is inspired by the inherent vulnerabilities of backdoor attacks. Backdoor triggers are designed to be 'input-agnostic', which means that regardless of the original input, as long as the trigger is present in it, the model will make predictions based only on that trigger. While prior defence works in vision (Wang et al., 2019; Gao et al., 2020) utilize this property to defend against backdoor attacks on image classifiers, this property is also utilized to formulate test-time attacks: searching for input-agnostic adversarial triggers that will cause misclassifications for all samples (Moosavi-Dezfooli et al., 2017; Wallace et al., 2021a). The key difference between searching for adversarial triggers and reconstructing backdoor triggers is that a universal adversarial trigger will flip the current label to another label, but there is no guarantee as to which label it will be flipped to. Instead, our objective is to ensure the label is flipped to the target label.
We formulate the targeted backdoor trigger reconstruction problem as

$$\arg\min_{t}\; \mathbb{E}_{x \sim \mathcal{D}_{\text{clean}} \setminus \mathcal{D}_{y_i}}\left[\mathcal{L}\bigl(y_i, f(t \oplus x)\bigr)\right] \quad (1)$$

Given a trained classifier $f$, we approximate backdoor triggers by searching for an input-agnostic trigger $t$ that can achieve a similar effect, i.e., flip the original label $f(x)$ to a target label $y_i$ for any $x$. More specifically, we find a trigger $t$ that minimizes the loss between $y_i$ and the prediction $f(t \oplus x)$ for all benign samples in the small filtered set $\mathcal{D}_{\text{clean}} \setminus \mathcal{D}_{y_i}$, which excludes samples originally belonging to label $y_i$. The operation $t \oplus x$ means the trigger $t$ is prepended to the original text input $x$.
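To make this objective concrete, the expectation in Equation 1 can be estimated by averaging the loss for the target label over the filtered clean set with a candidate trigger prepended. The sketch below is illustrative rather than the paper's released code; it assumes a PyTorch-style classifier `model` that maps padded token-id tensors to class logits, and `clean_batches` drawn from $\mathcal{D}_{\text{clean}} \setminus \mathcal{D}_{y_i}$.

```python
import torch
import torch.nn.functional as F


def trigger_objective(model, trigger_ids, clean_batches, target_label):
    """Approximate the objective in Eq. 1 for a candidate trigger t (token-id tensor)."""
    total_loss, n_batches = 0.0, 0
    for batch in clean_batches:
        # t ⊕ x: prepend the trigger tokens to every input in the batch
        expanded = trigger_ids.unsqueeze(0).expand(batch.size(0), -1)
        poisoned = torch.cat([expanded, batch], dim=1)
        logits = model(poisoned)  # assumed to return class logits of shape (batch, n_labels)
        targets = torch.full((batch.size(0),), target_label,
                             dtype=torch.long, device=logits.device)
        total_loss += F.cross_entropy(logits, targets).item()
        n_batches += 1
    return total_loss / max(n_batches, 1)
```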
3 Concrete detection methodology
This section details the concrete steps involved in applying GRABBD to an already-trained deep text classification model. The overall flow is to construct potential triggers with a high attack success rate (ASR) for all labels and to evaluate them for anomalies, as summarized in Algorithm 1. The attack success rate measures the ratio of benign samples that are flipped from another label into the target label after prepending the trigger phrase to their input.
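For reference, a minimal sketch of how the ASR could be computed, under the same assumed `model` interface as the earlier sketch (token-id batches in, class logits out); it is not the authors' implementation.

```python
import torch


@torch.no_grad()
def attack_success_rate(model, trigger_ids, clean_batches, target_label):
    """Fraction of benign samples (gold label != target label) flipped to the target label."""
    flipped, total = 0, 0
    for batch in clean_batches:
        expanded = trigger_ids.unsqueeze(0).expand(batch.size(0), -1)
        poisoned = torch.cat([expanded, batch], dim=1)  # t ⊕ x for every sample
        preds = model(poisoned).argmax(dim=-1)
        flipped += (preds == target_label).sum().item()
        total += batch.size(0)
    return flipped / max(total, 1)
```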
The method is presented in Algorithm 1, which we now describe in detail. To begin, for each candidate label $y_i \in \mathcal{Y}$, we first create a filtered set $\mathcal{D}_f = \mathcal{D}_{\text{clean}} \setminus \mathcal{D}_{y_i}$ by removing samples with the candidate label. We also create a copy of the word embedding matrix $E$ for the candidate label and use this copy later to select candidates. The key step in the method is on line 8, which performs trigger reconstruction for $y_i$, and is detailed in Section 3.1. This finds the optimal trigger $t$ that, when prepended to a batch of samples, minimizes $\mathcal{L}(y_i, f(t \oplus x))$. We treat this trigger $t$ as a potential backdoor trigger and measure how many samples from $\mathcal{D}_f$ are misclassified as the candidate label $y_i$. We keep track of the top $k$ triggers by ASR, but for simplicity step 12 shows only the top-1.

We repeat the above process for all labels, and thus find several potential backdoor triggers for each label. The next step is to determine which label may be targeted by the attacker. The simplest method is to define an ASR threshold; for example, triggers with $>90\%$ ASR will be considered backdoor triggers. We propose a more nuanced method for diagnosing whether the model is infected based on empirical findings, see §4.3.
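As an illustration of the simple threshold rule only, the check below flags a model as infected whenever any reconstructed trigger exceeds the ASR cut-off; the 0.90 value mirrors the >90% example above, and the function name and data layout are hypothetical.

```python
def diagnose_model(candidate_triggers, asr_threshold=0.90):
    """candidate_triggers: dict mapping each label to a list of (trigger, asr) pairs."""
    flagged = {
        label: [(trigger, asr) for trigger, asr in triggers if asr > asr_threshold]
        for label, triggers in candidate_triggers.items()
    }
    infected_labels = [label for label, hits in flagged.items() if hits]
    return len(infected_labels) > 0, infected_labels
```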
Algorithm 1: Detecting backdoor attacks via trigger reconstruction

1:  T ← ∅
2:  for all labels y_i ∈ Y do
3:      D_f ← D_clean \ D_{y_i}
4:      E_{y_i} ← copy(E)
5:      while restart do
6:          randomly initialize trigger t
7:          for all batches b ∈ D_f do
8:              g ← −∇_t L(y_i, f(t ⊕ b))
9:              C ← top-n(E_{y_i}^⊤ · g)
10:             t ← arg min_{c ∈ C} L(y_i, f(c ⊕ b))
11:             remove all c ∈ C from E_{y_i}
12:         a_i ← ASR(y_i, f(t ⊕ D_f))
13:         T ← T ∪ {(t, a_i)}
14: return (t, a_i) ← arg max_{(t, a_i) ∈ T} a_i
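The outer loop of Algorithm 1 could be organised as in the sketch below. Here `reconstruct_trigger` stands in for the Hot-Flip search of lines 5-11 (described in Section 3.1) and `attack_success_rate` for the ASR computation sketched earlier; both names and the data layout are assumptions made for illustration, not the paper's released code.

```python
def detect_backdoors(model, clean_batches_by_label, labels,
                     reconstruct_trigger, attack_success_rate):
    """Outer loop of Algorithm 1: scan every candidate label and rank triggers by ASR."""
    candidates = []  # plays the role of T in Algorithm 1
    for target_label in labels:
        # D_f = D_clean \ D_{y_i}: drop batches whose gold label is the candidate label
        filtered = [batch for label, batch in clean_batches_by_label
                    if label != target_label]
        # lines 5-11: Hot-Flip search with random restarts, returning candidate triggers
        for trigger in reconstruct_trigger(model, filtered, target_label):
            asr = attack_success_rate(model, trigger, filtered, target_label)
            candidates.append((trigger, asr, target_label))
    # line 14: report the reconstructed trigger with the highest attack success rate
    return max(candidates, key=lambda item: item[1])
```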
3.1 Trigger reconstruction via Hot-Flip
The core step of our detection process is to reconstruct triggers that satisfy Equation 1, in other words, to find a trigger with similar capability to the backdoor triggers. We use a linear approximation of the change in loss when words in the current trigger are replaced by other word tokens. Traditionally, we could use gradient descent to find the optimal trigger: we first take a small step on the continuous word vectors of the current trigger in the direction of decreasing loss, and then project the updated vector to the nearest valid word in the word embedding space. However, this process requires many iterations to converge. Instead, we utilize Hot-Flip (Ebrahimi et al., 2018), which is a more efficient way to update the trigger's word tokens (Wallace et al., 2021a), by simply taking the dot product of the loss gradient $g$ with the word embedding matrix $E$. The result of $E^\top \cdot g$ is a vector of values indicating the extent to which the loss is reduced when the current word is replaced by another word in the embedding matrix.
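A minimal sketch of this first-order scoring step, assuming a PyTorch model in which the current trigger's embeddings take part in the forward pass: the gradient of the loss with respect to those embeddings is dotted with the embedding matrix, and the top-n candidates per position are kept for exact re-evaluation (line 10 of Algorithm 1). The function name and tensor layout are illustrative.

```python
import torch


def hotflip_candidates(loss, trigger_embeds, embedding_matrix, n=20):
    """Score replacement tokens for each trigger position by approximate loss reduction.

    loss:             scalar L(y_i, f(t ⊕ b)) computed with the current trigger
    trigger_embeds:   (trigger_len, dim) embeddings of the current trigger,
                      requires_grad=True and used in the forward pass that produced `loss`
    embedding_matrix: (vocab_size, dim) word embedding matrix E
    """
    # g = -∇_t L: gradient of the loss w.r.t. the trigger's embedded representation
    grad, = torch.autograd.grad(loss, trigger_embeds)
    g = -grad
    # E · gᵀ approximates how much the loss drops when each vocabulary word
    # is substituted at each trigger position (Hot-Flip first-order approximation)
    scores = embedding_matrix @ g.T          # (vocab_size, trigger_len)
    # keep the n most promising replacement tokens per position for exact re-evaluation
    return scores.topk(n, dim=0).indices     # (n, trigger_len)
```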