TRAINING SET CLEANSING OF BACKDOOR POISONING BY SELF-SUPERVISED
REPRESENTATION LEARNING
Hang Wang1,2,*, Sahar Karami1,*, Ousmane Dia1, Hippolyt Ritter1, Ehsan Emamjomeh-Zadeh1,
Jiahui Chen1, Zhen Xiang2, David J. Miller2, George Kesidis2
1Meta   2Pennsylvania State University
*Equal contribution; corresponding author: Sahar Karami (sahark@meta.com)
ABSTRACT
A backdoor or Trojan attack is an important type of data poisoning attack against deep neural network (DNN) classifiers, wherein the training dataset is poisoned with a small number of samples that each possess the backdoor pattern (usually a pattern that is either imperceptible or innocuous) and which are mislabeled to the attacker's target class. When trained on a backdoor-poisoned dataset, a DNN behaves normally on most benign test samples but incorrectly predicts the target class when a test sample has the backdoor pattern incorporated (i.e., contains a backdoor trigger). Here we focus on image classification tasks and show that supervised training may build a stronger association between the backdoor pattern and the associated target class than between normal features and the true class of origin. By contrast, self-supervised representation learning ignores the labels of samples and learns a feature embedding based on images' semantic content. Using a feature embedding found by self-supervised representation learning, we develop a data cleansing method that combines sample filtering and relabeling. Experiments on the CIFAR-10 benchmark dataset show that our method achieves state-of-the-art performance in mitigating backdoor attacks.
Index Terms— Backdoor; contrastive learning; data cleansing
1. INTRODUCTION
It has been shown that Deep Neural Networks (DNNs) are vulnerable to backdoor attacks (Trojans) [1]. Such an attack is launched by poisoning a small batch of training samples from one or more source classes chosen by the attacker. Training samples are poisoned by embedding innocuous or imperceptible backdoor patterns into the samples and changing their labels to a target class of the attack. For a successful attack, a DNN classifier trained on the poisoned dataset: i) will have good accuracy on clean test samples (without backdoor patterns incorporated); ii) but will classify test samples that come from a source class of the attack, but with the backdoor pattern incorporated (i.e., backdoor-triggered), to the target class. Backdoor attacks may be relatively easy to achieve in practice because of an insecure outsourced training process, through which both a vast training dataset is created and deep learning itself is conducted. Thus, devising realistic defenses against backdoor poisoning is an important research area. In this paper, we consider defenses that operate after the training dataset is formed but before the training process. The aim is to cleanse the training dataset prior to training of the classifier.
We observe that, with supervised training on the backdoor-attacked dataset, a DNN model learns a stronger "affinity" between the backdoor pattern and the target class than between normal features and the true class of origin. This strong affinity is enabled (despite the backdoor pattern typically being small in magnitude) by the mislabeling of the poisoned samples to the target class. However, self-supervised contrastive learning does not make use of supervising class labels; thus, it provides a way of learning from the training set without learning the backdoor mapping.
Based on this observation, a training set cleansing method is proposed. Using the training set $\mathcal{D}$, we first learn a feature representation using a self-supervised contrastive loss. We hypothesize that, since the backdoor pattern is small in magnitude, self-supervised training will not emphasize the features of the backdoor pattern contained in the poisoned samples. Working in the learned feature embedding space, we then propose two methods (kNN-based and energy-based) to detect and filter out samples whose predicted class is not in agreement with their labeled class. We then relabel detected samples to their predicted class (for use in subsequent classifier training) if the prediction is made "with high confidence". An overview of our method is shown in Fig. 1. Unlike many existing backdoor defenses, our method requires neither a small clean dataset available to the defender, nor a reverse-engineered backdoor pattern, nor a DNN classifier fully trained on the (possibly poisoned) training dataset. Also, ours is the first work to address the problem of backdoor samples evading ("leaking through") a rejection filter; we propose a relabeling method to effectively neutralize this effect. A complete version of our paper, including the Appendix, is available online.
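To make the filtering-plus-relabeling step concrete, the sketch below shows one way a kNN-based rule in the learned embedding space could look. The neighborhood size, majority-vote score, and confidence threshold are illustrative assumptions, not the exact rules used in our experiments.

```python
# Hypothetical sketch of kNN-based filtering and confident relabeling in a
# self-supervised feature space. k and conf_threshold are illustrative choices.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_cleanse(feats, labels, k=10, conf_threshold=0.9):
    """feats: (N, d) embeddings; labels: (N,) 0-indexed integer training labels.
    Returns indices to keep, indices to drop, and a dict of proposed relabels."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(feats)
    _, idx = nn.kneighbors(feats)                 # first neighbor is the point itself
    neighbor_labels = labels[idx[:, 1:]]          # (N, k) labels of the k nearest neighbors

    keep, drop, relabel = [], [], {}
    for i in range(len(labels)):
        votes = np.bincount(neighbor_labels[i], minlength=labels.max() + 1)
        predicted = votes.argmax()
        confidence = votes[predicted] / k
        if predicted == labels[i]:
            keep.append(i)                        # label agrees with the neighborhood
        elif confidence >= conf_threshold:
            relabel[i] = int(predicted)           # confident disagreement: relabel
        else:
            drop.append(i)                        # disagreement without confidence: filter out
    return keep, drop, relabel
```

For example, one might run `knn_cleanse` on the self-supervised embeddings of the training set and then train the classifier on the kept and relabeled samples; the energy-based detector mentioned above would use a different score in place of the majority vote.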
2. THREAT MODEL AND RELATED WORKS
Fig. 1: Overview of the data cleansing method.

Consider a clean dataset $\mathcal{D} = \{(x_i, y_i) \mid i = 1, \ldots, N\}$, where $x_i \in \mathbb{R}^{X \times H \times W}$ is the $i$-th image in the dataset, with $X$, $H$, and $W$ respectively the number of image channels, height, and width, and $y_i \in \{1, 2, \ldots, C\}$ is the corresponding class label, with the number of classes $C > 1$. Backdoor attacks
poison a dataset by: i) choosing an attack target class $t$ and then obtaining a subset (of size $M$) of images from classes other than $t$: $\mathcal{D}_s = \{(x_j, y_j) \mid j = 1, \ldots, M,\ y_j \neq t\}$, $\mathcal{D}_s \subset \mathcal{D}$, with $M \ll N$; ii) incorporating the backdoor pattern into each sample in $\mathcal{D}_s$ using the attacker's backdoor embedding function $g: \mathbb{R}^{X \times H \times W} \rightarrow \mathbb{R}^{X \times H \times W}$; iii) changing the label of each poisoned sample to the target class: $\mathcal{D}_p = \{(g(x), t) \mid x \in \mathcal{D}_s\}$; iv) finally forming the poisoned dataset by putting the attacked images back into the training set: $\bar{\mathcal{D}} = (\mathcal{D} \setminus \mathcal{D}_s) \cup \mathcal{D}_p$. If the attack is successful, the victim model $f: \mathbb{R}^{X \times H \times W} \rightarrow \{1, 2, \ldots, C\}$, when trained on the poisoned dataset, will have normal (good) classification accuracy on clean (backdoor-free) test samples, but will classify most backdoor-triggered test samples to the target class of the attack. In the image domain, backdoor patterns could, e.g., be: i) a small patch that replaces the original pixels of an image [1, 2, 3]; ii) a perturbation added to some pixels of an image [4, 5, 6]; or iii) a "blended" patch attack [4].
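As a concrete illustration of this threat model, the following is a toy sketch of the additive-perturbation variant, assuming images are float arrays in $[0, 1]$; the pattern, its magnitude, and the target class are placeholders chosen by the attacker, not values from the paper.

```python
# Toy sketch of backdoor poisoning with an additive perturbation, following the
# threat model above. The perturbation delta and the target class are illustrative.
import numpy as np

def embed_backdoor(x: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """g(x): additively incorporate the backdoor pattern and clip to the valid range."""
    return np.clip(x + delta, 0.0, 1.0)

def poison_dataset(images, labels, target_class, delta, num_poison, rng):
    """Embed delta into num_poison non-target samples and mislabel them to target_class."""
    images, labels = images.copy(), labels.copy()
    candidates = np.flatnonzero(labels != target_class)     # D_s is drawn from classes != t
    chosen = rng.choice(candidates, size=num_poison, replace=False)
    for j in chosen:
        images[j] = embed_backdoor(images[j], delta)         # apply g(.)
        labels[j] = target_class                             # relabel to the target class t
    return images, labels, chosen
```

A caller might, for instance, use a faint global perturbation `delta` of magnitude 2/255 and a few hundred poisoned samples; these numbers are hypothetical and only meant to convey that $M \ll N$ and the pattern is small.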
On the other hand, the defender aims to obtain a classifier with good classification accuracy on clean test samples and which correctly classifies test samples containing the backdoor pattern. Defenses deployed post-training aim to detect whether a DNN model is a backdoor victim [2, 7, 8, 9, 6, 10] and, further, to mitigate the attack if a detection is declared [2, 11, 12]. Most post-training defenses require a relatively small clean dataset (with the same distribution as the clean training set), and their performance is generally sensitive to the number of available clean samples [12, 2, 7, 10]. In this paper, alternatively, we aim to cleanse the training set prior to deep learning. Related work on training set cleansing includes [13, 14, 15, 16]. All of these methods rely on embedded feature representations of a classifier fully trained on the possibly poisoned training set ([14] suggests that an auto-encoder could be used instead). [14, 13] use a two-component clustering approach to separate backdoor-poisoned samples from clean samples ([14] uses a singular value decomposition, while [13] uses simple 2-means clustering), and [15] uses a Gaussian mixture model whose number of components is chosen based on BIC [17]. Instead of clustering, [16] employs a reverse-engineered backdoor pattern estimated using a small clean dataset. DBD [18] builds a classifier on top of an encoder learned via a self-supervised contrastive loss; the classifier is then fine-tuned. In each iteration, some samples are identified by the classifier as "low-credibility" samples and have their labels removed; the classifier is then updated on the processed dataset in a semi-supervised manner.
3. METHODOLOGY
3.1. Vulnerability of supervised training
We now illustrate the vulnerability of supervised training by analyzing a simple linear model trained on a poisoned dataset, considering the case where all classes other than the target are (poisoned) source classes. The victim classifier forms a linear discriminant function for each class $s$, i.e., the inner product $f_s(x) = x \cdot w_s$, where $w_s \in \mathbb{R}^{X \times H \times W}$ is the vector of model weights corresponding to class $s$. Assume that, after supervised training, each training sample is classified correctly with confidence at least $\tau > 0$ as measured by the margin:
$$ f_{y_i}(x_i) - \max_{c \neq y_i} f_c(x_i) \geq \tau, \quad \forall (x_i, y_i) \in \bar{\mathcal{D}}. \tag{1} $$
Assuming that the backdoor pattern $\Delta x$ is additively incorporated, given an attack sample based on a clean $x_s$ originally from source class $s \neq t$, Eq. (1) implies
$$ w_t \cdot (x_s + \Delta x) - w_s \cdot (x_s + \Delta x) \geq \tau. \tag{2} $$
If $x_s$ is also classified to $s$ with margin $\tau$, then
$$ w_s \cdot x_s - w_t \cdot x_s \geq \tau. \tag{3} $$
Adding (2) and (3) gives
$$ f_t(\Delta x) - f_s(\Delta x) = (w_t - w_s) \cdot \Delta x \geq 2\tau. \tag{4} $$
This loosely suggests that, after training on a poisoned training dataset, the model has a stronger "affinity" between the target class and the backdoor pattern (4) than between the source class and the class-discriminative features of clean source-class samples (3). This phenomenon is experimentally verified when the model is a DNN, as shown in Apdx. A.
However, these strong affinities are only made possible by the mislabeling of the backdoor-poisoned samples. Given that the perturbation $\Delta x$ is usually small, backdoor-attacked images differ minutely from the original (clean) images. Thus, if a model is trained in a self-supervised manner, without making use of the class labels, the feature representations of $x$ and $x + \Delta x$ should be quite similar (highly proximal). Hence, in the model's representation space, poisoned samples may "stand out" as outliers in that their labels may disagree with the labels of samples in close proximity to them. This is the basic idea behind the cleansing method we now describe.
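The step from (2) and (3) to (4) rests on a simple identity: the left-hand sides of (2) and (3) sum exactly to $(w_t - w_s) \cdot \Delta x$, so two margins of at least $\tau$ yield an affinity of at least $2\tau$. A toy numeric check of this identity, with arbitrary random vectors chosen purely for illustration:

```python
# Sanity check: lhs(2) + lhs(3) == (w_t - w_s) . dx, so margins of tau in (2)
# and (3) imply the affinity bound (4). All vectors here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
d = 3 * 32 * 32                                    # flattened image dimension (illustrative)
w_t, w_s, x_s, dx = (rng.standard_normal(d) for _ in range(4))

lhs2 = w_t @ (x_s + dx) - w_s @ (x_s + dx)         # left-hand side of (2)
lhs3 = w_s @ x_s - w_t @ x_s                       # left-hand side of (3)
affinity = (w_t - w_s) @ dx                        # quantity bounded in (4)

assert np.isclose(lhs2 + lhs3, affinity)           # the identity behind "adding (2) and (3)"
```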
3.2. Self-supervised contrastive learning
SimCLR [19, 20] is a self-supervised training method that learns a feature representation for images based on their semantic content. In SimCLR, in each mini-batch, $K$ samples are randomly selected from the training dataset, and each selected sample $x_k$ is augmented to form two versions, resulting in $2K$ augmented samples. The augmented samples are then fed into the feature representation model, which is an encoder $E(\cdot)$ followed by a linear projector $L(\cdot)$, with the feature vector $z$ extracted from the last layer: $z = L(E(x))$. For simplicity we will refer to $L(E(\cdot))$ as the "encoder" from here on. The encoder is trained to minimize the SimCLR contrastive objective, the normalized temperature-scaled cross-entropy (NT-Xent) loss [19].
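For reference, a minimal sketch of the NT-Xent loss as it is commonly implemented; the temperature value and the batch layout (views $2k$ and $2k{+}1$ being the two augmentations of the same image) are illustrative assumptions, and the exact formulation is given in [19].

```python
# Minimal sketch of the NT-Xent (SimCLR) contrastive loss. Rows 2k and 2k+1 of z
# are assumed to be the two augmented views of the same image; the temperature
# is an illustrative default, not the paper's setting.
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z: (2K, d) projected features; views (2k, 2k+1) are positive pairs."""
    z = F.normalize(z, dim=1)                      # cosine similarity via dot products
    sim = z @ z.t() / temperature                  # (2K, 2K) similarity matrix
    n = z.shape[0]
    mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))          # exclude self-similarity from the softmax
    pos_index = torch.arange(n, device=z.device) ^ 1   # positive partner: 0<->1, 2<->3, ...
    return F.cross_entropy(sim, pos_index)         # averaged over all 2K anchors
```

In training, this loss would be computed on $z = L(E(x))$ for the $2K$ augmented views in each mini-batch, and only the encoder/projector weights are updated; no class labels enter the objective.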